singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Using ats (@) and periods (.) for suffixes in Pre-Cleaner, MST-Parser, Grammar Learner and Link Grammar #191

Open akolonin opened 5 years ago

akolonin commented 5 years ago

A few problems:

  1. During iterative grammar learning, tagging words in the input corpus and input parses may become ambiguous if words containing ats (@) in the parses and corpus are translated to words with Link Grammar (LG) suffixes that use a period (.).
  2. Emails with inner ats (@) are "corrupted" by the Grammar Learner (GL), which changes the ats to periods (.).
  3. There is some problem (TO BE EXPLAINED WITH DATA REFERENCES by @alexei-gl) with up.'and and up@'and in the Grammar Tester (GT).
  4. When the input corpus/parses contain words with ats, GT does not recognise them (because they are stored with periods), which decreases the F1 metric.

@glicerico, do you think that using the period (.) in the WSD process, and not re-coding periods to ats in GL, could eliminate all of these problems without causing other problems in MST-Parsing?
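The ambiguity behind items 1, 2 and 4 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the recoding is a plain character replacement of `@` with `.`; the function names `at_to_period` and `period_to_at` and the sample tokens are hypothetical and are not taken from the Grammar Learner code.

```python
def at_to_period(token: str) -> str:
    """Hypothetical recoding: replace every at (@) with a period (.)."""
    return token.replace("@", ".")


def period_to_at(token: str) -> str:
    """Hypothetical inverse: replace every period (.) with an at (@)."""
    return token.replace(".", "@")


# Made-up sample tokens: a tagged word in both notations and an email address.
tokens = ["saw@a", "saw.a", "john@example.com"]

recoded = [at_to_period(t) for t in tokens]

# Distinct inputs that collapse to the same string cannot be told apart later.
collisions = len(tokens) - len(set(recoded))

# Converting back does not restore the original email address.
round_trip = period_to_at(at_to_period("john@example.com"))

print(recoded)
print(collisions)
print(round_trip)
```

Under this assumption the mapping is lossy in both directions, which is consistent with the corrupted emails and the unrecognised words described above.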

alexei-gl commented 5 years ago

The sample for item 3 is located at http://langlearn.singularitynet.io/data/aglushchenko_parses/suffix-problem/ . The above-mentioned token is easy to find in a rule in the dictionary file.

akolonin commented 5 years ago

Looks like the problem with up.'and and up@'and is not the

    akolonin@Ubuntu-1604-xenial-64-minimal:/home/aglushchenko/data/parses/suffix-problem$ grep -P ".\'" dict_20C_2019-01-28_0006.4.0.dict | wc -l
    1
    akolonin@Ubuntu-1604-xenial-64-minimal:/home/aglushchenko/data/parses/suffix-problem$ grep -P "up.\'and" dict_20C_2019-01-28_0006.4.0.dict | wc -l
    1
    grep -P "up.\'and" test-corpus-06.txt.raw
    (dove)(,)(and)(flew)(up.'and)(into)(the)(air)(.)]
    grep -P ".\'" test-corpus-06.txt.raw
    (dove)(,)(and)(flew)(up.'and)(into)(the)(air)(.)]
    grep -P ".\w" test-corpus-06.txt.raw | grep -v Found | grep -v Link
    (the)(man)(at)(the)(other)(end)(of)(them)(..y)]
    (as)(her)(..y)]
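One detail worth keeping in mind when reading these checks: the period in the patterns is unescaped, so it matches any single character, including an at (@). The sketch below reproduces that behaviour with Python's `re` module as an analogue of `grep -P`; it is not the original command, and the two candidate strings are simply the two spellings discussed in this thread.

```python
import re

# Same pattern text as in the grep command: the period is unescaped.
loose = re.compile(r"up.\'and")
# Variant with the period escaped, so only a literal period matches.
strict = re.compile(r"up\.\'and")

candidates = ["up.'and", "up@'and"]

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)
print(strict_hits)
```

With the unescaped pattern both spellings match, so a count of 1 alone does not say which spelling is stored in the file; escaping the period would distinguish them.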

@glicerico - in the MST-parsed version that you are crafting now, can we have the MST-Parser configured so that it does not break words with an inner period?

glicerico commented 5 years ago

@akolonin, the new tokenizer-less version of the observer and MST-parser splits only on spaces, so this should not be a problem.
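As a rough illustration of why splitting only on spaces avoids the problem, the sketch below contrasts whitespace splitting with a punctuation-aware split. It is not the observer or MST-parser code; the regular expression is a hypothetical stand-in for a tokenizer that breaks on punctuation, and the input line is the pre-tokenized sentence from the test corpus quoted earlier in the thread.

```python
import re

# Sentence from the test corpus above, with tokens separated by spaces.
line = "dove , and flew up.'and into the air ."

# Splitting on whitespace only keeps a token with an inner period intact.
space_tokens = line.split()

# Hypothetical punctuation-aware tokenizer, for contrast: it separates
# runs of word characters from each individual punctuation mark.
punct_tokens = re.findall(r"\w+|[^\w\s]", line)

print(space_tokens)
print(punct_tokens)
```

The whitespace split returns up.'and as a single token, while the punctuation-aware split breaks it into four pieces.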