Grammar tester reports: "Number of sentences in corpus and reference files mismatch"

singnet / language-learning

OpenCog Unsupervised Language Learning

https://wiki.opencog.org/w/Language_learning

MIT License

Issue #161

Open alexei-gl opened 5 years ago

alexei-gl commented 5 years ago

Grammar tester reports "Number of sentences in corpus and reference files missmatch" when the dictionary was generated for the wrong Link Grammar version. This happens when link-parser is unable to find words in the supplied dictionary and the UNKNOWN-WORD rule in the dictionary is written with or without '<>' (which form is expected depends on the LG version). The rule then has no effect, and link-parser emits non-fatal errors. Some sentences are left unparsed because of the unknown words, so the number of parsed sentences does not match the number of sentences in the reference parses.
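For illustration, the count comparison described above might look roughly like the sketch below. It assumes that both the parse output and the reference file consist of blocks separated by blank lines, one block per sentence. The names `split_blocks`, `check_sentence_counts` and `SentenceCountMismatch` are hypothetical and are not taken from the project code (the project's own error is LGParseError).

```python
from typing import List


class SentenceCountMismatch(Exception):
    """Hypothetical error type used only in this sketch."""


def split_blocks(text: str) -> List[str]:
    """Split text into blocks separated by one or more blank lines.

    Assumption: each block holds one sentence followed by its links.
    """
    blocks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def check_sentence_counts(parsed_text: str, reference_text: str) -> int:
    """Return the shared sentence count, or raise if the counts differ."""
    parsed = len(split_blocks(parsed_text))
    reference = len(split_blocks(reference_text))
    if parsed != reference:
        raise SentenceCountMismatch(f"{reference} != {parsed}")
    return parsed
```

A sentence that link-parser leaves unparsed produces no block in the output, which is enough to make the two counts differ by one.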

alexei-gl commented 5 years ago

Dictionary version check added to the LG-based parser. If the LG version and the version of the dictionary used for parsing do not match, an exception is raised. The LG version check is also restored in the grammar learner code.
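A minimal sketch of such a check, not the code from the PR. It assumes the dictionary file carries a version entry of the form `<dictionary-version-number>: V5v5v0+;` and that the installed LG version is available as a dotted string such as "5.5.0". The function names and the exception type are hypothetical, and comparing all three version components is an assumption as well; the actual check may be looser.

```python
import re
from typing import Tuple

# Assumed form of the version entry in a Link Grammar dictionary file,
# e.g. "<dictionary-version-number>: V5v5v0+;"
_VERSION_RE = re.compile(
    r"<dictionary-version-number>\s*:\s*V(\d+)v(\d+)v(\d+)\+\s*;"
)


class DictionaryVersionError(Exception):
    """Hypothetical error type used only in this sketch."""


def dictionary_version(dict_text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from the dictionary text."""
    match = _VERSION_RE.search(dict_text)
    if match is None:
        raise DictionaryVersionError("no version entry found in dictionary")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def check_versions(dict_text: str, lg_version: str) -> None:
    """Raise if the dictionary version differs from the LG version.

    Assumption: lg_version is a dotted string such as "5.5.0".
    """
    dict_ver = dictionary_version(dict_text)
    lg_ver = tuple(int(part) for part in lg_version.split(".")[:3])
    if dict_ver != lg_ver:
        raise DictionaryVersionError(
            f"dictionary version {dict_ver} does not match LG version {lg_ver}"
        )
```

Failing early here avoids the later, less informative sentence count error.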

OlegBaskov commented 5 years ago

Still happens in a situation where dictionary version corruption is unlikely. In a sequence of 5 tests with the same settings, 4 tests pass while the 5th crashes. The corpus is extracted from the reference file in all 5 tests.

Static html copy of the Jupyter notebook -- GCB-LG-E-clean-ALE-MWC=1-MSL=10-2019-02-17_LGParseError.html, error "LGParseError: Number of sentences in corpus and reference files missmatch. Reference file '/home/obaskov/94/language-learning/data/GCB/LG-E-clean/GCB-LG-English-clean.ull' does not match its corpus counterpart 104341 != 104340" in cell 15.

The faulty grammar directory -- GCB-LG-E-clean-ALE-MWC=1-MSL=10-2019-02-17_LGParseError/GCB_LG-E-clean_cALWEd_no-gen_20c/
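When the counts differ by one, as in 104341 != 104340, it may help to find the first position where the two sentence sequences diverge. A rough diagnostic sketch, assuming that each blank-line separated block starts with the sentence text and that the sentences are expected to match exactly; `sentence_lines` and `first_divergence` are hypothetical helper names, not project functions.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple


def sentence_lines(text: str) -> List[str]:
    """Return the first line of every blank-line separated block."""
    result: List[str] = []
    at_block_start = True
    for line in text.splitlines():
        if not line.strip():
            at_block_start = True
        elif at_block_start:
            result.append(line.strip())
            at_block_start = False
    return result


def first_divergence(
    corpus: List[str], reference: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, corpus_sentence, reference_sentence) at the first
    position where the sequences differ, or None if they are identical.
    A missing sentence on either side is reported as None.
    """
    for index, (c, r) in enumerate(zip_longest(corpus, reference)):
        if c != r:
            return index, c, r
    return None
```

The returned index points at the sentence that was dropped or added, which narrows down what to inspect in the parser output.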

OlegBaskov commented 5 years ago

Another sample -- GCB-LG-E-clean-MWC=D1-MSL=5-2019-02-17_LGParseError.ipynb, the same cell 15, with the 20-cluster test.

Static html copy of the notebook -- GCB-LG-E-clean-ALE-MWC=1-MSL=5-2019-02-17_LGParseError.html, link to data in the 1st cell of the notebook.

alexei-gl commented 5 years ago

@OlegBaskov I made a new PR with fixes. Please update your code base by running git pull.

akolonin commented 5 years ago

@alexei-gl - please test the fix with the configuration Oleg reported and close the issue if it works

akolonin commented 5 years ago

@alexei-gl - can we close this now?