singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Fix pre-cleaner to avoid leaving blank lines for skipped sentences on MSL filter #238

Closed (akolonin closed this 5 years ago)

akolonin commented 5 years ago

After the fix, the data in http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/MSL5-25-2019JUN19/ needs to be regenerated.

Examples of files with blank lines:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/MSL5-25-2019JUN19/cleaned-MSL5-2019JUN19/11-0.txt
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/MSL5-25-2019JUN19/cleaned-MSL10-2019JUN19/11-0.txt
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/MSL5-25-2019JUN19/cleaned-MSL10-2019JUN19/12-0.txt

glicerico commented 5 years ago

The issue is caused by lines that contain only characters that the pre-cleaner removes, e.g. " ". Such a line is empty after cleaning but is still written to the output.
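
For illustration only, here is a minimal sketch of the kind of guard that avoids this. It is not the actual pre-cleaner code: `STRIP_PATTERN`, `clean_line`, `clean_lines`, and `max_len` are hypothetical names, the set of stripped characters is invented, and it assumes MSL means a maximum sentence length counted in whitespace-separated tokens.

```python
import re

# Hypothetical set of characters to strip; the real pre-cleaner's rules
# are not shown in this thread.
STRIP_PATTERN = re.compile(r"[\"'_*#]")


def clean_line(line: str) -> str:
    """Replace unwanted characters with spaces and collapse whitespace."""
    cleaned = STRIP_PATTERN.sub(" ", line)
    return " ".join(cleaned.split())


def clean_lines(lines, max_len=25):
    """Yield cleaned sentences, skipping rejected ones entirely.

    A sentence is skipped (nothing is emitted, not even a blank line) when
    it is empty after cleaning or has more than max_len tokens.
    """
    for line in lines:
        cleaned = clean_line(line)
        if not cleaned:
            # Line contained only removable characters: emit nothing.
            continue
        if len(cleaned.split()) > max_len:
            # Sentence exceeds the assumed MSL limit: emit nothing.
            continue
        yield cleaned
```

The point of the sketch is that the emptiness check runs after cleaning, so a line such as `" "` is dropped instead of being written out as a blank line.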

glicerico commented 5 years ago

Fixed in https://github.com/singnet/language-learning/pull/239

glicerico commented 5 years ago

Data regenerated and uploaded to http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/