singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Pre-cleaner issues #187

Closed akolonin closed 5 years ago

akolonin commented 5 years ago

Fix multiple pre-cleaner issues, based on: http://langlearn.singularitynet.io/data/akolonin_studies/gutenberg_children_cleaned_wordcounts.txt

Specific problems:

  1. @alexei-gl : the token "up.'and" appears in a parse instead of "up", which was the right word in the original corpus sentence. I ran through the grammar rules in the 4.0.dict file induced by the grammar learner and found the token "up.'and" in one of the grammar rules. Apparently Link Grammar interprets it as the token "up" with the suffix "'and", which leads to parsing errors when link-parser tests the grammar, because "up.'and" and "up" end up in different grammar rules (clusters). The same goes for "..y", which link-parser interprets as the token "." with the suffix "y".
  2. Many situations where unparsed periods (dots) are left attached to the preceding word, for example the word-count entries "assume. 1" and "associates. 1".
glicerico commented 5 years ago

@akolonin The aforementioned examples are not problems with the pre-cleaner:
1a) "up.'and" is actually a typo in the corpus: a space is missing in the original sentence, and the pre-cleaner doesn't split dots and apostrophes inside words.
1b) "..y" (and its variants with more dots) actually comes from a bug in our link-parser wrapper, which I remember @alexei-gl was aware of. The wrapper failed to remove the LG tag for a single dot.
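
For reference, the failure mode and the intended fix are roughly this (a minimal sketch with a hypothetical strip_lg_subscript helper; this is not the actual wrapper code):

    import re

    # Link Grammar appends a "subscript" to words it recognizes, e.g. "spoke.v-d"
    # or "behind.p". Naively stripping everything after a dot breaks on the
    # sentence-final dot itself, whose tagged form looks like "..y"; if the tag
    # is not removed, the raw "..y" leaks into the parses as described above.
    def strip_lg_subscript(token):
        if token.startswith('.'):   # the bare dot token, possibly tagged as "..y"
            return '.'
        return re.sub(r'\.[a-z][-a-z0-9]*$', '', token)

    assert strip_lg_subscript('spoke.v-d') == 'spoke'
    assert strip_lg_subscript('behind.p') == 'behind'
    assert strip_lg_subscript('..y') == '.'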

glicerico commented 5 years ago

For 2), I see this word count comes from June 2018. Now that we're using the LG tokenization, these counts should not appear anymore.

akolonin commented 5 years ago

All kinds of situations can occur in any kind of corpus and any language other than English, and the pre-cleaner should be configurable and stable enough to handle all of that. Using LG-English tokenization cannot be considered a solution for an unsupervised learning project even for English, let alone the other 6,500 languages on Earth.

akolonin commented 5 years ago

It is clearly fine to use LG-English tokenization for the current iteration as a temporary hack, but we will have to go back to the pre-cleaner and/or unsupervised tokenization at some point.

glicerico commented 5 years ago

@akolonin , I agree with you that unsupervised pre-cleaning and tokenization would be needed in the long term. However, none of the problems described in this issue should be handled by the pre-cleaner. What actions would be needed to close this issue?

akolonin commented 5 years ago

1) "up.'and" - let's ignore this issue for now

2) "associates." - fix pre-cleaner so it never keeps period in the end of sentence attached to word - the ending period past word should be treated as separate token.

3) it should be possible to configure the pre-cleaner to drop periods as skip-words (but that option should be off by default).

P.S. Keep in mind that a period attached to the last word of the sentence and a separate period at the end of the sentence may be treated differently by the Link Parser, depending on the rules in the affix file.
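
A rough sketch of the requested behaviour (illustrative only; the function name and option below are made up, and this is not the actual pre_cleaner.py code):

    import re

    # Detach a sentence-final period from the preceding word, so that
    # "associates." becomes "associates ." and the period survives as its own
    # token. The drop_period flag mimics the optional "skip-word" handling
    # requested in point 3) and is off by default.
    def detach_final_period(sentence, drop_period=False):
        replacement = r'\1' if drop_period else r' .\1'
        return re.sub(r'(?<=\w)\.(\s*)$', replacement, sentence)

    print(detach_final_period('he got behind alice as he spoke.'))
    # he got behind alice as he spoke .
    print(detach_final_period('he got behind alice as he spoke.', drop_period=True))
    # he got behind alice as he spoke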

glicerico commented 5 years ago

@akolonin ,
1) Agree, the pre-cleaner cannot handle all possible scenarios. A possible (and complex) solution to this is the dynamic tokenizer that Linas has been talking about in Link Grammar.
2) The pre-cleaner was never meant to be a tokenizer. If you look at a pre-cleaned file (e.g. http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower/11-0.txt) you'll see that all sentences that had a final dot keep it in place. The tokenizer, on the other hand, is in charge of splitting that final dot; if you look at the tokenized files (e.g. http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_tokenized_LG5.5.1/11-0.txt.ull) you see the final dot is separated. The file that you refer to in your original point 2) of this issue is a cleaned file, not a tokenized file, and that's why your word counts include things like "associates."
3) That option already exists. The pre-cleaner can eliminate single characters if asked to: https://github.com/singnet/language-learning/blob/92ca520a97d1f0c48bf644eb24b062307ae55e96/src/pre_cleaner/pre_cleaner.py#L23
P.S.) I have that in mind; that's why we decided that we DON'T want to use the LG tokenizer in our pipeline, and I removed its use from our pipeline, deferring the tokenizing step to some other algorithm (in practice we're now using LG-English, but this should be something unsupervised in the future, as you say).

akolonin commented 5 years ago

@glicerico re: 2) I see, but all these tokens with literals ending in a period make no sense. Can we just have periods separated from literals as we do for any other punctuation (commas, question and exclamation marks)? It looks like link-parser should be happy with that:

linkparser> he got behind alice as he spoke.
Found 4 linkages (4 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.04 LEN=10)

    +--------------------------Xp--------------------------+
    +---->WV--->+--------Pa--------+       +----CV-->+     |
    +->Wd--+-Ss-+---MVp--+         +--MVs--+-Cs+--Ss-+     |
    |      |    |        |         |       |   |     |     |
LEFT-WALL he got.v-d behind.p alice[?].a as.e he spoke.v-d .

Press RETURN for the next linkage.
linkparser> he got behind alice as he spoke .
Found 4 linkages (4 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.04 LEN=10)

    +--------------------------Xp--------------------------+
    +---->WV--->+--------Pa--------+       +----CV-->+     |
    +->Wd--+-Ss-+---MVp--+         +--MVs--+-Cs+--Ss-+     |
    |      |    |        |         |       |   |     |     |
LEFT-WALL he got.v-d behind.p alice[?].a as.e he spoke.v-d .

Press RETURN for the next linkage.
akolonin commented 5 years ago

It does not make sense to create two words from one just because the word happened to end up at the very end of the sentence:

    akolonin@Ubuntu-1604-xenial-64-minimal:~/public/akolonin_studies$ grep ^it gutenberg_children_cleaned_wordcounts.txt
    it 30863
    it. 4433
    ...
    italy 5
    italy. 5

akolonin commented 5 years ago

@glicerico if you fix that, please re-generate the GC "cleaned" corpus and set it aside for future work

glicerico commented 5 years ago

@akolonin , I'll fix this. But just for the record:

It does not make sense to create two words from one just because the word happened to end up at the very end of the sentence:

there is no creation of two words other than in that wordcount file, because once the tokenizer processes the text, the dot gets separated... the "two words" don't make it to the word space vector.

akolonin commented 5 years ago

@glicerico

because once the tokenizer processes the text

This happens in LG-Parser and OpenCog MI-counter but not in DNN-MI-lker, right?

akolonin commented 5 years ago

@glicerico - I have found this in the spec: https://docs.google.com/document/d/1xgO6CREtXzYG8k1_8-ixUNUEEnmOTsYX1tWp7gEscc0/edit#

9. Sentence breakers: “.”, “!”, “?”, and double brackets used for dialogues. Breaking symbols adjacent (heading or trailing) to tokens are used to create individual tokens on each appearance of the symbol.

10. Punctuation symbols treated as individual tokens on each appearance of the symbol: all brackets, parentheses and braces, comma, colon, semicolon, both slashes, currency signs, at, hash, ampersand, math and logic symbols.

So it looks like all of this has been anticipated, and all we need is to have it implemented and the period placed in the respective configuration list. Having unit tests on the spec won't hurt either, if it seems reasonable and there is time (a sketch of such a test is below). Also, I suggest you give @alexei-gl and me training on the pre-cleaner before you are gone.
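
A minimal sketch of such a unit test, assuming a hypothetical split_punctuation helper (the real pre-cleaner API may differ):

    import re

    # Put spaces around the sentence breakers and punctuation symbols listed in
    # spec items 9-10, then split, so each symbol becomes an individual token.
    def split_punctuation(sentence):
        spaced = re.sub(r'([.!?,;:()\[\]{}"#&@$%/\\+=<>*])', r' \1 ', sentence)
        return spaced.split()

    def test_final_period_is_a_separate_token():
        assert split_punctuation('he spoke.') == ['he', 'spoke', '.']

    def test_other_punctuation_is_split_too():
        assert split_punctuation('well, did he speak?') == ['well', ',', 'did', 'he', 'speak', '?']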

glicerico commented 5 years ago

This happens in LG-Parser and OpenCog MI-counter but not in DNN-MI-lker, right?

It also happens in the DNN version, because we feed it pre-tokenized text.

glicerico commented 5 years ago

@akolonin , I worked on the pre-cleaner update and unit tests. The default is now to separate dots from the end of words, as requested. Changes are in PR https://github.com/singnet/language-learning/pull/230
Also, a newly cleaned corpus, including some other fixes done in the pre-cleaner, can be found on the ULL server in /home/asuarez/CORPORA/English/Children/Gutenberg_Children_Books/cleaned_18-06-2019
I will do a couple more minor fixes to the pre-cleaner and then show you and @alexei-gl how to use it.

akolonin commented 5 years ago

@glicerico - can you rename pg26041.txt_headless_split_default to pg26041.txt so we have *.txt files?

glicerico commented 5 years ago

No problem, so we want no description of the pre-processing in the file names at all, right?

akolonin commented 5 years ago

No problem, so we want no description of the pre-processing in the file names at all, right?

I am not sure what "_headless_split_default" means, and it makes the files hard to handle by hand and, I guess, by the subsequent pipeline. What does such an extension buy us, and why can't we place the details in the readme?

Other than that, I have looked into the maximum sentence length (MSL) distribution of the sentences:

$ cat * | awk 'NF' | sort | awk '{print NF}' | sort | uniq -c | sort -k 2,2 -n
    898 1
   2078 2
   3908 3
   5033 4
   6864 5
   9032 6
  10090 7
  10982 8
  11473 9
  11484 10
  11476 11
  10904 12
  10583 13
  10375 14
   9874 15
   9360 16
   9019 17
   8667 18
   8266 19
   7878 20
   7482 21
   6984 22
   6661 23
   6403 24
   6063 25

Based on that, to enable further "curriculum learning" work, can you create one more subfolder level: A) place the current content (MSL=25) into a subfolder called msl25, and B) create subfolders for MSL = 5, 10, 15, 20 and fill them with the respective content (a rough sketch of such filtering is below)?
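
A rough sketch of how such msl subfolders could be filled, assuming one sentence per line in the cleaned files (folder names here are only illustrative):

    import glob, os

    # For each target MSL, keep only the sentences whose word count does not
    # exceed the limit, copying from the msl25 folder into a new mslN folder.
    for msl in (5, 10, 15, 20):
        outdir = 'msl%d' % msl
        os.makedirs(outdir, exist_ok=True)
        for path in glob.glob('msl25/*.txt'):
            outpath = os.path.join(outdir, os.path.basename(path))
            with open(path) as src, open(outpath, 'w') as dst:
                for line in src:
                    if 0 < len(line.split()) <= msl:
                        dst.write(line)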

glicerico commented 5 years ago

@akolonin the pre-cleaner is now modified to avoid adding suffixes to filenames. Instead, the user is responsible for naming the directories with a sufficient description. Files are in /home/asuarez/CORPORA/English/Children/Gutenberg_Children_Books/cleaned/ The suffixes were initially added to keep track of the pre-processing that was done to the original Project Gutenberg files, like removing headers and footers (e.g. the terms of use of Project Gutenberg's books) and splitting the file per sentence, as well as the parameters used in the pre-cleaner. Now I added the cleaning date to the folder names, because that's a way of knowing what the pre-cleaner's default parameters were at the time (you see, we have already changed those a few times during the project).

akolonin commented 5 years ago

@glicerico @alexei-gl - having the parameters footprint in the folder name makes much more sense than having it in every file name. However, the pipeline idea and style is to let the people and the pipeline itself handle folder naming based on parameters, not to give this right to every individual component.

glicerico commented 5 years ago

That makes sense, except that the pre-cleaner is not part of the pipeline yet. Also, I find it hard to imagine the pre-cleaner could be completely automated, since each corpus may have really different characteristics. I've had the idea that pre-cleaning should be done manually before feeding to the pipeline.

akolonin commented 5 years ago

The plan is to make it part of the pipeline, with its settings as parameters. We have a long way to go :-)

glicerico commented 5 years ago

Can we close this issue?
