singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Tokenization is different for LG English and LG ANY - which problems may be raised by this #93

Open akolonin opened 5 years ago

akolonin commented 5 years ago

Study why tokenization differs between LG English and LG ANY, which problems this may raise, and how they could be solved.

Examples from Andres, specifically regarding the apostrophe (') and dash (-) in the GC corpus:

MST-parser: they'll be here tomorrow
0 LEFT-WALL 1 they'll
1 they'll 2 be
2 be 3 here
2 be 4 tomorrow

LG-English: They'll be here tomorrow
0 LEFT-WALL 1 they
1 they 2 'll
2 'll 3 be
3 be 4 here
3 be 5 tomorrow

Because "They'll" is separated, all the following words get different positions in the sentence. Even though links like "be here" and "be tomorrow" are correct, they are considered wrong by the way we evaluate the parses.
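A minimal Python sketch (a hypothetical illustration, not the pipeline's actual scorer) of why the index shift matters: when parses are compared as sets of (index, index) links, the one extra token from splitting "They'll" shifts every later index, so links that are correct at the word level no longer match positionally.

```python
# Tokens of the two parses above (index 0 = LEFT-WALL).
mst_tokens = ["LEFT-WALL", "they'll", "be", "here", "tomorrow"]
lg_tokens = ["LEFT-WALL", "they", "'ll", "be", "here", "tomorrow"]

# The links "be-here" and "be-tomorrow" from each parse.
mst_links = {(2, 3), (2, 4)}   # be-here, be-tomorrow
lg_links = {(3, 4), (3, 5)}    # same word pairs, shifted by one

# Positional comparison finds zero agreement...
positional_matches = mst_links & lg_links

# ...although comparing the linked word pairs finds both links agree.
word_matches = {(mst_tokens[a], mst_tokens[b]) for a, b in mst_links} & \
               {(lg_tokens[a], lg_tokens[b]) for a, b in lg_links}

print(len(positional_matches), len(word_matches))  # 0 2
```

This is why the same correct dependency structure scores zero once tokenizations diverge.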

I have detected tokenization problems with contractions including 'll (she'll), 'm (I'm), 'd (he'd), and 's (it's). The case of n't (wouldn't) is handled the same way by both tokenizers: neither separates it from the word. Dashes are less common but are also tokenized differently: "time-consuming" as one token vs. "time - consuming" as three tokens.

It was time-consuming

LG any: It was time-consuming
0 LEFT-WALL 1 it
1 it 2 was
2 was 3 time
3 time 4 -
4 - 5 consuming

LG English: It was time-consuming
0 LEFT-WALL 1 it
0 LEFT-WALL 2 was
1 it 2 was
2 was 3 time-consuming
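The dash difference can be reproduced with a small sketch of MPUNC-style middle splitting (a simplified illustration, not LG's actual tokenizer): "any" lists the single dash as middle punctuation, so it splits words at it, while the 5.4.3 "en" affix file only lists "--".

```python
import re

def split_mpunc(word, mpunc):
    """Split a word at any of the given middle-punctuation strings,
    keeping the punctuation as its own token (simplified MPUNC)."""
    # Longest alternatives first, so "--" wins over "-" when both are listed.
    pattern = "(" + "|".join(re.escape(p) for p in
                             sorted(mpunc, key=len, reverse=True)) + ")"
    return [t for t in re.split(pattern, word) if t]

print(split_mpunc("time-consuming", {"--", "-"}))  # "any"-like: ['time', '-', 'consuming']
print(split_mpunc("time-consuming", {"--"}))       # "en" 5.4.3-like: ['time-consuming']
```

The capturing group in `re.split` is what keeps the dash as a token, matching the three-token output seen in the LG-any parse.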

We need to inspect three points: 1) LG ANY use in pre_cleaner; 2) LG ANY use in text_parser (the MST parser in OpenCog); 3) LG English use in

Also, we need to study how often the apostrophe and dash occur inside words in the corpora that we use.
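That frequency question could be answered with a short counting script; a sketch along these lines (the word-boundary regex and punctuation-stripping set are assumptions, not anything from the pipeline):

```python
import re
from collections import Counter

def intra_word_punct_counts(text):
    """Count words that contain an apostrophe or a dash strictly
    inside them (leading/trailing punctuation is ignored)."""
    counts = Counter()
    for word in re.findall(r"\S+", text):
        core = word.strip("'’\"-.,;:!?()[]")  # drop edge punctuation
        if re.search(r"\w['’]\w", core):
            counts["apostrophe"] += 1
        if re.search(r"\w-\w", core):
            counts["dash"] += 1
        counts["words"] += 1
    return counts

sample = "They'll say it was time-consuming, wasn't it? 'Quoted' words don't count."
print(intra_word_punct_counts(sample))
# Counter({'words': 11, 'apostrophe': 3, 'dash': 1})
```

Running this over each corpus file would give the per-corpus rates needed to judge how much the tokenization difference matters in practice.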

glicerico commented 5 years ago

@akolonin , pre_cleaner doesn't use LG to tokenize.

There is a significant difference between the LG 5.4.3 affix files for "any" and "en".

I notice that in LG 5.5.1 the dash difference is solved, as the dash is included as a token splitter in its "en" affix file (it is not included in LG 5.4.3).

Also, LG "en" affix file includes suffix and prefix handling (separating them from word):

% Suffixes
's 're 've 'd 'll 'm ’s ’re ’ve ’d ’ll ’m: SUF+;

% Prefixes % "real" English prefix: y' w/ % Y'gotta Y'gonna % coffee w/milk y' w/: PRE+;

For reference, here are the whole files:


link-grammar-5.4.3/data/any/affix-punc

")" "}" "]" ">" » 〉 ) 〕 》 】 ] 』」 """ "’’" "’" ''.y '.y [0/0]│『 「 、 ` „ ‘ “ '' ' … ... [0/0] "%" "," "." 。 ":" ";" "?" "!" ‽ ؟ ?! ….y ....y "”" │¿ ¡ "$" US$ USD C$ _ - ‐ ‑ ‒ – — ― ~ ━ ー 、 │£ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점 ¢ ₵ ™ ℠ : RPUNC+; │† †† ‡ § ¶ © ® ℗ № "#" │* • ⁂ ❧ ☞ ◊ ※ ○ 。 ゜ ✿ ☆ * ◕ ● ∇ □ ◇ @ ◎ "(" "{" "[" "<" « 〈 ( 〔 《 【 [ 『 「 """ „ “ ‘ ''.x '.x ….x ....x │ ‐ ‑ ‒ – — ― ~ – ━ ー -- - ‧ ¿ ¡ "$" │w/ - ‐ ‑ ‒ – — ― ━ ー ~ │ : LPUNC+; £ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점 │ † †† ‡ § ¶ © ® ℗ № "#": LPUNC+; │% Split words at these. │--: MPUNC+; -- — - _: MPUNC+;


link-grammar-5.4.3/data/en/4.0.affix

%
% Affixes get stripped off the left and right side of words
% i.e. spaces are inserted between the affix and the word itself.
%
% Some of the funky UTF-8 parenthesis are used in Asian texts.
% 。is an end-of-sentence marker used in Japanese texts.

% Punctuation appearing on the right-side of words.
")" "}" "]" ">" """ » 〉 ) 〕 》 】 ] 』 」 "’’" "’" ” '' ' `
"%" "," ... "." 。 ‧ ":" ";" "?" "!" ‽ ؟ ? !
_ ‐ ‑ ‒ – — ― … ━ – ー ‐ 、 ~
¢ ₵ ™ ℠ : RPUNC+;

% Punctuation appearing on the left-side of words.
% Lots of styles of open-parenthesis
% Lots of currency symbols
% Paragraph marks
% Assorted bullets and dingbats
% Dashes of various sorts
"(" "{" "[" "<" """ « 〈 ( 〔 《 【 [ 『 「 、 ` `` „ ‘ “ '' ' … ...
¿ ¡ "$" US$ USD C$
£ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점
† †† ‡ § ¶ © ® ℗ № "#"

% Split words at these.
--: MPUNC+;

% Suffixes
's 're 've 'd 'll 'm ’s ’re ’ve ’d ’ll ’m: SUF+;

% Prefixes % "real" English prefix: y' w/ % Y'gotta Y'gonna % coffee w/milk y' w/: PRE+;

% The below is a quoted list, used during tokenization. Do NOT put
% spaces in between the various quotation marks!!
""«»《》【】『』`„“”": QUOTES+;

% The below is a quoted list, used during tokenization. Do NOT put
% spaces in between the various symbols!!
"()¿¡†‡§¶©®℗№#*•⁂❧☞◊※○。゜✿☆*◕●∇□◇@◎–━ー---‧": BULLETS+;

/en/words/units.1: UNITS+;
/en/words/units.1.dot: UNITS+;
/en/words/units.3: UNITS+;
/en/words/units.4: UNITS+;
/en/words/units.4.dot: UNITS+;
/en/words/units.5: UNITS+;
%
% units.6 contains just a single, sole slash in it. This allows units
% such as mL/s to be split at the slash.
/en/words/units.6: UNITS+;
% /en/words/units.a: UNITS+;

akolonin commented 5 years ago

Temporary solution in #188

glicerico commented 5 years ago

It is natural that LG-any and LG-English tokenize differently (LG-English may have more supervision). So, since PR https://github.com/singnet/language-learning/issues/188 has made our ULL pipeline not use any tokenization, I think we have solved this issue and it should be closed (@akolonin ).

akolonin commented 5 years ago

We did not solve the problem, we have just hidden it by using the LG-English tokenizer for MST-parsing; let's keep this reminder open.

glicerico commented 5 years ago

What problem? LG-any and LG-English are supposed to tokenize differently... that's not even a problem

akolonin commented 5 years ago

Since it contains some useful diagnostics, I would suggest keeping this open until we are either able to control tokenization with an improved pre-cleaner or implement unsupervised tokenization learning.