justhalf / bpe_analysis

Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese

Discussion for the final report #5

Closed. notani closed this issue 5 years ago.

notani commented 5 years ago

Here is some feedback we got in class yesterday.

  1. Chinese and Japanese don't use whitespace, but their characters are logograms (the number of unique characters is large). What happens if we transcribe them in the Roman alphabet (few character types)? Or what about languages that don't use whitespace but use a phonographic script (maybe Thai)?
    • [Naoki] We could use Hiragana/Katakana (50+50 characters) for Japanese.
  2. Indonesian: Is Indonesian highly synthetic? (Check WALS.)
  3. Why are the vocabulary sizes that maximize F1 for Chinese and Japanese smaller than those for English and Indonesian? (Slides 11-14)
    • [Naoki] The Japanese reference segmentation is morpheme-based, and Chinese words contain only a few characters, so those reference tokens contain fewer characters than English and Indonesian reference tokens.
  4. Why does Chinese 了一 become one token?
  5. Core arguments of verbs could affect which verbs (and substrings of verbs) and following tokens are combined by BPE (Slide 28).

My thoughts:

  1. We found some potentially general patterns (e.g. [zh] 了一, [id] prefix + first char of root, [en,ja] verbs + fragments of core arguments). How can we say this is not dataset-dependent?
  2. Treating whitespace as one character doesn't seem to be a good idea; BPE generates many meaningless multi-token units like tion_to. (This is one of our non-trivial findings, though.)
justhalf commented 5 years ago

We found some potentially general patterns (e.g. [zh] 了一, [id] prefix + first char of root, [en,ja] verbs + fragments of core arguments). How can we say this is not dataset-dependent?

If we follow the BPE merge-sequence analysis, it seems bound to turn out this way, because the affixes/verbs occur much more frequently than any particular character sequence of the argument. More specifically, on any dataset, we will find affix/verb + [one character of the argument] more often than [two characters of the argument].
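To make this concrete, here is a minimal, hypothetical sketch of the pair counting BPE performs at the step where a prefix has already been merged into one symbol. The toy Indonesian-like corpus and its frequencies are invented purely for illustration and are not from our data:

```python
# Toy illustration (invented words/frequencies): why "prefix + first char of
# root" outranks "two chars inside a root" in BPE's pair counts once the
# prefix "me" is already a single symbol.
from collections import Counter

# word (as a partially merged symbol sequence) -> corpus frequency
corpus = {
    ("me", "l", "i", "h", "a", "t"): 40,       # melihat
    ("me", "l", "a", "p", "o", "r"): 25,       # melapor
    ("me", "l", "e", "m", "p", "a", "r"): 15,  # melempar
}

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency (BPE-style)."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

for pair, n in pair_counts(corpus).most_common(5):
    print(pair, n)

# ('me', 'l') gets 80 because it aggregates over every root starting with 'l',
# while any pair inside a single root only gets that root's frequency (<= 40).
# So the next merge is "mel": prefix + first character of the root.
```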

zzsfornlp commented 5 years ago

Thanks a lot!

  1. I am currently running UD-based BPE training with finer-grained vocab sizes and will update the figure later. For the Ja/Zh pattern, I think that is the nature of non-Roman writing systems; remember that Ja/Zh have a much, much larger character inventory than En/Id.

  2. For "了一", I think that is a general bigram in Chinese, and we can count and rank the char-bigrams in UD and WIKI (actually I guess this is part of what BPE does).

As for the problem of dataset-dependent patterns, I think that with WIKI the problem is maybe not that big, especially for patterns like "了一", which does not seem to be domain-specific. I guess it is fine as long as we find patterns that we (as native speakers) consider general (i.e., do we often encounter the pattern in our own speech or writing)?
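If it helps, here is a rough sketch of the bigram check I have in mind; the input path is just a placeholder for a plain-text UD or WIKI dump, and this is not part of our current scripts:

```python
# Count character bigrams in a raw Chinese text file and see where 了一 ranks.
from collections import Counter

def char_bigrams(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            chars = [c for c in line.strip() if not c.isspace()]
            counts.update(zip(chars, chars[1:]))
    return counts

counts = char_bigrams("zh_wiki.txt")  # placeholder path
ranked = counts.most_common()
rank = next((i for i, (pair, _) in enumerate(ranked, 1) if pair == ("了", "一")), None)
print("rank of 了一:", rank, "count:", counts[("了", "一")])
```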

justhalf commented 5 years ago

@zzsfornlp When you say "with WIKI the problem is not that huge", which problem are you referring to? Do you mean that in WIKI-Zh most segmentations actually make sense?

Because in Indonesian, even with WIKI, I still see prefix + first letter of root.

zzsfornlp commented 5 years ago

Ah, I mean the problem of "dataset-dependent" patterns. I guess patterns like "了一" might not be only specific to WIKI data.

zzsfornlp commented 5 years ago

Hi, I wonder when we should finish up the report? I'm not sure when the deadline is.

justhalf commented 5 years ago

Let's target to have a draft by Wednesday and finish this Friday?

zzsfornlp commented 5 years ago

Sure, the plan sounds good. What analysis do we need to further add?

notani commented 5 years ago

Let's target to have a draft by Wednesday and finish this Friday?

OK! I'll write the affix part tomorrow.

What analysis do we need to further add?

I think we already have plenty of results, but how about trying the BPE of Japanese lemma version in affix and MWE analysis (Sec. 4.3-4.4)? I remember you trained BPE on normalized (lemma) Japanese Wikipedia.

zzsfornlp commented 5 years ago

I think we already have plenty of results, but how about trying the BPE of Japanese lemma version in affix and MWE analysis (Sec. 4.3-4.4)? I remember you trained BPE on normalized (lemma) Japanese Wikipedia.

Sure, I've put the outputs at ~/data_ud2_norm/outputs/ and ~/data_wiki_norm/outputs/; the formats are similar to the previous ones. I guess the MWE analysis will be affected very little, since MWEs mainly involve nouns, but I can do similar things as in Sec 4.1.

justhalf commented 5 years ago

We might need to compress the material we already have; we are close to 8 pages now. (But I guess as a linguistics paper it is okay to be longer than a normal conference paper?) I just added the introduction and a draft of the related work.

notani commented 5 years ago

Let's wrap up our report today! Any comments on the current draft?

1. Conclusion and discussion

For me, the answer to our research question is something like:

Other interesting implications are:

And future work is:

2. Normalization of Indonesian morphemes

There was no difference in the "Caesar" sentence, but Table 5 shows normalization does increase F1 a bit.

@justhalf: Maybe we can see differences in sentences with more conjugation? I think it would be better to have at least one concrete example to show how normalization changed segmentation.

3. (very minor) Figure 1

@zzsfornlp: Can we use the same y-axis range for each language? It would make it easier to compare the figures.

notani commented 5 years ago

4. Examples regarding affixes

I will add some examples to show how affixes are tokenized in section 4.2.

zzsfornlp commented 5 years ago

Thanks a lot! I think the current draft basically looks good; let's wrap it up.

Other interesting implications are:

For BPE, I think a common instance of "occurring with markedly high frequency" is the segmentation of function words, which mostly get combined with other words, like case markers in Ja and Zh; but for humans, function words would usually be regarded as individual tokens.

(I'm not very confident, though) BPE takes care of the majority in a given data, which could produce synergy with evaluation practice in NLP where scores are aggregated over datasets.

Yes, BPE takes care of the whole dataset, and I think models trained on WIKI give more "stable" outputs. I guess we can say that it is a "stable" algorithm (although it is a greedy one)? But I'm also not sure how this is clearly related to "evaluation practice".

Are there similar tendencies in unigram-LM-based subword tokenization (Kudo, 2018)?

For this unigram-LM-based strategy, I think it is still based on the statistical frequency of segments, which is similar in principle to BPE.
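If we wanted to check this quickly, something like the following should work with the sentencepiece library (Kudo, 2018); the corpus path, vocab size, and test sentence are placeholders, and this is not part of our experiments:

```python
# Train a BPE and a unigram-LM model on the same corpus and compare their
# segmentations of a test sentence.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="zh_wiki.txt",             # placeholder corpus
        model_prefix=f"zh_{model_type}",
        vocab_size=8000,                 # placeholder vocab size
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="zh_bpe.model")
uni = spm.SentencePieceProcessor(model_file="zh_unigram.model")
sentence = "他买了一本书"                # toy sentence
print("bpe:    ", bpe.encode(sentence, out_type=str))
print("unigram:", uni.encode(sentence, out_type=str))
```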

  3. (very minor) Figure 1

Sure, figures are updated.

notani commented 5 years ago

Thanks for your thoughts!

function words

Good point. We should include this in the discussion section.

"evalutaion practice"

Sorry, that was confusing. I was thinking of evaluation on downstream tasks, not on the segmentation task.

This is what I meant: people usually evaluate NLP methods with metrics like accuracy. We can increase scores on such metrics simply by handling the majority cases rather than the minority cases (this is not always true, though). While BPE often makes stupid decisions on rare words/affixes, it doesn't hurt scores in downstream tasks because they are the minority, and BPE works well in the majority cases.
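A back-of-the-envelope illustration of this point, with made-up numbers just to show the arithmetic:

```python
# If rare tokens are a small share of the data, even badly segmenting all of
# them moves an aggregate token-level score very little. All numbers invented.
rare_share = 0.05       # assumed fraction of tokens that are rare
acc_frequent = 0.97     # assumed score on frequent tokens
acc_rare_good = 0.90    # hypothetical score with sensible rare-word handling
acc_rare_bad = 0.40     # hypothetical score when BPE segments rare words badly

for acc_rare in (acc_rare_good, acc_rare_bad):
    overall = (1 - rare_share) * acc_frequent + rare_share * acc_rare
    print(f"rare-token score {acc_rare:.2f} -> overall {overall:.3f}")

# The aggregate drops by only rare_share * (0.90 - 0.40) = 0.025 points,
# so a downstream metric barely notices the rare-word mistakes.
```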

Figure 1

Thanks for updating it. Looks better.

I also added some examples in section 4.2 on affixes.

justhalf commented 5 years ago

While BPE often makes stupid decisions on rare words/affixes, it doesn't hurt scores in downstream tasks because they are the minority, and BPE works well in the majority cases.

We see this also in the undergrad thesis on Estonian and English.

Downstream task performance is better when the segmentation is not constrained by morphology; see Table 5 in https://pdfs.semanticscholar.org/a4dc/ece000fb5f2ee514f7c625f53355d963592f.pdf. I just added a reference to this in the Related Work section as well.

notani commented 5 years ago

Thanks, Aldrian.

I wrote the conclusions and discussion. Any additions/modifications are welcome!

justhalf commented 5 years ago

I'm sorry, I'm still finishing up my part for my PGM course. I will continue reviewing soon, especially adding examples of how normalization changes Indonesian BPE segmentation.

justhalf commented 5 years ago

Final report at https://v2.overleaf.com/1478222678hvkzpymrkswb, and we are done! Thanks all for the great work (especially Zhisong for running lots of experiments and doing lots of writing), and thanks Naoki for driving the idea and the direction of the research!