justhalf / bpe_analysis

Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese

Statistics of the Corpus #1

Closed zzsfornlp closed 5 years ago

zzsfornlp commented 5 years ago

Hi, I've collected some statistics of the corpus:

[image: corpus statistics table]

Are we ready to train BPE on UD or WIKI? I think we can fix a series of vocab sizes for all languages since the numbers of types under UD tokenization are similar. (For WIKI, the number of word types is much larger because of the very long tail.)

For example, something like [5k, 10k, 20k, 30k, 50k] or [4k, 8k, 16k, 32k, 64k]?
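For reference, a minimal sketch of such a sweep, assuming the sentencepiece Python package and a placeholder corpus file (paths and sizes are illustrative, not the actual setup):

```python
# Sketch: train one BPE model per candidate vocab size with sentencepiece.
# Assumes a plain-text, one-sentence-per-line corpus (e.g. detokenized UD text);
# the file name and size list below are placeholders, not the actual setup.
import sentencepiece as spm

VOCAB_SIZES = [4000, 8000, 16000, 32000, 64000]

for size in VOCAB_SIZES:
    spm.SentencePieceTrainer.train(
        input="ud_en.txt",              # placeholder corpus path
        model_prefix=f"bpe_en_{size}",  # writes bpe_en_<size>.model / .vocab
        vocab_size=size,
        model_type="bpe",
        character_coverage=1.0,         # keep all characters (relevant for zh/ja)
    )
```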

notani commented 5 years ago

Thanks!

> I think we can fix a series of vocab sizes for all languages since the numbers of types under UD tokenization are similar.

I would still prefer to set the vocabulary size by relative values (%) rather than by absolute values, to keep the conditions the same across all languages and corpora (UD vs. WIKI). Do you think we should use absolute numbers if possible? (Perhaps because people use absolute numbers in research?)
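A small illustration of the relative-size idea, with placeholder type counts (the thread's rough figures of ~20k types for UD and ~5k for PUD) and example percentages:

```python
# Illustration: derive BPE vocab sizes as a fraction of each corpus's word-type
# count instead of fixed absolute numbers. Type counts and fractions here are
# placeholders based on the rough figures mentioned in this thread.
TYPE_COUNTS = {"ud_en": 20000, "ud_ja": 20000, "pud_en": 5000}
RELATIVE_SIZES = [0.10, 0.25, 0.50]  # fractions of the type count

for corpus, n_types in TYPE_COUNTS.items():
    for frac in RELATIVE_SIZES:
        vocab_size = int(n_types * frac)
        print(f"{corpus}: {frac:.0%} of {n_types} types -> vocab_size={vocab_size}")
```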

For Japanese UDs, did you include Modern Japanese? It's from the Corpus of Historical Japanese, which contains texts written 100 years ago. I think we should exclude it because the wording is very different from the other "really modern" corpora.

notani commented 5 years ago

Oops, I closed the issue by mistake. Reopened.

zzsfornlp commented 5 years ago

For Japanese UD, I only use GSD. In fact, I only use one UD treebank for each language: En-EWT, Id-GSD, Ja-GSD, Zh-GSD; here is the UD preparation script.

For the vocab size, I'm suggesting fixed sizes since UD and PUD have similar vocab sizes: 5k+ for PUD and 20k+ for UD. That said, for UD and PUD, setting them by relative values is also fine.

However, for the Wiki data, I think the vocab size is overly large because of the long tail (for example, symbols or foreign words), and I guess most of the probability mass is on the most frequent types, which carry the real linguistic information.

For example, here are the entries around rank 200k in the WIKI-En vocab (format: rank || word || count (percentage) || cumulative count (cumulative percentage)):

[image: vocabulary excerpt around rank 200k]

I can barely recognize them as English, and the first 200K types already cover almost 98% of the tokens. So for the WIKI data, I guess the type count might not be an accurate enough indicator.
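A sketch of how such a rank/count/coverage listing could be produced from a whitespace-tokenized dump; the corpus path is a placeholder and only every 50,000th rank is printed:

```python
# Sketch: build a "rank || word || count (%) || cumulative count (%)" listing
# from a whitespace-tokenized corpus, to see how much of the token mass the
# top-K types cover. The corpus path below is a placeholder.
from collections import Counter

counts = Counter()
with open("wiki_en.detok.cut.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
cumulative = 0
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    cumulative += count
    if rank % 50000 == 0:  # print only a sample of ranks
        print(f"{rank} || {word} || {count} ({count / total:.4%}) "
              f"|| {cumulative} ({cumulative / total:.2%})")
```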

notani commented 5 years ago

> For Japanese UD, I only use GSD. In fact, I only use one UD treebank for each language.

OK, got it. I think we can combine GSD and PUD because the zh and id GSD treebanks are still small. (Was there any overlap between GSD and PUD?)

> However, for the Wiki data, I think the vocab size is overly large

Makes sense. I agree with using a fixed size for all languages for the Wiki data. (Or we could consider only words occurring more than once and use those counts as reference vocab sizes.)
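A minimal sketch of that reference-size idea, assuming a whitespace-tokenized dump at a placeholder path:

```python
# Sketch: use the number of word types occurring more than once as a
# reference vocab size for a Wiki corpus. The corpus path is a placeholder.
from collections import Counter

counts = Counter()
with open("wiki_en.detok.cut.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        counts.update(line.split())

reference_vocab_size = sum(1 for c in counts.values() if c > 1)
print(f"word types with count > 1: {reference_vocab_size}")
```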

In GSD, all languages have similar numbers of word types, but the difference between ja and zh is 4k (20% of 20k). I'm not sure we can ignore this difference in the analysis, so how about using relative sizes at least for UD, as we did for PUD?

By the way, perhaps we should always use lowercase forms to reduce uninteresting duplicates (like You and you) in subword vocabularies.
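A quick way to gauge how much lowercasing would shrink the vocabulary, assuming a whitespace-tokenized file at a placeholder path:

```python
# Sketch: measure how many word types collapse under lowercasing
# (e.g. "You" and "you" becoming one type). The corpus path is a placeholder.
from collections import Counter

counts = Counter()
with open("ud_en.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        counts.update(line.split())

lowered = {w.lower() for w in counts}
print(f"original types: {len(counts)}, lowercased types: {len(lowered)}")
```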

zzsfornlp commented 5 years ago

For Ja-UD, yes, sure we can merge GSD and PUD.

Yes, I see. For UD and PUD, we can use relative vocab sizes. But for WIKI, we would again need to choose specific thresholds if we want to obtain reference vocab sizes: for example, the 200k-th type in English occurs 159 times. So how about fixed sizes for WIKI and relative sizes for the merged (PUD+UD) data?

For lowercasing, I'm worried that it could discard certain information: for example, UN in capital letters is different from the un- prefix (similarly US and us).

notani commented 5 years ago

> How about fixed sizes for WIKI and relative sizes for the merged (PUD+UD) data?

Agreed!

> For lowercasing, I'm worried that it could discard certain information: for example, UN in capital letters is different from the un- prefix (similarly US and us).

Good point. Let's keep uppercase letters and see if it works.

justhalf commented 5 years ago

I agree with the last comments. Keeping uppercase is useful for WIKI since there is much more data now, so hopefully the first letter of each word is not always segmented off.

For the fixed vocab sizes, how do we determine them?

zzsfornlp commented 5 years ago

How about heuristically determining the sizes as in other research works? For example, [4k, 8k, 16k, 32k, 64k].

By the way, updated PUD+UD:

I've also scp'd the UD-related files to the server: /home/zhisong/bpe_data/data_ud/; see the README in that folder for more information.

Training BPE on UD should still be very fast, but I'm not sure how long it will take on the much larger Wiki data. I think we need to decide on a vocab size first and try a run.

(Update) It turns out that 0.25 for En is still too large, so the down-sampling ratios are now En: 0.1, Ja: 0.5, Id: 1.0, Zh: 1.0 (~/data_wiki/wiki_${lang}.detok.cut.txt).
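A minimal sketch of such line-level down-sampling with the ratios above; input/output paths are placeholders modeled on the naming in this thread:

```python
# Sketch: down-sample each Wikipedia dump by keeping every line with the given
# per-language probability. Ratios come from the comment above; the input path
# is a placeholder, the output name mirrors wiki_${lang}.detok.cut.txt.
import random

RATIOS = {"en": 0.1, "ja": 0.5, "id": 1.0, "zh": 1.0}
random.seed(0)  # for reproducibility

for lang, ratio in RATIOS.items():
    src = f"wiki_{lang}.detok.txt"      # placeholder input
    dst = f"wiki_{lang}.detok.cut.txt"  # down-sampled output
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if random.random() < ratio:
                fout.write(line)
```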

notani commented 5 years ago

@zzsfornlp The normalized Japanese Wikipedia corpus is here: /home/naoki/data_wiki/wiki_ja.detok.norm.bz2. Can you run BPE on this corpus with the same down-sampling setting?

zzsfornlp commented 5 years ago

Sure! By the way, the server I'm using is somewhat crowded at the moment, and I can only run one BPE training at a time since it takes a lot of memory, so all the runs might take a while. I'll summarize what I have before Tuesday.