justhalf / bpe_analysis

Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese

BPE training status on WIKI-data #3

Closed zzsfornlp closed 5 years ago

zzsfornlp commented 5 years ago

Hi, I'm really sorry that the BPE training on WIKI-data is behind schedule; the server I use has been a bit crowded recently...

Currently there are still several instances remaining to train:

- Ja-Wiki: 90k
- Zh-Wiki: 10k, 30k, 60k, 90k
- Ja-Norm-Wiki: 60k, 90k

But I guess these can be finished given another 2 to 3 days.

I've uploaded what I've got to the server. Here is the structure of the files under the home dir /home/zhisong/:

- data_old/: previous data, deprecated
- data_ud2/: models trained on merged-UD (GSD/EWT+PUD) and the outputs
- data_ud2_norm/: models trained on normed merged-UD
- data_wiki/: models trained on WIKI-data (full id/zh, 10% en, 50% ja) and the outputs for both WIKI and merged-UD
- data_wiki_norm/: models trained on normed WIKI-data (same sample rate as in WIKI-data) and the outputs for both normed WIKI and normed merged-UD

I'll wait another day to collect more results before doing the Comparison-to-UD analysis. By the way, I've also added some scripts for extracting examples of Affix (here; oh, I just saw Naoki's update, I think his script is more efficient than this one) and MWE-type (here), which may be helpful. A rough sketch of the affix-extraction idea is included after this comment.
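The extraction scripts themselves are not shown in this thread, so as a rough illustration of the idea only, here is a minimal Python sketch that scans BPE-segmented output for tokens whose boundary subword unit matches a known affix list. The file name and the affix lists are hypothetical placeholders, not taken from the repo.

```python
# Minimal sketch (not the actual repo script): list BPE-segmented tokens whose
# boundary subword unit matches a known affix list. The file name and the
# affix lists below are hypothetical placeholders.
from collections import Counter

PREFIXES = {"me", "di", "ber"}   # placeholder prefix units
SUFFIXES = {"kan", "nya", "an"}  # placeholder suffix units

def affix_examples(bpe_file, sep="@@"):
    """Count (affix, token) pairs where the first/last subword unit is a listed affix."""
    hits = Counter()
    with open(bpe_file, encoding="utf-8") as f:
        for line in f:
            units = []  # subword units of the token currently being re-assembled
            for piece in line.split():
                if piece.endswith(sep):
                    units.append(piece[:-len(sep)])
                else:
                    units.append(piece)
                    # `units` now holds all pieces of one original token
                    if len(units) > 1:
                        if units[0] in PREFIXES:
                            hits[(units[0] + "-", "".join(units))] += 1
                        if units[-1] in SUFFIXES:
                            hits[("-" + units[-1], "".join(units))] += 1
                    units = []
    return hits

# e.g. affix_examples("data_wiki/outputs/id_ud2.bpe_10k.txt").most_common(20)
```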

notani commented 5 years ago

Thanks for the update!

By the way, I've also added some scripts for extracting examples of Affix

Sorry for the overlap. I've done en, ja, id, and zh (with your affix list). The results are in analysis/. I'll run my script on the Wikipedia corpus tomorrow.

MWE-type

This is nice! Thanks!

notani commented 5 years ago

data_wiki/outputs contains ja_ud2.bpe_*.txt. Is it from BPE trained on the UD corpus?

zzsfornlp commented 5 years ago

Yes, each data_*/outputs directory contains the outputs produced with the models trained on the corresponding data in data_*.

notani commented 5 years ago

So, data_wiki/outputs/ja_ud2.bpe_*.txt is actually from BPE trained on wiki, not on UD?

zzsfornlp commented 5 years ago

Yes, sorry for the confusion; here is a listing of them:

- data_wiki/outputs/ja_ud2.bpe_*.txt: the UD data segmented with the BPE model trained on wiki
- data_wiki/outputs/ja_wikicut.bpe_*.txt: the WIKI (cut) data segmented with the BPE model trained on wiki
- data_ud2/outputs/ja_ud2.bpe_*.txt: the UD data segmented with the BPE model trained on UD

(Some of the training runs are still unfinished, but most of them are done; I will give another update early this evening.)
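To make the naming concrete, below is a minimal sketch of the "apply a trained BPE model to the UD data" step, assuming the subword-nmt package. The thread does not say which BPE implementation was actually used, and the model/input file names are illustrative.

```python
# Sketch only: segment a (detokenized) text file with an already-trained BPE
# model. Assumes the subword-nmt package; whether this matches the actual
# pipeline is not confirmed in the thread, and the paths are illustrative.
import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open("data_wiki/models/ja_wiki.bpe_30k.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)  # load merge operations learned on the wiki data

with codecs.open("ja_ud2.detok.txt", encoding="utf-8") as fin, \
     codecs.open("data_wiki/outputs/ja_ud2.bpe_30k.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # process_line segments one line and preserves leading/trailing whitespace
        fout.write(bpe.process_line(line))
```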

notani commented 5 years ago

Understood. Thanks!

justhalf commented 5 years ago

Btw, the normalized Indonesian Wiki is available on the server: /home/aldrian/Documents/bpe_analysis/morphind/wiki_id.detok.cut.morphnorm

zzsfornlp commented 5 years ago

Thanks! Is this file tokenized? Do we train BPE on it directly, or detokenize it first? Btw, where can I find the normalized Id UD (GSD+PUD) to apply this norm-BPE to?

justhalf commented 5 years ago

For GSD, did you concatenate train+dev+test?

zzsfornlp commented 5 years ago

Yes, /home/zhisong/data_ud2/id_ud2.conllu is, in fact, the concatenation of GSD-train/dev/test + PUD.
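For anyone reproducing this step, one way to pull the tokenized sentences back out of such a concatenated .conllu file is to read the FORM column; a minimal sketch follows (the output file name is hypothetical, and multiword-token handling is simplified).

```python
# Sketch: extract one tokenized sentence per line from a concatenated
# CoNLL-U file (FORM column), skipping multiword-token ranges and empty nodes.
# The output file name is a hypothetical example.
def conllu_to_tokenized(conllu_path, out_path):
    with open(conllu_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        tokens = []
        for line in fin:
            line = line.rstrip("\n")
            if not line:                              # blank line = sentence boundary
                if tokens:
                    fout.write(" ".join(tokens) + "\n")
                    tokens = []
            elif line.startswith("#"):                # comment/metadata line
                continue
            else:
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:  # multiword-token range / empty node
                    continue
                tokens.append(cols[1])                # FORM column
        if tokens:
            fout.write(" ".join(tokens) + "\n")

# e.g. conllu_to_tokenized("/home/zhisong/data_ud2/id_ud2.conllu", "id_ud2.tok.txt")
```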

justhalf commented 5 years ago

The text still needs to be detokenized.

zzsfornlp commented 5 years ago

Okay, I'll detokenize them with the same Moses tool used previously and then train BPE.
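For reference, here is a minimal sketch of that detokenization step using the sacremoses port of the Moses detokenizer; the original pipeline may have used the Perl detokenizer.perl script instead, and the output file name is hypothetical.

```python
# Sketch: detokenize one-tokenized-sentence-per-line text before BPE training.
# Uses the sacremoses port of the Moses detokenizer; the actual pipeline may
# have called the Perl detokenizer.perl script. Output file name is hypothetical.
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="id")

with open("/home/aldrian/Documents/bpe_analysis/morphind/id_gsd-ud-all_pud.morphnorm",
          encoding="utf-8") as fin, \
     open("id_ud2_norm.detok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(md.detokenize(line.split()) + "\n")
```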

justhalf commented 5 years ago

The normalized GSD+PUD is here /home/aldrian/Documents/bpe_analysis/morphind/id_gsd-ud-all_pud.morphnorm

justhalf commented 5 years ago

Also, can we train a smaller vocab size for Indonesian? It seems that the units are too big with 11961 words.

zzsfornlp commented 5 years ago

Sure, I've updated with a 25% vocab size; please check: /home/zhisong/data_ud2_norm/outputs/ and /home/zhisong/data_ud2/outputs/

I'll start the WIKI-norm training for Id now.
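As a rough sketch of what the reduced-vocabulary retraining might look like, again assuming subword-nmt: the merge count and file names below are illustrative, since the thread only says "25% vocab size".

```python
# Sketch: retrain BPE with roughly a quarter of the earlier number of merge
# operations. Assumes subword-nmt; the symbol count and file names are
# illustrative, not taken from the actual configuration.
import codecs
from subword_nmt.learn_bpe import learn_bpe

NUM_SYMBOLS = 3000  # ~25% of the earlier ~12k setting (placeholder value)

with codecs.open("id_ud2_norm.detok.txt", encoding="utf-8") as fin, \
     codecs.open("id_ud2_norm.bpe_3k.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, NUM_SYMBOLS)
```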

zzsfornlp commented 5 years ago

By the way, here is the updated BPE training status (I've also uploaded the newly available models and outputs to the server):

Finished:

- *-Wiki: all except Zh-Wiki-90k
- Ja-Norm-Wiki: all

Todo:

- Zh-Wiki: 90k
- Id-Norm-Wiki: 10k, 30k, 60k, 90k

I think another 2 days will be enough.

zzsfornlp commented 5 years ago

(Updated):

All training finished:

- ~/data_ud2/{models,outputs}: BPE models trained on UD and BPE outputs for UD with these models
- ~/data_wiki/{models,outputs}: BPE models trained on WIKI (cut) and BPE outputs for UD/WIKI with these models
- ~/data_ud2_norm/{models,outputs}: BPE models trained on normed-UD (ja/id) and BPE outputs for normed-UD with these models
- ~/data_wiki_norm/{models,outputs}: BPE models trained on normed-WIKI (ja/id) and BPE outputs for normed-UD/normed-WIKI (ja/id) with these models

justhalf commented 5 years ago

Thanks Zhisong! Was it lowercased during training?

zzsfornlp commented 5 years ago

No, there was no lowercasing for any of the training.
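One quick way to double-check this from the model files themselves is to look for uppercase characters in the learned merge table; a small sketch follows, with a hypothetical codes file path.

```python
# Sketch: confirm a BPE model was trained on cased text by checking whether
# its merge table contains any uppercase characters. The codes file path is
# a hypothetical example.
def has_uppercase_merges(codes_path):
    with open(codes_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):  # skip a version header line, if present
                continue
            if any(ch.isupper() for ch in line):
                return True
    return False

# e.g. has_uppercase_merges("data_wiki/models/en_wiki.bpe_30k.codes")  # True if not lowercased
```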