Thanks for the update!
By the way, I've also added some scripts for extracting examples of Affix and MWE-type.
Sorry for the overlap. I've done en, ja, id, and zh (w/ your affix list); the results are in analysis/.
I'll run my script on the Wikipedia corpus tomorrow.
This is nice! Thanks!
data_wiki/outputs contains ja_ud2.bpe_*.txt. Is it from BPE trained on the UD corpus?
Yes, the data_*/outputs directories contain the outputs produced with the model trained on the corresponding data in data_*.
So, data_wiki/outputs/ja_ud2.bpe_*.txt is actually from BPE trained on wiki, not on UD?
Yes, sorry for the confusion; here is the listing (a small sketch of this naming convention follows below):
data_wiki/outputs/ja_ud2.bpe_*.txt: the UD data segmented with the BPE model trained on wiki,
data_wiki/outputs/ja_wikicut.bpe_*.txt: the WIKI (cut) data segmented with the BPE model trained on wiki,
data_ud2/outputs/ja_ud2.bpe_*.txt: the UD data segmented with the BPE model trained on UD.
(Some of the training is still unfinished, but most of it is done; I will give another update early this evening.)
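As an aside, here is a minimal sketch of how that naming convention could be decoded programmatically. This is an illustrative helper, not a script in the repo, and it assumes the `*` stands for a vocab-size tag (e.g. 30k):

```python
# Hypothetical helper: decode which data was segmented and which corpus the BPE
# model was trained on, from the path convention listed above.
# Assumes names like data_<train>/outputs/<lang>_<data>.bpe_<vocab>.txt,
# where <vocab> fills in the "*" (e.g. a size tag such as 30k).
import re
from pathlib import Path

PATTERN = re.compile(r"(?P<lang>[a-z]+)_(?P<data>\w+)\.bpe_(?P<vocab>\w+)\.txt")

def describe_output(path):
    p = Path(path)
    train_corpus = p.parent.parent.name.replace("data_", "")  # e.g. "wiki", "ud2"
    m = PATTERN.match(p.name)
    if m is None:
        raise ValueError(f"unexpected file name: {p.name}")
    return {
        "segmented_data": m.group("data"),  # which data was segmented (ud2, wikicut, ...)
        "bpe_trained_on": train_corpus,     # which corpus the BPE model was trained on
        "language": m.group("lang"),
        "vocab": m.group("vocab"),
    }

# e.g. describe_output("data_wiki/outputs/ja_ud2.bpe_30k.txt")
# -> the UD data segmented with a BPE model trained on wiki, as listed above.
```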
I understood. Thanks!
Btw, the normalized Indonesian Wiki is available on the server: /home/aldrian/Documents/bpe_analysis/morphind/wiki_id.detok.cut.morphnorm
Thanks! Is this file tokenized? Do we train BPE on it directly, or detokenize it first? Btw, where can I find the normalized Id UD (GSD+PUD) to apply this norm-BPE to?
For GSD, did you concatenate train+dev+test?
Yes, /home/zhisong/data_ud2/id_ud2.conllu; in fact, this is the concatenation of GSD-train/dev/test + PUD.
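For reference, here is a minimal sketch of how such a concatenation plus text extraction could look. This is an assumed workflow with placeholder file names, not the exact script used:

```python
# Minimal sketch (assumed workflow, not the exact script used): concatenate
# CoNLL-U files and pull out the token forms (column 2) as space-separated
# text, which is what still needs detokenization afterwards.

def conllu_to_tokenized_text(paths, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            sent = []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if not line:                 # blank line = sentence boundary
                        if sent:
                            out.write(" ".join(sent) + "\n")
                            sent = []
                    elif not line.startswith("#"):
                        cols = line.split("\t")
                        if cols[0].isdigit():    # skip multiword/empty-node lines
                            sent.append(cols[1])  # FORM column
            if sent:                             # flush last sentence if file has no trailing blank line
                out.write(" ".join(sent) + "\n")

# e.g. (placeholder input names):
# conllu_to_tokenized_text(
#     ["id_gsd-ud-train.conllu", "id_gsd-ud-dev.conllu",
#      "id_gsd-ud-test.conllu", "id_pud-ud-test.conllu"],
#     "id_ud2.tok.txt")
```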
The text still needs to be detokenized.
Okay, I'll detokenize them with the same previously-used Moses tool and then train BPE.
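A rough sketch of that step, assuming the Moses detokenizer via sacremoses and BPE training via subword-nmt (the exact tools and file names may differ from what was actually used):

```python
# Rough sketch (assumed tools: sacremoses for Moses detokenization,
# subword-nmt for BPE training); all file names are placeholders.
import codecs
from sacremoses import MosesDetokenizer
from subword_nmt.learn_bpe import learn_bpe

def detokenize_file(in_path, out_path, lang="id"):
    md = MosesDetokenizer(lang=lang)
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(md.detokenize(line.split()) + "\n")

def train_bpe(text_path, codes_path, num_symbols):
    with codecs.open(text_path, encoding="utf-8") as fin, \
         codecs.open(codes_path, "w", encoding="utf-8") as fout:
        learn_bpe(fin, fout, num_symbols)

# e.g. detokenize the tokenized UD text, then learn merges for each vocab size:
# detokenize_file("id_ud2.tok.txt", "id_ud2.detok.txt")
# for n in (10000, 30000, 60000, 90000):
#     train_bpe("id_ud2.detok.txt", f"bpe_{n}.codes", n)
```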
The normalized GSD+PUD is here /home/aldrian/Documents/bpe_analysis/morphind/id_gsd-ud-all_pud.morphnorm
Also, can we train a smaller vocab size for Indonesian? The units seem too big with 11961 words.
Sure, updated with a 25% vocab size; please check:
/home/zhisong/data_ud2_norm/outputs/
/home/zhisong/data_ud2/outputs/
I'll start the WIKI-norm training for Id now.
By the way, here is the updated BPE training status (I've also updated the models and outputs that are available on the server):
Finished:
*-Wiki: all except Zh-Wiki-90k
Ja-Norm-Wiki: all
Todo:
Zh-Wiki: 90k
Id-Norm-Wiki: 10k 30k 60k 90k
I think another 2 days will be enough.
(Updated): All training finished:
~/data_ud2/{models,outputs}: BPE models trained on UD and BPE outputs for UD with these models
~/data_wiki/{models,outputs}: BPE models trained on WIKI (cut) and BPE outputs for UD/WIKI with these models
~/data_ud2_norm/{models,outputs}: BPE models trained on normed-UD (ja/id) and BPE outputs for normed-UD with these models
~/data_wiki_norm/{models,outputs}: BPE models trained on normed-WIKI (ja/id) and BPE outputs for normed-UD/normed-WIKI (ja/id) with these models
Thanks Zhisong! Was it lowercased during training?
No lowercasing was used for any of the training.
Hi, I'm really sorry that the BPE training on the WIKI data is behind schedule; the server I use has been somewhat crowded recently...
Currently there are still several instances remaining to train:
Ja-Wiki: 90k
Zh-Wiki: 10k 30k 60k 90k
Ja-Norm-Wiki: 60k 90k
But I guess these can be finished within another 2 to 3 days. I've uploaded what I've got to the server; here is the structure of the files under the home dir /home/zhisong/:
data_old/: previous data, deprecated
data_ud2/: models trained on merged-UD (GSD/EWT+PUD) and the outputs
data_ud2_norm/: models trained on normed merged-UD
data_wiki/: models trained on WIKI data (full id/zh, 10% en, 50% ja) and the outputs for both WIKI and merged-UD
data_wiki_norm/: models trained on normed WIKI data (same sample rate as in WIKI data) and the outputs for both normed WIKI and normed merged-UD
In each directory, the models dir contains the models and the outputs dir contains the outputs; the file names indicate how they were generated.
I'll wait for another day to collect more results to do the Comparison-to-UD analysis. By the way, I've also added some scripts for extracting examples of Affix (here; oh, I just saw Naoki's update, I think his script is more efficient than this one) or MWE-type (here), which may be helpful.
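Here is a rough illustration of the idea behind such an affix-extraction script (not the actual script linked above; it assumes subword-nmt style "@@ " continuation markers in the BPE output and a plain suffix list):

```python
# Rough illustration of the affix-extraction idea (not the actual script in the
# repo): given BPE output with subword-nmt style "@@ " continuation markers and
# a list of suffixes, collect words whose BPE split coincides with a suffix boundary.
from collections import Counter

def affix_examples(bpe_path, suffixes):
    hits = Counter()
    with open(bpe_path, encoding="utf-8") as f:
        for line in f:
            # rebuild (word, segmentation) pairs from "@@ "-joined subwords
            word, seg = "", []
            for piece in line.split():
                if piece.endswith("@@"):
                    word += piece[:-2]
                    seg.append(piece[:-2])
                else:
                    word += piece
                    seg.append(piece)
                    # last BPE unit of the word: check it against the suffix list
                    if len(seg) > 1 and seg[-1] in suffixes:
                        hits[(word, "|".join(seg))] += 1
                    word, seg = "", []
    return hits

# e.g. affix_examples("some_bpe_output.txt", {"ing", "ed", "ly"})
# counts words whose final BPE unit is exactly one of the listed suffixes
# ("some_bpe_output.txt" is a placeholder path).
```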