Saltychtao opened this issue 3 years ago
Sorry, I've found the scripts.
Thank you for your attention to our work!
Hello, thanks for your help! I am running the data processing commands following the instructions in the README:
./scripts/dataset/opus_koehn/setup_dataset.sh
# Train sentencepiece for each domain.
./setup_sentencepiece.sh opus_it_sp16000.outD.all translation
./setup_sentencepiece.sh opus_acquis_sp16000.inD.all translation
# Train CBoW vectors for each domain.
./train_cbow.sh opus_it_sp16000.outD.all translation
./train_cbow.sh opus_acquis_sp16000.inD.all translation
# Binarize the datasets for fairseq.
./preprocess.sh opus_it_sp16000.outD.all translation
./preprocess.sh opus_acquis_sp16000.inD.100k translation
and got an error at the last command:
Running './preprocess.sh opus_acquis_sp16000.inD.100k translation'...
Traceback (most recent call last):
File "scripts/random_pickup.py", line 52, in <module>
main(args)
File "scripts/random_pickup.py", line 5, in main
src = [l for l in open(args.src_file)]
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/koehn17six/processed/acquis/sp/v_opus_acquis_sp16000_100k/train.de'
Should I copy the corresponding file from v_opus_acquis_sp16000_all, or run ./setup_sentencepiece.sh opus_acquis_sp16000.inD.100k translation (i.e., changing all to 100k) before running this command?
Although the OPUS translation corpus [Koehn, WMT'17] is a well-known dataset for NMT domain adaptation, the authors' download page is, to my knowledge, no longer available; I used a copy I had previously obtained from other researchers. For that reason, I could not prepare a complete preprocessing script for the De-En setting in this repository, and for the time being the README only describes the En-Ja setup (adapting from JESC to ASPEC). The script should have raised an error when there was no data under "dataset/koehn17six".
I'm going to upload the De-En translation data we used and update the README and the scripts. I will leave a message here when it is done, so please wait for a while.
Thanks for your response! In fact, I have downloaded the OPUS De-En dataset from https://github.com/JunjieHu/dali, which is the repo of Hu et al. 2019, and put it in 'dataset/koehn17six', so I guess the error may come from the "preprocess.sh" script.
To make sure I understand: the sentencepiece model should be trained on the full corpus (700k sentences for acquis) for both the 100k and 700k settings, right? If so, I can just copy the missing "train.de" file from the v_opus_acquis_sp16000_all directory and then use the random_pickup.py script to select a random 100k subset.
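In case it is useful to others, sampling an aligned subset (which is my understanding of what random_pickup.py does) can also be sketched in plain shell. The file names below are toy placeholders, not the repository's actual paths, and the subset size is scaled down for illustration:

```shell
# Toy parallel corpus standing in for train.de / train.en.
for i in $(seq 1 1000); do echo "de satz $i"; done > toy.de
for i in $(seq 1 1000); do echo "en sentence $i"; done > toy.en

# Pair the files line by line, shuffle the pairs once, and keep the
# first 100, so source and target sides stay aligned.
paste toy.de toy.en | shuf -n 100 > toy.sample
cut -f1 toy.sample > toy.100.de
cut -f2 toy.sample > toy.100.en
```

The key point is shuffling the paired lines rather than each file independently; otherwise the sampled source and target sentences would no longer correspond.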
No. I'm sorry, I now see that the README contains some misleading descriptions; I wrote README.md before submitting the camera-ready copy... I will rewrite it tomorrow.
./setup_sentencepiece.sh opus_acquis_sp16000.inD.100k translation
./train_cbow.sh opus_acquis_sp16000.inD.100k translation
./preprocess.sh opus_acquis_sp16000.inD.100k translation
are the correct commands for the in-domain preprocessing (w/ 100k parallel data, no monolingual data).
As the paper describes, the amount of data used to train Sentencepiece and word2vec depends on the setting. To prepare the in-domain Sentencepiece models (and word2vec vectors) for OPUS-Acquis, we used only the in-domain data available in each setting (e.g., 100k sentences in the 100k setting).
This is because the quality of in-domain subword tokenization also matters, and we wanted to see how it changes with the size of the available in-domain resources.
FYI, simply running ./train.sh $model_id translation automatically runs all of the preprocessing except the download step. For example:
# [Out-domain, in Table 3]
./train.sh opus_it_sp16000.outD.all translation
# [In-domain, in Table 3, 100k]
./train.sh opus_acquis_sp16000.inD.100k translation
# [VA-LLM, in Table 3, 100k] (need to train Out-domain)
./train.sh opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k translation
Thanks for your detailed response! I have started the training process successfully.
Hello, sorry to bother you, but I have run into another error. I successfully trained the opus_it_sp16000.outD.all model and the opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k model, but when I ran the ./generate_many.sh opus_it_sp opus_law_sp translation command as instructed by the README.md, I got the following error:
(vocab_adpt) lijh@3090:~/vocabulary_adaptation$ CUDA_VISIBLE_DEVICES=2 bash ./generate_many.sh opus_it_sp opus_acquis_sp translation
Running './setup_sentencepiece.sh opus_it_sp16000@opus_acquis_sp16000.noadapt.all translation'...
Running './preprocess.sh opus_it_sp16000@opus_acquis_sp16000.noadapt.all translation'...
Creating binary files with fairseq format to 'dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/fairseq.all'...
Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/fairseq.all', extra_features={}, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='de', srcdict='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.de.txt', target_lang='en', task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/test', tgtdict='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.en.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/train.all', user_dir=None, validpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dev', workers=8)
Traceback (most recent call last):
File "fairseq/preprocess.py", line 293, in <module>
cli_main()
File "fairseq/preprocess.py", line 289, in cli_main
main(args)
File "fairseq/preprocess.py", line 77, in main
src_dict = task.load_dictionary(args.srcdict)
File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/tasks/fairseq_task.py", line 35, in load_dictionary
return Dictionary.load(filename)
File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 185, in load
d.add_from_file(f, ignore_utf_errors)
File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 202, in add_from_file
raise fnfe
File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 196, in add_from_file
with open(f, 'r', encoding='utf-8') as fd:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.de.txt'
Do I need to train all the models, i.e. outD, ft, and va, before running generate_many.sh? Thanks in advance.
No, you don't need to train all the models. This error comes from the evaluation scripts when only some of the settings have been prepared:
It was caused by the failure to run the Out-domain setting in generate_many.sh. If you want to test only the VA-LLM setting, ./generate.sh opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k translation will work and will write the results to ${model_root}/${model_name}/tests/${tgt_domain}.outputs.
(You might not be interested in the details.) To be more specific, the message shows that the vocabulary lists (dict.en.txt and dict.de.txt, prepared in train_cbow.sh) for the Out-domain model were missing from dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all; this is the directory for in-domain data encoded with the out-domain subword tokenization. I have now added a few lines to generate_many.sh and train_cbow.sh to prepare them automatically, so please retry generate_many.sh, or, if you want to evaluate the Out-domain setting, manually copy the vocabulary files from dataset/koehn17six/processed/it/sp/v_opus_it_sp16000_all.
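For concreteness, the manual copy might look like the following. It is shown here in a scratch directory with stand-in dict files so the commands are runnable on their own; in the repository, the two directories are dataset/koehn17six/processed/it/sp/v_opus_it_sp16000_all and dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all:

```shell
root=$(mktemp -d)
src_dir="$root/it/sp/v_opus_it_sp16000_all"      # out-domain data directory
dst_dir="$root/acquis/sp/v_opus_it_sp16000_all"  # in-domain data, out-domain vocab
mkdir -p "$src_dir" "$dst_dir"
printf 'example 1\n' > "$src_dir/dict.de.txt"    # stand-ins for the real dicts
printf 'example 1\n' > "$src_dir/dict.en.txt"

# The actual fix: copy the out-domain vocabulary files across.
cp "$src_dir/dict.de.txt" "$src_dir/dict.en.txt" "$dst_dir/"
```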
The lack of explanation and preparation for reproduction is my fault; I appreciate your report.
Hello, thanks for your great work! Could you provide a detailed script illustrating how to prepare the translation dataset?