jack-and-rozz / vocabulary_adaptation


Scripts to prepare the dataset #1

Open Saltychtao opened 3 years ago

Saltychtao commented 3 years ago

Hello, thanks for your great work! Can you provide a detailed script to illustrate the way to prepare the translation dataset?

Saltychtao commented 3 years ago

Sorry, I've found the scripts.

jack-and-rozz commented 3 years ago

Thank you for your attention to our work!

Saltychtao commented 3 years ago

Hello, thanks for your help! I am running the data-processing commands following the instructions in the README:

 ./scripts/dataset/opus_koehn/setup_dataset.sh

 # Train sentencepiece for each domain.
 ./setup_sentencepiece.sh opus_it_sp16000.outD.all translation
 ./setup_sentencepiece.sh opus_acquis_sp16000.inD.all translation

 # Train CBoW vectors for each domain.
 ./train_cbow.sh opus_it_sp16000.outD.all translation
 ./train_cbow.sh opus_acquis_sp16000.inD.all translation  

 # Binarize the datasets for fairseq.
 ./preprocess.sh opus_it_sp16000.outD.all translation
 ./preprocess.sh opus_acquis_sp16000.inD.100k translation

and got an error at the last command:

Running './preprocess.sh opus_acquis_sp16000.inD.100k translation'...
Traceback (most recent call last):
  File "scripts/random_pickup.py", line 52, in <module>
    main(args)
  File "scripts/random_pickup.py", line 5, in main
    src = [l for l in open(args.src_file)]
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/koehn17six/processed/acquis/sp/v_opus_acquis_sp16000_100k/train.de'

Should I copy the corresponding file from v_opus_acquis_sp16000_all, or run ./setup_sentencepiece.sh opus_acquis_sp16000.inD.100k translation (changing all to 100k) before running this command?

jack-and-rozz commented 3 years ago

Although the OPUS translation corpus [Koehn, WMT'17] is a well-known dataset for NMT domain adaptation, as far as I know the author's download page is unfortunately no longer available. I used a copy I had previously obtained from other researchers. For that reason I could not include a complete preprocessing script for the De-En setting in this repository, and for the time being I only described the instructions for En-Ja translation from JESC to ASPEC (the script should have raised an error when there is no data under "dataset/koehn17six").

I'm going to upload the De-En translation data we used and update the README and the scripts. I will leave a message here when that is done, so please wait for a while.

Saltychtao commented 3 years ago

Thanks for your response! In fact, I have downloaded the OPUS De-En dataset from https://github.com/JunjieHu/dali, the repository of Hu et al. (2019), and put it in 'dataset/koehn17six', so I guess the error may come from the "preprocess.sh" script.

To make it clear, the sentencepiece model should be trained on the full corpus (700k for Acquis) for both the 100k and 700k settings, right? If so, I can just copy the missing "train.de" file from the v_opus_acquis_sp16000_all directory and then randomly pick 100k samples with the random_pickup.py script to select a subset.

jack-and-rozz commented 3 years ago

No. I'm sorry, I now realize there are some misleading descriptions, since I wrote README.md before submitting the camera-ready copy... I will rewrite it tomorrow.

./setup_sentencepiece.sh opus_acquis_sp16000.inD.100k translation
./train_cbow.sh opus_acquis_sp16000.inD.100k translation 
./preprocess.sh opus_acquis_sp16000.inD.100k translation 

are the correct commands for the in-domain preprocessing (with 100k parallel data and no monolingual data).

As described in the paper, the amount of data used for training Sentencepiece and word2vec depends on the setting. To prepare the in-domain Sentencepiece models (and word2vec) for OPUS-Acquis, we used only the in-domain data available in each setting (e.g., the 100k parallel sentences in the 100k setting).

This is because the quality of in-domain subword tokenization is also important, and we aimed to see how it changes depending on the size of the available in-domain resources.
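
For intuition, here is a rough sketch of what training an in-domain subword model on only the 100k subset looks like with the standard spm_train CLI. This is not the repository's setup_sentencepiece.sh; the input paths and the output prefix are placeholders I made up.

# Hypothetical sketch: train a 16k-piece SentencePiece model on only the raw
# 100k in-domain sentence pairs (paths are placeholders, not the repo's layout).
RAW=dataset/koehn17six/acquis
cat $RAW/train.100k.de $RAW/train.100k.en > /tmp/acquis_100k.txt
spm_train --input=/tmp/acquis_100k.txt \
          --model_prefix=/tmp/opus_acquis_sp16000_100k \
          --vocab_size=16000 \
          --model_type=unigram \
          --character_coverage=1.0

The point is only that the subword inventory itself is induced from the 100k in-domain sentences, not from the full 700k corpus.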

FYI, just running ./train.sh $model_id translation will automatically run all preprocessing steps except downloading. For example:

# [Out-domain, in Table 3]
./train.sh opus_it_sp16000.outD.all translation            
# [In-domain, in Table 3, 100k]
./train.sh opus_acquis_sp16000.inD.100k translation 
# [VA-LLM, in Table 3, 100k] (need to train Out-domain)
./train.sh opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k translation
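
To make that behaviour concrete, here is a simplified sketch of the control flow (my own illustration, not the actual train.sh; the existence checks and paths are placeholders): each preprocessing step runs only if its output is missing, and training is launched afterwards.

# Illustrative only -- not the actual train.sh.
model_id=$1   # e.g. opus_acquis_sp16000.inD.100k
task=$2       # e.g. translation

[ -e "path/to/sentencepiece.model" ] || ./setup_sentencepiece.sh "$model_id" "$task"   # placeholder check
[ -e "path/to/cbow.vec" ]            || ./train_cbow.sh "$model_id" "$task"            # placeholder check
[ -e "path/to/fairseq.bin" ]         || ./preprocess.sh "$model_id" "$task"            # placeholder check
# ...then launch fairseq training with options derived from $model_id.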

Saltychtao commented 3 years ago

Thanks for your detailed response! I have started the training process successfully.

Saltychtao commented 3 years ago

Hello, sorry to bother you, but I ran into another error. I have successfully trained the opus_it_sp16000.outD.all model and the opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k model, but when I ran the ./generate_many.sh opus_it_sp opus_acquis_sp translation command as instructed in the README.md, I got the following error:

(vocab_adpt) lijh@3090:~/vocabulary_adaptation$ CUDA_VISIBLE_DEVICES=2 bash ./generate_many.sh opus_it_sp opus_acquis_sp translation
Running './setup_sentencepiece.sh opus_it_sp16000@opus_acquis_sp16000.noadapt.all translation'...
Running './preprocess.sh opus_it_sp16000@opus_acquis_sp16000.noadapt.all translation'...
Creating binary files with fairseq format to 'dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/fairseq.all'...
Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/fairseq.all', extra_features={}, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='de', srcdict='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.de.txt', target_lang='en', task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/test', tgtdict='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.en.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/train.all', user_dir=None, validpref='dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dev', workers=8)
Traceback (most recent call last):
  File "fairseq/preprocess.py", line 293, in <module>
    cli_main()
  File "fairseq/preprocess.py", line 289, in cli_main
    main(args)
  File "fairseq/preprocess.py", line 77, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/tasks/fairseq_task.py", line 35, in load_dictionary
    return Dictionary.load(filename)
  File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 185, in load
    d.add_from_file(f, ignore_utf_errors)
  File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 202, in add_from_file
    raise fnfe
  File "/home/lijh/vocabulary_adaptation/fairseq/fairseq/data/dictionary.py", line 196, in add_from_file
    with open(f, 'r', encoding='utf-8') as fd:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all/dict.de.txt'

Do I need to train all the models, i.e., outD, ft, and va, before running generate_many.sh? Thanks in advance.

jack-and-rozz commented 3 years ago

No, you don't need to train all the models. The error comes from the evaluation scripts when only some of the settings have been prepared:

  1. This was caused by the failure of the Out-domain setting inside generate_many.sh. If you only want to test the VA-LLM setting, ./generate.sh opus_it_sp16000@opus_acquis_sp16000.va.v_opus_acquis_sp16000_100k.llm-idt.nn10.100k translation will work and write the results to ${model_root}/${model_name}/tests/${tgt_domain}.outputs.

  2. (You might not be interested in the details, but) to be more specific, the message shows that the vocabulary lists (dict.en.txt and dict.de.txt, prepared in train_cbow.sh) for the Out-domain model were missing in dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all; this is the directory for in-domain data encoded with the out-domain subword tokenization. I have now added a few lines to generate_many.sh and train_cbow.sh to prepare them automatically, so please retry generate_many.sh, or manually copy the vocabulary files from dataset/koehn17six/processed/it/sp/v_opus_it_sp16000_all if you want to evaluate the Out-domain setting (see the sketch below).
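
If you take the manual route, the copy boils down to something like this (paths taken from the error message and the directory mentioned above; please verify them on your machine before running):

# Copy the out-domain vocabulary lists into the directory holding the
# in-domain data encoded with the out-domain subword model.
SRC_DIR=dataset/koehn17six/processed/it/sp/v_opus_it_sp16000_all
DST_DIR=dataset/koehn17six/processed/acquis/sp/v_opus_it_sp16000_all
cp $SRC_DIR/dict.de.txt $SRC_DIR/dict.en.txt $DST_DIR/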

The lack of explanation and preparation for reproduction is my fault; I appreciate your report.