neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Reproduction of Ja-En #14

Closed · twadada closed this issue 3 years ago

twadada commented 3 years ago

Hi, thanks a lot for providing this code.

I tried to reproduce the result Ja-En (fine-tune, bilingual) but got worse results (AER= 44.3). I used the following command:

CUDA_VISIBLE_DEVICES=0 python awesome_align/run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000

I omitted "--max_steps 40000" to ensure that the model is trained for exactly one epoch (but the checkpoint-40000 model produced a similar result: AER 44.7). For the training data, I concatenated the KFTT train, dev, tune, and test splits (all pre-tokenized), which amounts to 444k sentences, as written in the paper. Since "examples/jaen.src-tgt" is lowercased, I fine-tuned mBERT on the training data both with and without lowercasing and got similar results.
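For concreteness, here is a rough sketch of that preprocessing. The KFTT file names are assumptions on my part (adjust to your copy), and awesome-align expects one pre-tokenized sentence pair per line in "source ||| target" format, as in examples/jaen.src-tgt:

```python
# Sketch: build a single awesome-align training file from the four KFTT
# splits. The kyoto-*.ja / kyoto-*.en file names are assumptions based on
# the tokenized KFTT release; adjust paths to your local setup.
splits = ["train", "dev", "tune", "test"]

with open("jaen.src-tgt", "w", encoding="utf-8") as out:
    for split in splits:
        with open(f"kyoto-{split}.ja", encoding="utf-8") as src_f, \
             open(f"kyoto-{split}.en", encoding="utf-8") as tgt_f:
            for ja, en in zip(src_f, tgt_f):
                # One pair per line, "source ||| target". Lowercasing
                # matches examples/jaen.src-tgt; as noted above, results
                # were similar with and without it.
                out.write(f"{ja.strip().lower()} ||| {en.strip().lower()}\n")
```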

I also used the same command to reproduce the Zh-En result, and there I got results similar to those reported in the paper (AER = 15.0).

zdou0830 commented 3 years ago

Hi, could you set --max_steps to 40000 and see if you can reproduce the results? If not, could you send me your datasets (zdou0830@gmail.com) and I'll try to figure out the reason.

Also, just to confirm, could you reproduce the results of multilingual BERT without fine-tuning?

twadada commented 3 years ago

> Also, just to confirm, could you reproduce the results of multilingual BERT without fine-tuning?

Yes, I did for Ja-En.

However, I actually got a slightly different result for Zh-En: I got 17.9, while the paper reports 18.1. I think this has to do with how possible alignments are handled; they do not exist in Ja-En. In fact, running tools/aer.py with the --ignorePossible option produced 18.1, and without it 17.9. So I'm assuming you used the --ignorePossible option to calculate the AER, is that correct?
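For reference, the standard AER definition (Och & Ney, 2003) is AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|), where A is the set of predicted links, S the sure gold links, and P ⊇ S the possible gold links. Below is a minimal sketch of the two settings; this is my own illustration, not the code in tools/aer.py, and I'm assuming --ignorePossible amounts to scoring with P = S:

```python
# Sketch of AER (Och & Ney, 2003). Alignments are sets of (i, j) index pairs.
def aer(predicted, sure, possible=None):
    if possible is None:
        possible = sure  # "ignore possible": score against sure links only
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / (
        len(predicted) + len(sure)
    )

A = {(0, 0), (1, 2), (2, 1)}  # predicted links
S = {(0, 0), (1, 2)}          # sure gold links
P = S | {(2, 1)}              # possible gold links (superset of S)

print(aer(A, S, P))  # 0.0 -- the link (2, 1) is credited via P
print(aer(A, S))     # 0.2 -- the link (2, 1) is penalized when P is ignored
```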

zdou0830 commented 3 years ago

Thanks! Yes, for Zh-En that is correct.

twadada commented 3 years ago

So did you use the --ignorePossible option for all the language pairs? If so, I think that should be noted at least in the README, as it is not the standard way of calculating AER.

zdou0830 commented 3 years ago

Nope, I didn't use that option for any of the other language pairs.

twadada commented 3 years ago

I see, but I think it should be made clear that you used the option for Zh-En, so that others can compare scores under fair conditions (apologies if this is already mentioned somewhere). Thanks a lot for your quick responses!