neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License
321 stars 46 forks

Training is increasing my AER #36

Closed vigneshmj1997 closed 2 years ago

vigneshmj1997 commented 2 years ago

I have been training bert-base-cased with WMT data for de-en, fr-en, es-en, ja-en, ro-en, and zh-en, and my AER has increased:

enfr_output: 14.5% (87.3%/82.9%/5906)
jaen_output: 72.1% (49.3%/19.5%/5358)
zhen_output: 72.1% (70.0%/17.4%/2802)
roen_output: 34.5% (78.8%/56.1%/4409)

compared to the base model:

enfr_output: 5.6% (94.7%/94.0%/6015)
jaen_output: 45.6% (67.9%/45.4%/9079)
zhen_output: 17.9% (82.8%/81.3%/11173)
roen_output: 27.9% (85.2%/62.5%/4548)

Can anyone help me figure out where it went wrong?
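
(For context on the numbers above: lower AER is better. Below is a minimal sketch of the standard Och & Ney AER/precision/recall formulas, using hypothetical link sets; it is not the repo's evaluation script.)

```python
# Standard AER / precision / recall over alignment links (Och & Ney); lower AER is better.
# `sure`, `possible`, and `predicted` are hypothetical sets of (src_idx, tgt_idx) pairs;
# this is only a sanity-check sketch, not awesome-align's own evaluation code.
def alignment_scores(sure, possible, predicted):
    sure, predicted = set(sure), set(predicted)
    possible = set(possible) | sure      # sure links count as possible too
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    precision = a_and_p / len(predicted) if predicted else 0.0
    recall = a_and_s / len(sure) if sure else 0.0
    aer = 1 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
    return aer, precision, recall

# Toy example: a perfect prediction gives AER 0, precision 1, recall 1.
print(alignment_scores({(0, 0), (1, 1)}, set(), {(0, 0), (1, 1)}))
```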

zdou0830 commented 2 years ago

Hi, I'd suggest you:

1) check the training command, e.g. it's recommended to set max_steps < 50k;

2) check whether the training sentences are tokenized and whether the training/testing tokenizers are consistent;

3) debug in the monolingual settings and subsample <400k sentence pairs first (a subsampling sketch follows below).
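
A minimal sketch of the subsampling in 3), assuming your training file uses the one-pair-per-line `source ||| target` format (the file names are just placeholders):

```python
# Randomly subsample up to 400k sentence pairs from a parallel training file.
# Assumes one "source sentence ||| target sentence" pair per line;
# the input/output paths below are placeholders.
import random

random.seed(42)
with open("train.de-en.txt", encoding="utf-8") as f:
    pairs = [line for line in f if "|||" in line]

subset = random.sample(pairs, min(400_000, len(pairs)))
with open("train.de-en.400k.txt", "w", encoding="utf-8") as f:
    f.writelines(subset)
```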

vigneshmj1997 commented 2 years ago

I am fine-tuning the model from bert-base-cased and my parameters are the same as given in the README, so max_steps is 20000, and my training and testing tokenizers are consistent. Is there anything more I should consider while fine-tuning that could be increasing my AER?

zdou0830 commented 2 years ago

I think you can first try to reproduce the results in the paper, then debug in the monolingual settings. If AER still increases, it might be a domain mismatch issue.

Also, could you explain a bit more about how you make the tokenizers consistent? I think for the zh-en test data, the Chinese sentences were segmented into words first by an in-house tool and then manually corrected by humans, which makes it nearly impossible to keep the tokenizers consistent when using external training data.
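
As a rough heuristic for spotting such a segmentation mismatch (just a sketch, not part of the repo; the file paths are placeholders), you could compare average characters per token between the training and test Chinese sides:

```python
# Heuristic check for word-segmentation mismatch: if the average characters-per-token
# differs a lot between two files, their segmentations likely disagree.
# Paths are placeholders; files are assumed whitespace-tokenized,
# with the training file in "source ||| target" format (side 0 = source).
def chars_per_token(path, side=0):
    chars = tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.split("|||")[side] if "|||" in line else line
            words = text.split()
            tokens += len(words)
            chars += sum(len(w) for w in words)
    return chars / tokens if tokens else 0.0

print("train zh:", chars_per_token("train.zh-en.txt", side=0))
print("test  zh:", chars_per_token("test.zh-en.txt", side=0))
```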