neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

some confusions #17

Closed leileilin closed 3 years ago

leileilin commented 3 years ago

Hello, I have read your paper and GitHub project carefully. I find that you do not use a large Chinese-English parallel corpus in your paper, but only the annotated Chinese-English data. Is that right?

zdou0830 commented 3 years ago

Thanks! Yes, as in the paper, we treat the evaluation set from TsinghuaAligner (http://nlp.csai.tsinghua.edu.cn/~ly/systems/TsinghuaAligner/TsinghuaAligner.html) as our training data (~40k parallel sentences, as in Table 1) and use the test set from Liu and Sun [2015].

leileilin commented 3 years ago

I have another question. Your training set and test set were both created by Tsinghua University, so is your test set the 500 sentences split off from the 40k?

zdou0830 commented 3 years ago

If you download their dataset, you will see that there are two folders named "v1" and "v2". We used the test data in "v1" as our test data and all the data in "v2" (which contains the data in "v1") as our training data.
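
For anyone preparing that data for awesome-align: the repo expects one sentence pair per line in "source ||| target" form, with both sides already tokenized. Below is a minimal Python sketch of that conversion; the `v2/train.zh` and `v2/train.en` file names are hypothetical placeholders, since the TsinghuaAligner download may name its files differently.

```python
# Minimal sketch: merge a source-side file and a target-side file into the
# "source ||| target" per-line format that awesome-align takes as input.
def merge_parallel(src_path: str, tgt_path: str, out_path: str) -> None:
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for src, tgt in zip(f_src, f_tgt):
            # Both sides should already be tokenized and whitespace-separated.
            f_out.write(f"{src.strip()} ||| {tgt.strip()}\n")

# e.g. merge_parallel("v2/train.zh", "v2/train.en", "train.zh-en")  # paths are hypothetical
```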

leileilin commented 3 years ago

Well, thank you. I have another question. You mention unsupervised fine-tuning on a large parallel corpus in the GitHub project, but you didn't do that, did you? That is to say, you don't know the effect of first fine-tuning on a large parallel corpus and then fine-tuning on the annotated datasets?

zdou0830 commented 3 years ago

All of our main results (Table 2 and the table in the GitHub repo) are obtained by first performing unsupervised training with the training data (the data in the "v2" folder in this case) and then directly testing the fine-tuned models on the test data. Note that we did not use their word alignment annotations in these settings.

We did also experiment with word alignment in supervised settings, where we did use the word alignment annotations, as you can see in Table 6, though that is not the main focus of our paper.

When we train on the v2 data we do not use their alignment information; we only use the parallel data they provide for unsupervised fine-tuning. Table 6 also reports the results of supervised training that uses the alignment information, but that is not the focus of our paper.
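
For reference, the gold alignments in the v1 test set only enter at evaluation time, where the paper reports AER. A minimal sketch of that metric, assuming predicted and gold alignments are given as sets of (i, j) index pairs with sure (S) and possible (P) gold links:

```python
# Alignment Error Rate: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|),
# where A is the predicted alignment, S the sure gold links and P the
# possible gold links (S is a subset of P).
def aer(predicted: set, sure: set, possible: set) -> float:
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    return 1.0 - (a_and_s + a_and_p) / (len(predicted) + len(sure))

# e.g. aer({(0, 0), (1, 2)}, sure={(0, 0)}, possible={(0, 0), (1, 2)}) == 0.0
```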

leileilin commented 3 years ago

Thank you very much! So you only used the 40k Chinese-English parallel sentences for unsupervised fine-tuning? By the way, what exactly does the w/o setting refer to? You don't mention it in the paper.

zdou0830 commented 3 years ago

Yes. The w/o setting corresponds to "w/o fine-tuning" (without fine-tuning) in Table 2, i.e., computing alignments directly with mBERT without any fine-tuning.
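
To make that w/o fine-tuning setting concrete, below is a minimal Python sketch of extracting alignments from off-the-shelf mBERT with the softmax method: encode both sentences, take a middle layer's hidden states, softmax the similarity matrix in both directions, and keep pairs above a threshold in both. The 8th layer and the 0.001 threshold follow the paper's description, but treat the details (and the fact that indices here are subword positions rather than word positions) as simplifying assumptions, not the repo's exact implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained(
    "bert-base-multilingual-cased", output_hidden_states=True
)
model.eval()

def align(src_sent: str, tgt_sent: str, layer: int = 8, threshold: float = 1e-3):
    # Encode each sentence separately with mBERT.
    src = tokenizer(src_sent, return_tensors="pt")
    tgt = tokenizer(tgt_sent, return_tensors="pt")
    with torch.no_grad():
        # hidden_states[layer] has shape (1, seq_len, hidden); drop [CLS]/[SEP].
        h_src = model(**src).hidden_states[layer][0, 1:-1]
        h_tgt = model(**tgt).hidden_states[layer][0, 1:-1]
    sim = h_src @ h_tgt.T
    # Softmax over target positions and over source positions.
    p_src2tgt = torch.softmax(sim, dim=-1)
    p_tgt2src = torch.softmax(sim, dim=0)
    # Keep pairs that exceed the threshold in both directions (intersection).
    keep = (p_src2tgt > threshold) & (p_tgt2src > threshold)
    # Note: i, j are subword indices, not original word indices.
    return [(i, j) for i, j in keep.nonzero().tolist()]

# e.g. align("我 喜欢 音乐", "I like music")
```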

leileilin commented 3 years ago

OK, understood. Thank you very much!