neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

There is a small problem about the training data format. #29

Closed · genbei closed this issue 2 years ago

genbei commented 2 years ago

Because I have never trained the pretrained model before, I have a small question about what the parallel data input format for TRAIN_FILE=/path/to/train/file looks like. Is a separator needed between src and tgt, and what is the format? In addition, is it possible to fine-tune the xlm-roberta-large model?

Also, the xlm-roberta-base directory contains these files: config.json, gitattributes, pytorch_model.bin, sentencepiece.bpe.model, tokenizer.json

These messages appeared while running:

10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/vocab.txt. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/added_tokens.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/special_tokens_map.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/tokenizer_config.json. We won't load it.

Now I really want to continue training the xlm-roberta-base model on bilingual data with the TLM task, and I would like to ask for your advice.

zdou0830 commented 2 years ago

As in the README, the inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation, separated by |||. You can see some examples in the examples folder.
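For illustration, a training file might contain lines like the following (a made-up De-En pair, whitespace-tokenized on both sides, with ||| as the separator; the language pair and sentences are only an example of the format):

    wir glauben nicht , daß wir nur rosinen herauspicken sollten . ||| we do n't believe that we should cherry-pick .
    das ist ein beispiel . ||| this is an example .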

I haven't fine-tuned xlm-roberta-large before, but I think you can use the code in the xlmr branch and tune some parameters (e.g. align_layer, learning_rate, max_steps) and see if you can get reasonable performance.
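For reference, a minimal sketch of how such a fine-tuning run might be launched, adapted from the training recipe in the README (flag names such as --train_tlm and --train_so come from that recipe; the paths and hyperparameter values below are placeholders you would need to tune, and running with an XLM-R model assumes the xlmr branch mentioned above):

    # placeholders: point these at your own parallel data and output location
    TRAIN_FILE=/path/to/train/file
    OUTPUT_DIR=/path/to/output/directory

    CUDA_VISIBLE_DEVICES=0 awesome-train \
        --output_dir=$OUTPUT_DIR \
        --model_name_or_path=xlm-roberta-base \
        --extraction 'softmax' \
        --do_train \
        --train_tlm \
        --train_so \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --num_train_epochs 1 \
        --learning_rate 2e-5 \
        --save_steps 4000 \
        --max_steps 20000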