neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

There is a small problem about the training data format. #29

Closed · genbei closed this issue 2 years ago

genbei commented 2 years ago

Because I have never trained the pretrained model before, I have a small question about what the parallel data input format for TRAIN_FILE=/path/to/train/file looks like. Is a separator needed between src and tgt, and what is the format? In addition, is it possible to fine-tune the xlm-roberta-large model?

Also, the xlm-roberta-base directory contains these files: config.json, gitattributes, pytorch_model.bin, sentencepiece.bpe.model, tokenizer.json

These messages appeared while running:

10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/vocab.txt. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/added_tokens.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/special_tokens_map.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/tokenizer_config.json. We won't load it.

Now I really want to continue training the xlm-roberta-base model on bilingual data with the TLM task, and I would like to ask for your advice.

zdou0830 commented 2 years ago

As in the README, the inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation, separated by |||. You can see some examples in the examples folder.
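For illustration, a training file might contain lines like the following (a made-up De-En pair, whitespace-tokenized on both sides, with ||| as the separator; the language pair and sentences are only an example of the format):

    wir glauben nicht , daß wir nur rosinen herauspicken sollten . ||| we do n't believe that we should cherry-pick .
    das ist ein beispiel . ||| this is an example .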

I haven't fine-tuned xlm-roberta-large before, but I think you can use the code in the xlmr branch and tune some parameters (e.g. align_layer, learning_rate, max_steps) and see if you can get reasonable performance.
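For reference, a minimal sketch of how such a fine-tuning run might be launched, adapted from the training recipe in the README (flag names such as --train_tlm and --train_so come from that recipe; the paths and hyperparameter values below are placeholders you would need to tune, and running with an XLM-R model assumes the xlmr branch mentioned above):

    # placeholders: point these at your own parallel data and output location
    TRAIN_FILE=/path/to/train/file
    OUTPUT_DIR=/path/to/output/directory

    CUDA_VISIBLE_DEVICES=0 awesome-train \
        --output_dir=$OUTPUT_DIR \
        --model_name_or_path=xlm-roberta-base \
        --extraction 'softmax' \
        --do_train \
        --train_tlm \
        --train_so \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --num_train_epochs 1 \
        --learning_rate 2e-5 \
        --save_steps 4000 \
        --max_steps 20000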