koukoulala opened this issue 2 years ago
Hi,
Since xprophetnet is an enc-dec model, this should be fairly straightforward. Firstly, use the dev branch which will soon be merged with the main branch.
If you wish to pre-train your own prophetnet, you can use my code as it is, but your main challenge will be modifying the loss function.
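For illustration only, here is a rough PyTorch sketch of the kind of future n-gram prediction loss the ProphetNet objective uses; the `stream_logits` layout, the uniform averaging over streams, and the padding handling are simplifying assumptions, not the toolkit's actual code:

```python
# Rough sketch of a future n-gram prediction loss (NOT the toolkit's actual
# implementation). Each predicting stream n is trained to predict the token
# n positions beyond the usual next-token target, and the per-stream
# cross-entropies are averaged uniformly.
import torch.nn.functional as F

def future_ngram_loss(stream_logits, labels, ngram=2, pad_id=0):
    """stream_logits: (ngram, batch, seq_len, vocab); labels: (batch, seq_len)."""
    losses = []
    for n in range(ngram):
        # Build the target for stream n by shifting labels n steps to the left
        # and padding the tail, so those positions are ignored by the loss.
        shifted = labels.new_full(labels.shape, pad_id)
        shifted[:, : labels.size(1) - n] = labels[:, n:]
        losses.append(
            F.cross_entropy(
                stream_logits[n].reshape(-1, stream_logits.size(-1)),
                shifted.reshape(-1),
                ignore_index=pad_id,
            )
        )
    return sum(losses) / ngram
```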
If you want to do pre-training EXACTLY like the ProphetNet paper, then you will have to modify the generate_batches_monolingual_masked method in common_utils.py so that tokenization is done exactly as in the paper. If you want to pre-fine-tune an existing xprophetnet on your own monolingual data, then modifying generate_batches_monolingual_masked is also very important.
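As a toy example of the kind of span masking such a batch generator performs (the mask ratio, the single `[MASK]` placeholder, and the function name below are assumptions, not the toolkit's actual behaviour):

```python
# Toy illustration of span masking for encoder-decoder denoising pre-training
# (NOT the toolkit's generate_batches_monolingual_masked).
import random

def mask_span(tokens, mask_token="[MASK]", mask_ratio=0.35):
    """Hide one contiguous span of ~mask_ratio of the tokens; the decoder's
    job is then to reconstruct the hidden span from the corrupted source."""
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - span_len)
    target = tokens[start:start + span_len]            # what the decoder predicts
    source = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return source, target
```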
If you wish to directly fine-tune xprophetnet then you will need to do the following:
As for your doubts:
Overall, I hope these points are useful to you. Feel free to make changes, test them and send a PR ;)
Very useful answers! I plan to generate a large total vocab and use mBART to pre-train on my corpus first, and then introduce the xProphetNet model if the performance is not ideal.
Thanks.
Hi,
Some tips for you in case you need them:
Good luck!
Hi, this is a very helpful toolkit; I have learned a lot from it.
Recently, I have been focusing on multilingual title-generation tasks and found that the xProphetNet model performs well, especially on the XGLUE benchmark. I want to distill a small xProphetNet model and pre-train it on my own dataset. However, I did not find the relevant pre-training code, so I would like to ask whether you would consider adding pre-training for the xProphetNet model. I can provide the code for the model architecture and the fine-tuning process (which can reproduce the results).
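For the distillation part, the kind of word-level loss I have in mind is roughly the following (just a rough sketch, not tied to this toolkit; `alpha` and `temperature` are placeholder hyper-parameters, and padding is not masked in the KL term for brevity):

```python
# Generic word-level distillation loss: mix the usual cross-entropy on the
# gold labels with a KL term that pulls the student's distribution toward
# the teacher's.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0, pad_id=0):
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,
    )
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```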
As for possible doubts: I considered using the mBART model, but the pre-trained mBART tokenizer is language-specific, and my own dataset cannot be separated by language for training. I also considered putting all the data in the same file to generate a unified tokenizer, but I was concerned that the relative reduction in vocab_size might affect the model's effectiveness. Do you have any suggestions?
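Concretely, the unified tokenizer I was thinking of would be trained with something like this (file names and vocab_size are placeholders):

```python
# Train one shared SentencePiece model over the concatenation of all
# languages, with a vocab_size large enough that no single language is
# starved of subwords.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="all_languages_concatenated.txt",   # every language in one file
    model_prefix="unified_spm",
    vocab_size=64000,                         # kept large for many languages
    character_coverage=0.9995,                # covers diverse scripts
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unified_spm.model")
print(sp.encode("a multilingual sentence", out_type=str))
```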
Thanks