wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

questions about some implementations of PLBART #42

Closed freexxxyyy closed 2 years ago

freexxxyyy commented 2 years ago
  1. In the paper, it says "We mask 35% of the tokens in each instance." But it is "--mask 0.3" in https://github.com/wasiahmad/PLBART/blob/main/pretrain/pretrain.sh. Is there something I am misunderstanding?
  2. Can the pre-processing method be used for other languages, such as C? I used your method to train sentencepiece on a C corpus, and it gives me the error:

Vocabulary size is smaller than required_chars. 50000 vs 693453. Increase vocab_size or decrease character_coverage with --character_coverage option. 

Also, here are some related logs:

trainer_interface.cc(466) LOG(INFO) all chars count=5475384381
trainer_interface.cc(477) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(487) LOG(INFO) Alphabet size=693450
trainer_interface.cc(488) LOG(INFO) Final character coverage=1

  3. For the parameters of fairseq-train, apart from reading the source code, are there any other documents that describe these options? I found https://fairseq.readthedocs.io/en/latest/command_line_tools.html, but it is incomplete; many of the parameters you use are not shown in that document.
  4. Also, do you know whether there are any provided scripts for using fairseq to pre-train other BERT-related models? Thanks.
wasiahmad commented 2 years ago
  1. We indeed used 0.35, but in the script we probably set 0.3.
  2. Yes, it should work for any data. From the error, it seems like your dataset is too small, so you should reduce the vocab size.
  3. I am not sure.
  4. I do not know.
freexxxyyy commented 2 years ago

Thanks for your timely response.

We indeed used 0.35, but in the script we probably set 0.3.

I thought there may not be much difference between 0.35 and 0.3.
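For anyone who wants the script to match the paper exactly, the relevant flag is the --mask value quoted above. A minimal sketch of the fairseq-train call in pretrain/pretrain.sh follows; the data-path variable and the task name are assumptions, and all other flags from the original invocation are omitted:

# sketch only; keep the rest of the original fairseq-train arguments as they are
fairseq-train "$DATA_DIR" \
    --task multilingual_denoising \
    --mask 0.35    # mask 35% of the tokens, as reported in the paper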

Yes, it should work for any data. From the error, it seems like your dataset is too small, so you should reduce the vocab size.

Regarding reducing the vocab size: I thought the error tells me to increase it, since the message says "Increase vocab_size or decrease character_coverage with --character_coverage option."
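In case it helps, here is a sketch of retraining SentencePiece on a C corpus using the two knobs the error message mentions. The file names and the exact coverage value are assumptions; --vocab_size and --character_coverage are standard spm_train options:

# either raise --vocab_size above the reported alphabet size, or lower
# --character_coverage so the rarest characters are dropped rather than
# forced into the vocabulary
spm_train \
    --input=c_corpus.txt \
    --model_prefix=sentencepiece_c \
    --vocab_size=50000 \
    --character_coverage=0.9995

With an alphabet of roughly 693k characters (likely from non-ASCII text in comments and string literals), a coverage slightly below 1.0 may already be enough to bring the required character set under the 50000 limit.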

I am not sure. I do not know.

Apart from reading the fairseq source code, do you have any recommendations for learning about fairseq?

wasiahmad commented 2 years ago

To learn about Fairseq, I think you should read the documentation. There is no tutorial that I can recommend.
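One trick that sometimes helps with flags missing from the command-line docs, such as --mask, is that task-, model-, and criterion-specific options are declared in the add_args methods of the corresponding fairseq source files, so they can be located by searching the installed package. A rough sketch, assuming fairseq is installed in the current environment:

# find where fairseq declares a given fairseq-train flag, e.g. --mask
grep -rn -- "--mask" "$(python -c 'import fairseq, os; print(os.path.dirname(fairseq.__file__))')/tasks"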