wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

questions about dict.txt and how data samples are specified #45

Closed: freexxxyyy closed this issue 1 year ago

freexxxyyy commented 1 year ago
  1. Which step generates the dict.txt? It seems to be generated during "fairseq-preprocess", but "fairseq-preprocess" also has a parameter "--srcdict $DICT_FILE".
  2. How does the model know where a data sample (a function, in this case) ends? It seems that you use "\n" as the separator, but functions themselves also contain "\n", so I am confused about this.
  3. Also, if I run fairseq-train with FILENAME_OF_FIRST_DATASAMPLE:FILENAME_OF_SECOND_DATASAMPLE:FILENAME_OF_THIRD_DATASAMPLE:...:FILENAME_OF_NTH_DATASAMPLE, will it work?

I am new to this. Thanks.

wasiahmad commented 1 year ago
  1. Check the binarize.sh file.
  2. We do not model newlines (except for Python, where we use a NEW_LINE token); see the sketch below.
  3. Study fairseq; I do not have the answer.
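
For illustration, a minimal sketch (not the repo's exact code; the function and file names are hypothetical) of the convention behind answer 2: every function is flattened onto a single line before binarization, so "\n" in the corpus file only ever separates samples and never occurs inside one.

```python
def to_one_line(function_source: str, lang: str) -> str:
    if lang == "python":
        # Python is indentation-sensitive, so line breaks are kept as a
        # literal NEW_LINE token instead of being discarded.
        function_source = function_source.replace("\n", " NEW_LINE ")
    # Collapse all remaining whitespace; the function now fits on one line.
    return " ".join(function_source.split())

java_fn = "int add(int a, int b) {\n  return a + b;\n}"
with open("train.java.functions", "w") as f:      # hypothetical file name
    f.write(to_one_line(java_fn, "java") + "\n")  # "\n" = sample boundary
```

fairseq-preprocess then reads this file one line at a time, so each line becomes one training sample.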
freexxxyyy commented 1 year ago

Thanks. But I want to know why you start from the 4th line and why you replace all the numbers with 100 in:

```bash
cut -f1 $SPM_VOCAB | tail -n +4 | sed "s/$/ 100/g" > $DICT_FILE
```

Also, I am still confused about how you split different training samples. How does the model know which part belongs to one training sample?

Thanks

wasiahmad commented 1 year ago

Because the first three tokens are <s>, </s>, and <unk>. The numbers do not matter, so I just used a constant.
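
For reference, fairseq's dictionary format is one "token count" pair per line; the count is only used for frequency thresholding, so any constant works. The command above therefore yields something like (tokens illustrative):

```
▁def 100
▁return 100
▁public 100
```

fairseq adds its own special symbols (<s>, <pad>, </s>, <unk>) when it loads the dictionary, which is why the SentencePiece copies of those tokens are skipped with tail -n +4.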

We do not split training samples; each sample is a function. If a function is lengthy (>512 tokens), we truncate it.

freexxxyyy commented 1 year ago

Thanks. But will the model treat each function as one training sample? How does the model know where a function ends?

wasiahmad commented 1 year ago

In fairseq, we use </s> to indicate end-of-sequence.
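
Concretely, a minimal sketch of that mechanism (assumes fairseq is installed; dict.txt is the file built earlier, and the token string is illustrative):

```python
from fairseq.data import Dictionary

# During binarization fairseq appends the </s> id to every line it reads,
# so each function (= one line in the text file) ends with an explicit
# end-of-sequence marker; the "\n" itself never becomes a token.
d = Dictionary.load("dict.txt")
ids = d.encode_line("▁def ▁add ( a , b ) :",
                    add_if_not_exist=False, append_eos=True)
assert ids[-1].item() == d.eos()  # the last id is </s>
```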

freexxxyyy commented 1 year ago

Where do you insert </s> between functions to indicate the end-of-sequence? I see that you add "\n" after each function in https://github.com/wasiahmad/PLBART/blob/main/data/github/preprocessing/src/utils.py#L128

Thanks