Thanks. But I want to know why you start from the 4th line, and why you replace all the numbers with 100, in this command:
cut -f1 $SPM_VOCAB | tail -n +4 | sed "s/$/ 100/g" >$DICT_FILE
Also, I am still confused about how you split the training samples. How does the model know which part belongs to a single training sample?
Thanks
Because the first three tokens are <s>, </s>, and <unk>, and the numbers do not matter, so I just used a constant.
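For readers following along, here is a minimal Python sketch of what that shell pipeline does, step by step. It is only an illustration, not code from the repository: the function name `build_dict_lines` and the sample vocabulary entries (including their scores) are made up, and it assumes the vocabulary file is tab-separated with the token in the first column.

```python
def build_dict_lines(vocab_lines, skip=3, count=100):
    """Rough equivalent of: cut -f1 | tail -n +4 | sed "s/$/ 100/g".

    vocab_lines: lines of a tab-separated vocabulary file (token first).
    skip:        number of leading special tokens to drop (<s>, </s>, <unk>).
    count:       constant placeholder written after every token.
    """
    # cut -f1: keep only the first tab-separated field of each line
    tokens = [line.rstrip("\n").split("\t", 1)[0] for line in vocab_lines]
    # tail -n +4: start at the 4th line, i.e. drop the first `skip` lines
    kept = tokens[skip:]
    # sed "s/$/ 100/g": append " 100" to the end of every remaining line
    return [f"{tok} {count}" for tok in kept]


# Hypothetical vocabulary content, for illustration only
vocab = ["<s>\t0", "</s>\t0", "<unk>\t0", "def\t-3.2", "return\t-3.9"]
print(build_dict_lines(vocab))  # ['def 100', 'return 100']
```

Each output line has the form "token number", which is why some number has to be written after every token even though its value is not used here.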
We do not split training samples; each sample is a function. If a function is lengthy (>512), we truncate it.
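A minimal sketch of that truncation rule, under two assumptions that the thread does not confirm: that the limit of 512 is counted in tokens, and that the beginning of the function is kept while the tail is dropped. The name `truncate` is hypothetical.

```python
MAX_LEN = 512  # limit mentioned in the thread; assumed here to be in tokens


def truncate(tokens, max_len=MAX_LEN):
    """Keep at most max_len tokens of one function (assumption: keep the head)."""
    return tokens[:max_len]


long_function = ["tok"] * 600   # a function longer than the limit
short_function = ["tok"] * 10   # a function within the limit
print(len(truncate(long_function)), len(truncate(short_function)))  # 512 10
```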
Thanks. But will the model treat each function as one training sample? How does the model know where a function ends?
In fairseq, we use </s> to indicate end-of-sequence.
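To make this concrete, here is a simplified sketch of the idea, assuming (as I understand fairseq's preprocessing) that each line of the input file is read as one sample and the end-of-sequence symbol is appended to it. This is not fairseq's actual code; `lines_to_samples` is a hypothetical helper and whitespace splitting stands in for the real tokenization.

```python
EOS = "</s>"  # end-of-sequence symbol


def lines_to_samples(text):
    """Turn a file's content into samples: one line = one sample, EOS appended."""
    samples = []
    for line in text.splitlines():
        tokens = line.split()          # stand-in for real subword tokenization
        samples.append(tokens + [EOS])  # mark where this sample ends
    return samples


# Two functions, each written on its own line of the data file
data = "def f ( ) : pass\ndef g ( ) : pass\n"
for sample in lines_to_samples(data):
    print(sample)
```

Under this assumption, a newline written after each function is what separates the samples in the file, and the end-of-sequence symbol is added when each line is encoded.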
Where do you insert </s> between functions to indicate the end-of-sequence? I see that you add "\n" after each function in https://github.com/wasiahmad/PLBART/blob/main/data/github/preprocessing/src/utils.py#L128
Thanks
I am new to this. Thanks.