Open gojkoc54 opened 10 months ago
I am also confused about this. It seems that the method `fim.permute()` returns samples of varying length, but they are later chunked into `seq_length`-sized windows via `all_token_ids[i : i + seq_length]`, producing samples like `<fim-prefix>xxxxxxx<fim-suffix>xxx` that contain no `<fim-middle>` token or the content that should follow it. Is this a trick for better generalization?
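To make the concern concrete, here is a minimal sketch (with made-up placeholder tokens and a toy `seq_length`, not the repo's actual tokenizer or values) of how fixed-length slicing can cut a FIM-formatted sample so that a chunk ends up with the prefix and suffix sentinels but no middle:

```python
# Hypothetical placeholder tokens standing in for the FIM special-token ids.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"

# One FIM-permuted sample: <fim-prefix> P... <fim-suffix> S... <fim-middle> M...
sample = (
    [FIM_PREFIX] + ["p"] * 7
    + [FIM_SUFFIX] + ["s"] * 3
    + [FIM_MIDDLE] + ["m"] * 4
)

seq_length = 12  # toy value; real training uses a much larger context

# Samples are concatenated into one stream, then sliced into fixed windows,
# mirroring the all_token_ids[i : i + seq_length] slicing described above.
all_token_ids = sample * 2
chunks = [
    all_token_ids[i : i + seq_length]
    for i in range(0, len(all_token_ids), seq_length)
]

# The first window ends before <fim-middle> is reached: it holds the prefix
# and suffix sentinels but no middle token at all.
print(chunks[0])
```

Running this, `chunks[0]` is `['<fim-prefix>', 'p', ..., '<fim-suffix>', 's', 's', 's']` with no `<fim-middle>`, which is exactly the malformed-looking sample shape described above.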
Hi,
I want to finetune my model on FIM-only data. If I use this repo for FIM data formatting, it seems it can frequently happen that a single chunk (i.e., a single element of `ConstantLengthDataset`) doesn't contain all the FIM components (or sometimes contains none of them), because long inputs need to be chunked. Does this "hurt" FIM training? Would it benefit from a different way of formatting/splitting the data so that all FIM components fit into a single chunk (and thus get passed to the model together)?
Thanks!