loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0

Formatting FIM data #22

Open gojkoc54 opened 10 months ago

gojkoc54 commented 10 months ago

Hi,

I want to fine-tune my model on FIM-only data. If I use this repo for FIM data formatting, it seems that a single chunk (i.e. a single element of ConstantLengthDataset) will frequently not contain all the FIM components, and sometimes none of them, because long inputs need to be chunked.

Does this "hurt" the FIM training? Would it benefit from a different way of formatting/splitting the data so that all FIM components fit into a single chunk (so that they get passed to the model together)?

Thanks!
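
For illustration only, here is a minimal sketch of the "different way of formatting" asked about above: packing whole FIM examples greedily into fixed-length chunks and padding the remainder, so that no example is cut at a chunk boundary. This is not what the repo does. The function name `pack_whole_examples`, the `PAD_ID` value, and the choice to skip over-long examples are all assumptions made for this sketch.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, chosen only for this sketch


def pack_whole_examples(
    examples: List[List[int]], seq_length: int, pad_id: int = PAD_ID
) -> List[List[int]]:
    """Greedily pack whole examples into chunks of exactly seq_length tokens.

    An example is never split across two chunks. Examples longer than
    seq_length cannot fit anywhere and are skipped in this sketch.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    for ex in examples:
        if len(ex) > seq_length:
            continue  # too long for a single chunk; skipped (an assumption)
        if len(current) + len(ex) > seq_length:
            # Close the current chunk and pad it up to seq_length.
            chunks.append(current + [pad_id] * (seq_length - len(current)))
            current = []
        current.extend(ex)
    if current:
        chunks.append(current + [pad_id] * (seq_length - len(current)))
    return chunks


# Toy token ids standing in for already-permuted FIM examples.
examples = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10] * 9]
chunks = pack_whole_examples(examples, seq_length=8)
for chunk in chunks:
    print(chunk)
```

The trade-off is that padded positions waste compute and would presumably need to be masked out of the loss, and that very long files are lost unless they are truncated or split before the FIM transformation is applied.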

hanlinGao commented 6 months ago

I am also confused about this. It seems that the method fim.permute() returns samples of different lengths, but in the end they are chunked to seq_length in all_token_ids[i: i + seq_length], resulting in samples like <fim-prefix>xxxxxxx<fim-suffix>xxx, which have no <fim-middle> and following content. Is this a trick for better generalization?
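
To make the effect concrete, here is a toy reproduction of the slicing described above. It is a sketch under stated assumptions, not the repo's code: tokens are plain strings, the helper names (`make_fim_example`, `chunk_constant_length`, `intact_flags`) are hypothetical, `<eos>` is a stand-in end-of-example marker, and the trailing partial chunk is simply dropped.

```python
from typing import List

PREFIX, SUFFIX, MIDDLE, EOS = "<fim-prefix>", "<fim-suffix>", "<fim-middle>", "<eos>"


def make_fim_example(prefix: List[str], suffix: List[str], middle: List[str]) -> List[str]:
    """Lay out one example in prefix-suffix-middle order."""
    return [PREFIX, *prefix, SUFFIX, *suffix, MIDDLE, *middle, EOS]


def chunk_constant_length(examples: List[List[str]], seq_length: int) -> List[List[str]]:
    """Concatenate all examples, then slice into seq_length-sized chunks."""
    all_tokens = [tok for ex in examples for tok in ex]
    # The trailing partial chunk is dropped in this sketch.
    return [
        all_tokens[i : i + seq_length]
        for i in range(0, len(all_tokens) - seq_length + 1, seq_length)
    ]


def intact_flags(examples: List[List[str]], seq_length: int) -> List[bool]:
    """For each example, report whether it lies entirely inside one chunk window."""
    flags, start = [], 0
    for ex in examples:
        end = start + len(ex)
        flags.append(start // seq_length == (end - 1) // seq_length)
        start = end
    return flags


examples = [
    make_fim_example(["a"] * 3, ["b"] * 2, ["c"] * 2),  # 11 tokens
    make_fim_example(["d"], ["e"], ["f"]),              # 7 tokens
    make_fim_example(["g"] * 2, ["h"] * 2, ["i"] * 2),  # 10 tokens
]
chunks = chunk_constant_length(examples, seq_length=12)
flags = intact_flags(examples, seq_length=12)

for chunk in chunks:
    print(" ".join(chunk))
print(flags)
```

In this toy setup only the first example stays whole. The second chunk starts in the middle of the second example and ends with `<fim-prefix> g g <fim-suffix> h h`, i.e. a prefix and suffix whose `<fim-middle>` falls outside the chunk, which is the pattern described in the comment above.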