Formatting FIM data - Githubissues

loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.

Apache License 2.0

179 stars 22 forks

Hi,

I want to fine-tune my model on FIM-only data. If I use this repo to format the FIM data, it seems that a single chunk (i.e., a single element of ConstantLengthDataset) will frequently not contain all of the FIM components, and sometimes will contain none of them, because long inputs have to be split across chunks.

Does this "hurt" FIM training? Would training benefit from a different way of formatting/splitting the data, so that all FIM components of an example fit into a single chunk and are passed to the model together?
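
To make the concern concrete, here is a minimal sketch in plain Python. It is not the repo's code: the sentinel ids, the toy token lists, and the helper names (make_fim_example, chunk_concatenated, pack_whole_examples, sentinels_balanced) are all made up for illustration, and it assumes that chunking simply concatenates the formatted examples and cuts the stream at fixed offsets. The alternative shown packs whole examples into a chunk and pads the remainder.

```python
from typing import List

# Hypothetical sentinel ids, for illustration only (not SantaCoder's real vocabulary).
EOS, PREFIX, MIDDLE, SUFFIX, PAD = 0, 1, 2, 3, -1


def make_fim_example(prefix: List[int], middle: List[int], suffix: List[int]) -> List[int]:
    """Lay out one example as <prefix> p <suffix> s <middle> m <eos>."""
    return [PREFIX, *prefix, SUFFIX, *suffix, MIDDLE, *middle, EOS]


def chunk_concatenated(examples: List[List[int]], seq_length: int) -> List[List[int]]:
    """Concatenate all examples and cut at fixed offsets; examples may be split."""
    stream = [tok for ex in examples for tok in ex]
    last_start = len(stream) - seq_length
    return [stream[i:i + seq_length] for i in range(0, last_start + 1, seq_length)]


def pack_whole_examples(examples: List[List[int]], seq_length: int) -> List[List[int]]:
    """Greedily pack whole examples into chunks and pad the unused tail."""
    chunks: List[List[int]] = []
    current: List[int] = []
    for ex in examples:
        if len(ex) > seq_length:
            continue  # assumption: drop examples that can never fit in one chunk
        if len(current) + len(ex) > seq_length:
            chunks.append(current + [PAD] * (seq_length - len(current)))
            current = []
        current.extend(ex)
    if current:
        chunks.append(current + [PAD] * (seq_length - len(current)))
    return chunks


def sentinels_balanced(chunk: List[int]) -> bool:
    """True if the chunk holds the same nonzero number of each FIM sentinel and EOS."""
    counts = [chunk.count(tok) for tok in (PREFIX, SUFFIX, MIDDLE, EOS)]
    return counts[0] > 0 and len(set(counts)) == 1


# Three identical toy examples of 12 tokens each; content tokens are >= 10.
examples = [make_fim_example([10, 11, 12], [13, 14], [15, 16, 17]) for _ in range(3)]
seq_length = 16

fixed = chunk_concatenated(examples, seq_length)
packed = pack_whole_examples(examples, seq_length)

print("fixed-offset chunks with complete FIM examples:",
      sum(sentinels_balanced(c) for c in fixed), "of", len(fixed))
print("packed chunks with complete FIM examples:",
      sum(sentinels_balanced(c) for c in packed), "of", len(packed))
```

On this toy input, none of the fixed-offset chunks contains a balanced set of FIM sentinels, while every packed chunk does. The cost of packing is the padding (which would need to be masked out of the loss) and the dropped oversized examples.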

Thanks!

Formatting FIM data #22