I tried to train the Falcon-7b model based on the tutorial from Hugging Face (https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o?usp=sharing) with my own dataset.
When I loaded my dataset I got indexing errors, and the dataset appeared to be empty after being loaded into the trainer. Investigating the issue, I found that the SFT Trainer basically removes every tokenized sample whose length is less than max_seq_len inside the tokenize() function in _prepare_non_packed_dataloader(). The same thing happens with the openassistant-guanaco dataset used in the tutorial, but that dataset apparently has quite a few entries longer than max_seq_len, so those are kept and the dataset does not end up empty.
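To illustrate what I mean, here is a minimal standalone repro of that filtering logic as I understand it (the keep-only-full-length-chunks check is my paraphrase of what I saw in the trainer, not a verbatim copy, and the tokenizer and texts are just placeholders):

```python
from transformers import AutoTokenizer

# Any fast tokenizer works for this demo; "gpt2" is just a lightweight placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
max_seq_len = 512

texts = [
    "### Human: Hi!### Assistant: Hello, how can I help you today?",          # short sample
    "### Human: " + "tell me more about Falcon " * 300 + "### Assistant: ...",  # long sample
]

outputs = tokenizer(
    texts,
    truncation=True,
    max_length=max_seq_len,
    return_overflowing_tokens=True,
    return_length=True,
)

# Only chunks of exactly max_seq_len survive this check, so every sample
# (and every trailing chunk) shorter than max_seq_len is silently dropped.
kept = [
    input_ids
    for length, input_ids in zip(outputs["length"], outputs["input_ids"])
    if length == max_seq_len
]

print(f"{len(texts)} input samples -> {len(kept)} chunks kept after filtering")
```

With my dataset, where every sample is shorter than max_seq_len, nothing survives this filter, which explains the empty dataset and the indexing errors downstream.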
Is this intended behavior? Wouldn't it be better to pad the shorter samples than to remove them, or am I missing something?
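What I had expected is something along these lines, i.e. pad (or at least keep) the shorter samples rather than dropping them. This is only a sketch of the idea, assuming tokenizer, dataset_text_field and max_seq_len are in scope the way they are inside the trainer, not a proposed patch:

```python
def tokenize_keep_short(element):
    # Truncate long samples and pad short ones up to max_seq_len instead of dropping them.
    # Assumes tokenizer.pad_token is set (e.g. tokenizer.pad_token = tokenizer.eos_token).
    outputs = tokenizer(
        element[dataset_text_field],
        truncation=True,
        padding="max_length",
        max_length=max_seq_len,
    )
    # Every sample is kept; the attention mask marks the padded positions.
    return {
        "input_ids": outputs["input_ids"],
        "attention_mask": outputs["attention_mask"],
    }
```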