I tried to train the Falcon-7b model based on the tutorial from Hugging Face (https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o?usp=sharing) with my own dataset.
When I loaded my dataset I got indexing errors, and the dataset appeared to be empty after being loaded into the trainer. Investigating the issue, I found that the SFT Trainer basically removes every tokenized sample whose length is less than max_seq_len inside the tokenize() function in _prepare_non_packed_dataloader(). The same thing happens with the openassistant-guanaco dataset used in the tutorial, but that dataset apparently has quite a few entries longer than max_seq_len, so those are kept and the dataset does not end up empty.
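To illustrate what I mean, here is a minimal standalone repro of that filtering logic as I understand it (the keep-only-full-length-chunks check is my paraphrase of what I saw in the trainer, not a verbatim copy, and the tokenizer and texts are just placeholders):

```python
from transformers import AutoTokenizer

# Any fast tokenizer works for this demo; "gpt2" is just a lightweight placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
max_seq_len = 512

texts = [
    "### Human: Hi!### Assistant: Hello, how can I help you today?",          # short sample
    "### Human: " + "tell me more about Falcon " * 300 + "### Assistant: ...",  # long sample
]

outputs = tokenizer(
    texts,
    truncation=True,
    max_length=max_seq_len,
    return_overflowing_tokens=True,
    return_length=True,
)

# Only chunks of exactly max_seq_len survive this check, so every sample
# (and every trailing chunk) shorter than max_seq_len is silently dropped.
kept = [
    input_ids
    for length, input_ids in zip(outputs["length"], outputs["input_ids"])
    if length == max_seq_len
]

print(f"{len(texts)} input samples -> {len(kept)} chunks kept after filtering")
```

With my dataset, where every sample is shorter than max_seq_len, nothing survives this filter, which explains the empty dataset and the indexing errors downstream.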
Is this intended behavior? Wouldn't it be better to pad the shorter samples than to remove them, or am I missing something?
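What I had expected is something along these lines, i.e. pad (or at least keep) the shorter samples rather than dropping them. This is only a sketch of the idea, assuming tokenizer, dataset_text_field and max_seq_len are in scope the way they are inside the trainer, not a proposed patch:

```python
def tokenize_keep_short(element):
    # Truncate long samples and pad short ones up to max_seq_len instead of dropping them.
    # Assumes tokenizer.pad_token is set (e.g. tokenizer.pad_token = tokenizer.eos_token).
    outputs = tokenizer(
        element[dataset_text_field],
        truncation=True,
        padding="max_length",
        max_length=max_seq_len,
    )
    # Every sample is kept; the attention mask marks the padded positions.
    return {
        "input_ids": outputs["input_ids"],
        "attention_mask": outputs["attention_mask"],
    }
```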