stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.
Apache License 2.0

Last batch dropped during preprocessing #189

Closed: toizzy closed this issue 2 years ago

toizzy commented 2 years ago

batch_tokenize does not return the last batch, i.e. the remainder left over after dividing the input by the preprocessing batch_size. As a result, up to 1,000 lines can be silently dropped during preprocessing.

This would be fixed by adding a yield after the for loop in the batched() function at mistral/src/corpora/tokenization_utils.py#L22, as sketched below.
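
For reference, a minimal sketch of the fix, assuming batched() is a plain generator that chunks an iterable into lists of batch_size items (the signature and type hints here are illustrative, not the exact code in tokenization_utils.py):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `batch_size` items from `iterable`."""
    batch: List[T] = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Without this final yield, the trailing partial batch (the remainder after
    # dividing by batch_size) is silently dropped, which is the bug reported here.
    if batch:
        yield batch
```

For example, with a batch_size of 1,000 and an input of 2,500 lines, the version without the final yield produces only two batches (2,000 lines) and the last 500 lines never reach the tokenizer.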

dlwh commented 2 years ago

Fixed in fa404d7. It accidentally got dropped in a PR and I missed it in review.