Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.
`batch_tokenize` does not return the last batch, i.e. the remainder after dividing by the preprocessing `batch_size`, so up to 1,000 lines can be dropped during preprocessing.
This would be fixed by adding a `yield` after the for loop in the `batched()` function at mistral/src/corpora/tokenization_utils.py#L22.
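The exact code in tokenization_utils.py may differ, but as a minimal sketch, assuming `batched()` is a generator that accumulates items from an iterable into fixed-size lists, the fix is to yield the final partial batch once the loop ends:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Group an iterable into lists of at most `batch_size` items."""
    batch: List[T] = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Proposed fix: yield the remaining partial batch (the remainder after
    # dividing by batch_size) so the final < batch_size lines are not dropped.
    if batch:
        yield batch
```

With this trailing `yield`, an input of 2,500 lines and a `batch_size` of 1,000 produces three batches (1,000 + 1,000 + 500) instead of silently discarding the last 500 lines.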