versae opened this issue 2 months ago
@versae have you tried disabling it and seeing if that fixes it?
Yes, I now set `TOKENIZERS_PARALLELISM` to `false` in my setup scripts. It seems to help, but I'm not sure it's the definitive fix.
Interesting, ok, I guess it's time to give up on that then. Did you reduce the batch size?
Yes, for processing very very long documents (tens of millions of tokens) I had to set it to 1 and set `TOKENIZERS_PARALLELISM` to `False`. Slower, but at least it hasn't failed me yet. Is the batch size of the tokenizer something we can set in the config for the training?
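In other words, the slow-but-stable setup looks roughly like this; just a sketch with a stand-in tokenizer, not the actual training pipeline:

```python
# Sketch of the workaround: no Rust parallelism, effectively batch size 1.
# "gpt2" is a stand-in; the real run uses whatever tokenizer the config specifies.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_one_at_a_time(docs):
    # Feed the very long documents individually to keep peak memory low.
    for doc in docs:
        yield tokenizer(doc)["input_ids"]
```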
OK, I think I found a winning combination: setting `SLURM_CPUS_ON_NODE=16 TOKENIZERS_PARALLELISM=false` seems to work with the current batch size. On a TPU v4-32, 3 out of 4 nodes sometimes fail right after loading the weights, but the one that keeps running is able to finish the tokenization. So I just leave it running, and when it's done I restart training without `SLURM_CPUS_ON_NODE`.
The workers OOM a few times during the tokenization of a dataset with very long documents (over 1M chars), but succeed in the end after adjusting the batch size of `BatchTokenizer` and just retrying.
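The manual retry amounts to something like this; a hypothetical sketch, where `tokenize_batch` is a stand-in callable rather than a Levanter API, and it assumes the failure surfaces as a Python `MemoryError` instead of the worker being killed outright:

```python
# Hypothetical retry loop: halve the batch size on OOM and try again.
# `tokenize_batch` is a stand-in callable, not part of Levanter.
def tokenize_with_retry(docs, tokenize_batch, batch_size=512, min_batch_size=1):
    while batch_size >= min_batch_size:
        try:
            out = []
            for i in range(0, len(docs), batch_size):
                out.extend(tokenize_batch(docs[i : i + batch_size]))
            return out
        except MemoryError:
            batch_size //= 2  # back off and retry with smaller batches
    raise RuntimeError("tokenization failed even at the minimum batch size")
```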
@dlwh: Could this be the reason? https://github.com/stanford-crfm/levanter/blob/2516d06be6fec2ff5660f144649a9f5f577b06e9/src/levanter/data/text.py#L308-L312

It seems that in most cases it will enable multithreading in Rust.
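To be clear about what I mean, the behavior I'm asking about would look roughly like this; this is just my reading of it as a sketch, not the actual code at that link, and the function name here is made up:

```python
# Illustrative sketch only, not the Levanter source: logic of this shape would
# turn on the tokenizers' Rust thread pool whenever more than one CPU is
# available, unless the user has already set TOKENIZERS_PARALLELISM themselves.
import os

def maybe_enable_tokenizer_parallelism(num_cpus: int) -> None:  # hypothetical name
    if "TOKENIZERS_PARALLELISM" not in os.environ and num_cpus > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "true"  # enables Rust multithreading
```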