unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
18.59k stars · 1.3k forks

Counting untrained tokens? What is it doing? Took forever on large dataset, and repeating! #1125

Open thusinh1969 opened 1 month ago

thusinh1969 commented 1 month ago

Counting untrained tokens: 50%|█████████████████████████████████████▊ | 66000/132738 [05:19<05:28, 203.40 examples/s]

It says 5:28 remaining, but in fact it took 25 minutes more!

It always takes some time, up to an hour or so, to do this. On a large dataset of a few million rows it takes forever, and it repeats every time we restart or resume training.

What tokens is it counting, and why? Can we turn it off?

Thanks, Steve
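
For context, a hedged guess at what this step does (this is a sketch of the general idea, not unsloth's actual implementation): "counting untrained tokens" plausibly means scanning every tokenized example for token ids whose embedding rows look untrained (e.g. all zeros from newly added special tokens), so they can be fixed before finetuning. The function names below are hypothetical:

```python
# Hypothetical sketch, NOT unsloth's code: illustrate why a
# "Counting untrained tokens" pass is linear in total token count,
# which would explain the long runtime on multi-million-row datasets.
import numpy as np

def find_untrained_ids(embedding: np.ndarray) -> set:
    # Assumption: an all-zero embedding row means the token was never trained.
    return {i for i, row in enumerate(embedding) if not row.any()}

def count_untrained(dataset_token_ids, untrained_ids) -> int:
    # One full pass over every token of every example in the dataset.
    return sum(t in untrained_ids for ex in dataset_token_ids for t in ex)

emb = np.ones((10, 4))
emb[7] = 0.0                          # pretend token id 7 was never trained
untrained = find_untrained_ids(emb)   # {7}
data = [[1, 7, 3], [7, 7, 2]]
print(count_untrained(data, untrained))
```

Since the scan touches the whole dataset, re-running it from scratch on every restart or resume (rather than caching the result) would match the behavior described above.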

Sneakr commented 1 month ago

@thusinh1969 Maybe this is related: https://github.com/unslothai/unsloth/issues/658#issuecomment-2175416360