I found that some documents in the pre-training datasets contain a very large number of characters, which makes them very slow to encode. For example, one document with 15955671 characters takes 6.6 hours to encode.
How do you speed this up? Should I split the document into many sub-docs? I'm using Megatron to pre-train, so any ideas would be appreciated.
Looking forward to hearing from you when you have time. Thank you very much.
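A rough sketch of the split-into-sub-docs idea mentioned above, assuming an HF-style tokenizer with an `encode` method (the function and parameter names here are illustrative, not from any repo):

```python
# Sketch: encode a very long document in fixed-size character chunks
# instead of passing the whole string to the tokenizer at once.
# `tokenizer` stands in for any object exposing encode(str) -> list[int].

def encode_in_chunks(tokenizer, text, chunk_chars=100_000):
    ids = []
    for start in range(0, len(text), chunk_chars):
        # Encode one slice at a time and concatenate the token ids.
        ids.extend(tokenizer.encode(text[start:start + chunk_chars]))
    return ids
```

One caveat with this approach: cutting at arbitrary character boundaries can split a token that spans the boundary, so the resulting token stream may differ slightly from encoding the whole document in one call (splitting at newlines or document-internal separators reduces that risk).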
Wrong repo? This is TinyLlama, not Megatron-LM. Besides, `tokenizer.encode` is most likely an HF method, so you would have to look at the HF repos instead.
If possible, you could try tiktoken, which is supposedly around 3x faster.