I found that some documents in the pre-training datasets contain a very large number of characters, which makes them very slow to encode. For example, one document with 15955671 characters takes 6.6 hours to encode.
How do you speed this up? Should I split the document into many sub-docs? I'm using Megatron to pre-train, so any ideas would be appreciated.
Looking forward to hearing from you when you have time. Thank you very much.
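A rough sketch of the split-into-sub-docs idea mentioned above, assuming an HF-style tokenizer with an `encode` method (the function and parameter names here are illustrative, not from any repo):

```python
# Sketch: encode a very long document in fixed-size character chunks
# instead of passing the whole string to the tokenizer at once.
# `tokenizer` stands in for any object exposing encode(str) -> list[int].

def encode_in_chunks(tokenizer, text, chunk_chars=100_000):
    ids = []
    for start in range(0, len(text), chunk_chars):
        # Encode one slice at a time and concatenate the token ids.
        ids.extend(tokenizer.encode(text[start:start + chunk_chars]))
    return ids
```

One caveat with this approach: cutting at arbitrary character boundaries can split a token that spans the boundary, so the resulting token stream may differ slightly from encoding the whole document in one call (splitting at newlines or document-internal separators reduces that risk).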
Wrong repo? This is TinyLlama, not Megatron-LM. Besides, `tokenizer.encode` is most likely an HF method, so you would have to look at the HF repos instead.
If possible, you could try tiktoken, which is supposedly around 3x faster.