princeton-nlp / CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
https://arxiv.org/abs/2402.16617
MIT License

Training data processing is slow #4

Open yunlongia opened 1 month ago

yunlongia commented 1 month ago

Hello, I'm processing the RedPajama data and it's unacceptably slow, especially for the book domain. Any suggestions? Alternatively, could you share a copy of your processed training data? Thanks a lot!

howard-yen commented 1 month ago

Hi, can you please share some details on what step is giving you trouble?

If the tokenization is the slow step, I would recommend using the SentencePiece tokenizer instead of the Transformers tokenizer (I talk about it here). In my experience, the SentencePiece tokenizer is much faster on long sequences (which matters a lot for the books domain), whereas the Transformers tokenizer is faster on large batches of shorter sequences. Switching over to the SentencePiece tokenizer is easy: just uncomment this line. You can also shard this process across multiple processes if you have the CPU cores; to do so, change this line and specify a larger shard_size. A rough sketch of the two tokenization paths is below.
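Something like the following (just a minimal sketch; the `tokenizer.model` path and the `meta-llama/Llama-2-7b-hf` name are placeholders for whatever your setup uses):

```python
# Minimal sketch of the two tokenization paths. "tokenizer.model" and
# "meta-llama/Llama-2-7b-hf" are placeholders for your own setup.
import sentencepiece as spm
from transformers import AutoTokenizer

with open("book.txt") as f:          # e.g. one long book-domain document
    long_text = f.read()

# SentencePiece directly: tends to be much faster on a single long string.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
sp_ids = sp.encode(long_text)

# Transformers tokenizer: convenient, and fast on large batches of shorter
# sequences, but slower on one very long sequence.
hf_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
hf_ids = hf_tok(long_text, add_special_tokens=False)["input_ids"]
```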

If the sampling step is the slow one, you can try increasing the number of shards, as we discuss here (a generic sketch of sharding files across processes is below). Let me know if you need help with anything else!
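For illustration only, sharding a file list across worker processes could look something like this; this is not the repo's own script, and `process_one` plus the glob pattern are placeholders for whatever per-file work you actually run:

```python
# Generic illustration of sharding a file list across worker processes;
# process_one and the glob pattern below are placeholders.
import glob
from multiprocessing import Pool

def process_one(path):
    # Stand-in for the per-file tokenization / sampling you actually run.
    with open(path, "rb") as f:
        return path, len(f.read())

if __name__ == "__main__":
    paths = sorted(glob.glob("data/redpajama/book/*.jsonl"))  # placeholder
    num_workers = 32  # raise this if you have spare CPU cores
    with Pool(num_workers) as pool:
        for path, nbytes in pool.imap_unordered(process_one, paths):
            print(path, nbytes)
```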

I would be happy to share the training data, though it totals about 5T, which can be very slow to transfer over a network. If you are still running into issues and want me to send the training data, please email me at hyen@princeton.edu and we can figure out a way to do this :)