princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

The dtype of tokenized data should be uint32 #65

Closed ZhiYuanZeng closed 3 months ago

ZhiYuanZeng commented 5 months ago

In tokenize_single_file.py (line 61), the dtype of the data saved to the .npy file is set to uint16. However, this is not correct when the vocabulary size is larger than 65535. It is safer to set it to uint32, although that doubles the storage cost.

[Screenshot: the uint16 dtype at the referenced line of tokenize_single_file.py]
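A minimal sketch of the idea (not the repository's actual code; `token_dtype` and the file name are made up for illustration): choose the array dtype from the tokenizer's vocab size before saving, so small vocabs keep the uint16 space savings and large vocabs fall back to uint32.

```python
import numpy as np

# Hypothetical helper (not from the repo): pick the smallest unsigned dtype
# that can hold every token id the tokenizer can produce.
def token_dtype(vocab_size: int) -> np.dtype:
    # uint16 covers ids 0..65535 (enough for LLaMA-2's 32,000-token vocab);
    # anything beyond 65,536 ids needs uint32.
    if vocab_size <= np.iinfo(np.uint16).max + 1:
        return np.dtype(np.uint16)
    return np.dtype(np.uint32)

# Example: save tokenized ids with a dtype chosen from the vocab size.
token_ids = [0, 15, 31999, 70000]            # illustrative ids, not real data
vocab_size = 128_000                         # e.g. a large-vocab tokenizer
arr = np.asarray(token_ids, dtype=token_dtype(vocab_size))
np.save("tokens.npy", arr)                   # stored as uint32 here
```

Whatever dtype is written must also be used when the .npy/.bin data is read back, otherwise the ids are reinterpreted incorrectly.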
xiamengzhou commented 3 months ago

Thanks for the suggestion! We used uint16 to save space for LLaMA-2, which only has a vocab of 32,000. You are correct that uint32 is needed for larger vocabs.