karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

Loading data from disk partially #8

Open kathir-ks opened 7 months ago

kathir-ks commented 7 months ago

Training the tokenizer is memory intensive; it needs hundreds of GBs of RAM to train a tokenizer. What about using memmap to load only the required portion of the data from disk? Since the access is mostly sequential, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of some additional training time.
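
For concreteness, here is a minimal sketch of what memmap-based sequential access could look like, assuming the training text has already been written to disk as raw bytes. The file path, chunk size, and function name are hypothetical and this is not part of minbpe:

```python
# Hypothetical sketch: stream pair counts over a raw byte file via memmap,
# so only one chunk at a time is paged into RAM.
import numpy as np
from collections import Counter

def count_pairs_memmap(path, chunk_size=1 << 24):
    """Count adjacent byte pairs without loading the whole file into memory."""
    data = np.memmap(path, dtype=np.uint8, mode="r")  # lazily maps the file
    counts = Counter()
    for start in range(0, len(data), chunk_size):
        lo = max(start - 1, 0)  # back up one byte so boundary pairs are counted
        chunk = data[lo:start + chunk_size]  # only this slice gets paged in
        counts.update(zip(chunk[:-1].tolist(), chunk[1:].tolist()))
    return counts
```

The merge step would then only need the most frequent pair from `counts`, not the full dataset in memory.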

karpathy commented 7 months ago

Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.

kathir-ks commented 7 months ago

The approach would be to load a part of the txt file (depending on the available RAM), apply the merges to that part, write the merged output to another file, and then replace the earlier version with it. A sketch of one such pass follows.
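
A hedged sketch of such a merge-and-rewrite pass, assuming the partially merged token ids are stored on disk as uint32 and using a `merge()` helper in the spirit of minbpe's; the file names and parameters are placeholders, not part of the actual repo:

```python
# Hypothetical sketch: apply one BPE merge by streaming token ids from disk,
# writing the merged ids to a temporary file, then swapping the files.
import os
import numpy as np

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def merge_file(path, tmp_path, pair, idx, chunk_size=1 << 22):
    """One merge pass: read ids from `path` in chunks, write results to `tmp_path`."""
    data = np.memmap(path, dtype=np.uint32, mode="r")
    carry = []  # last token of the previous chunk, so boundary pairs can still merge
    with open(tmp_path, "wb") as f:
        for start in range(0, len(data), chunk_size):
            ids = carry + data[start:start + chunk_size].tolist()
            merged = merge(ids, pair, idx)
            is_last = start + chunk_size >= len(data)
            carry = [] if is_last else [merged.pop()]
            np.array(merged, dtype=np.uint32).tofile(f)
    del data  # drop the mapping before replacing the file
    os.replace(tmp_path, path)  # "replace the earlier version" with the merged one
```

Repeating this pass once per merge trades training time for a roughly chunk-sized memory footprint.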