karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

Sharding the dataset not completing? #25

Open dustinwloring1988 opened 2 weeks ago

dustinwloring1988 commented 2 weeks ago

Below is what I get every time I try to shard the dataset; it does not look like the last one is completing. I ran this multiple times and each time it stops in the same spot. Any ideas?

Shard 97: 100%|█████████████████████████████████████████████████▉| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|█████████████████████████████████████████████████▉| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99:  54%|██████████████████████████▉                       | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>

bombless commented 2 weeks ago

Maybe your disk is full

dustinwloring1988 commented 2 weeks ago

@bombless, I thought this too originally, so I moved it and added a local cache folder on that disk, with the same results. I still have plenty of room. I also tried it with the 100B dataset and it did the same thing on the last shard, but at a different percentage.

I have started using this dataset to train on at home and will see if there are any negative results; perhaps I will just delete that shard in case it cut off mid-sentence or something.

alexanderbowler commented 1 week ago

I don't believe this is an issue: the dataset is ~10B tokens, not exactly 10B. If you look in fineweb.py, you'll see that once the last document has been tokenized, the final shard is simply written to file even though it is not full, since we still want that last portion of data. The progress bar just isn't calibrated for this: it is written expecting 100,000,000 tokens in every shard, even though the last shard doesn't contain that much data. I can look into editing the progress bar so it is a little prettier. TL;DR: you are still properly tokenizing and using all the data from the dataset, even though the last shard doesn't say it's filled.
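For anyone wondering what this looks like in code, here is a minimal sketch (not the exact fineweb.py code; `write_shards`, `save_fn`, and `token_streams` are illustrative names) of a sharding loop with this behavior: a fixed-size buffer is flushed each time it fills, and whatever remains when the documents run out is written as a final, partial shard, while each tqdm bar is still created with `total=SHARD_SIZE`:

```python
import numpy as np
from tqdm import tqdm

SHARD_SIZE = 100_000_000  # tokens per full shard

def write_shards(token_streams, save_fn):
    """token_streams yields 1-D numpy arrays of token ids, one per document;
    save_fn(shard_index, tokens) persists one shard to disk."""
    buf = np.empty(SHARD_SIZE, dtype=np.uint16)
    count, shard_idx = 0, 0
    bar = tqdm(total=SHARD_SIZE, unit="tokens", desc=f"Shard {shard_idx}")
    for tokens in token_streams:
        while len(tokens) > 0:
            # copy as much of this document as fits into the current shard
            take = min(SHARD_SIZE - count, len(tokens))
            buf[count:count + take] = tokens[:take]
            count += take
            bar.update(take)
            tokens = tokens[take:]
            if count == SHARD_SIZE:  # shard full: flush and start a new one
                save_fn(shard_idx, buf)
                shard_idx += 1
                count = 0
                bar.close()
                bar = tqdm(total=SHARD_SIZE, unit="tokens", desc=f"Shard {shard_idx}")
    if count > 0:
        # final partial shard: written as-is, so its bar never reaches SHARD_SIZE
        save_fn(shard_idx, buf[:count])
    bar.close()
```

The last bar stalling short of 100% is exactly the `count > 0` branch at the end: the data is all there, only the bar's `total` is wrong for that shard.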

lukasugar commented 1 week ago

+1 for @alexanderbowler. Yep, the last shard is less than 100M tokens; I've gotten the same numbers:

image

zzs97str commented 4 days ago

Due to limited disk space and compute resources, I just want to get 10 shards to train on instead of using the whole dataset. To do this, I stopped the code after seeing three "downloading data 100%" messages. Looking through the cache data, I find that all the filenames are strange numbers and letters, instead of something like "shard_00000", "shard_00001". What can I do? Thanks for suggestions!!