LGTM except for one comment on the naming of files inside the tars.
Also, be careful with the zero padding: make it configurable, or compute it automatically from the length of the dataset.
Otherwise it will become an issue with bigger datasets.
File naming seems like a detail but it tends to become an issue.
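To make the padding suggestion concrete, here is a minimal sketch of computing the width from the dataset length. `pad_width` and `sample_key` are hypothetical helper names, not functions from this PR, and it assumes samples are keyed by a zero-based integer index.

```python
def pad_width(num_items: int, minimum: int = 1) -> int:
    """Digits needed to write the largest zero-based index, num_items - 1."""
    if num_items < 1:
        raise ValueError("num_items must be at least 1")
    return max(minimum, len(str(num_items - 1)))


def sample_key(index: int, num_items: int) -> str:
    """Zero-padded key, so lexicographic order matches numeric order."""
    if not 0 <= index < num_items:
        raise ValueError("index out of range")
    return f"{index:0{pad_width(num_items)}d}"


if __name__ == "__main__":
    # With 1000 samples the indices run 0..999, so three digits are enough.
    print(sample_key(7, 1000))
    # One more sample and the width grows by itself.
    print(sample_key(7, 1001))
```

The `minimum` argument keeps the width configurable for the case where the dataset is expected to grow after the first export.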
create_shard.py is wrong. We need a way to force ShardWriter to write an incomplete shard at each split boundary. Currently the train split contains parts of the val split, and the val split contains parts of the test split.
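One possible way to think about the fix, as a sketch only: chunk each split on its own, so the last shard of a split is allowed to be short and no shard ever crosses a boundary. `plan_shards` is a hypothetical helper and does not use ShardWriter itself; with the real writer, the equivalent would probably be opening a separate writer per split and closing it before starting the next one.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def plan_shards(
    splits: Dict[str, Iterable], max_per_shard: int
) -> Iterator[Tuple[str, int, List]]:
    """Yield (split_name, shard_index, samples), never mixing two splits."""
    if max_per_shard < 1:
        raise ValueError("max_per_shard must be at least 1")
    for split_name, samples in splits.items():
        it = iter(samples)
        shard_index = 0  # restart numbering for every split
        while True:
            chunk = list(islice(it, max_per_shard))
            if not chunk:
                break
            # The final chunk of a split may be shorter than max_per_shard.
            yield split_name, shard_index, chunk
            shard_index += 1


if __name__ == "__main__":
    toy = {"train": range(5), "val": range(5, 7), "test": range(7, 8)}
    for name, idx, chunk in plan_shards(toy, max_per_shard=2):
        print(name, idx, chunk)
```

Each yielded chunk would then be written to its own tar, with the split name as part of the shard filename so the splits stay separable on disk.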
By the way, you may be interested in https://rom1504.medium.com/semantic-search-at-billions-scale-95f21695689a