Dataset Format: revisions

iejMac / clip-video-encode

Easily compute clip embeddings from video frames

MIT License

Dataset Format: revisions #13

Closed iejMac closed 2 years ago

rom1504 commented 2 years ago

LGTM except for one comment on the naming of files inside the tars. I'd also advise you to be careful with the zero padding: make it configurable, or compute it automatically from the length of the dataset. Otherwise it'll become an issue with bigger datasets.

File naming seems like a detail but it tends to become an issue.
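To illustrate the padding suggestion, here is a minimal sketch of deriving the zero-padding width from the dataset length instead of hard-coding it. The helper names `pad_width` and `sample_key` are hypothetical and are not taken from this repository.

```python
def pad_width(num_samples: int) -> int:
    """Number of digits needed to write the largest zero-based index."""
    if num_samples < 1:
        raise ValueError("num_samples must be positive")
    return len(str(num_samples - 1))


def sample_key(index: int, num_samples: int) -> str:
    """Zero-padded key for one sample (hypothetical naming scheme)."""
    if not 0 <= index < num_samples:
        raise ValueError("index out of range")
    return f"{index:0{pad_width(num_samples)}d}"


if __name__ == "__main__":
    # With 100,000 samples the largest index is 99999, so keys get 5 digits.
    print(sample_key(42, 100_000))
```

Because every key has the same width, sorting file names lexicographically matches the numeric order, whatever the dataset size.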

You may be interested in https://rom1504.medium.com/semantic-search-at-billions-scale-95f21695689a btw

iejMac commented 2 years ago

create_shard.py is wrong. Need to find a way to force ShardWriter to write an incomplete shard between splits. Currently the train split contains parts of the val split, and the val split contains parts of the test split.
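A rough sketch of the intended behaviour: close the current shard at every split boundary, even if it is not full, so that no shard mixes samples from two splits. This uses the standard-library `tarfile` module as a stand-in and is not the actual create_shard.py or ShardWriter code. The function name `write_split_shards`, the `.bin` extension, and the `(key, bytes)` sample format are all assumptions made for illustration.

```python
import io
import os
import tarfile


def write_split_shards(splits, out_dir, max_per_shard):
    """Write each split into its own shards; a shard never spans two splits.

    splits: dict mapping split name -> list of (key, payload_bytes).
    Returns a dict mapping split name -> list of shard paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Ceiling division per split gives the total shard count up front,
    # so the shard-name padding can be derived instead of hard-coded.
    total = sum(-(-len(s) // max_per_shard) for s in splits.values())
    width = len(str(max(total - 1, 0)))

    shard_paths = {}
    shard_index = 0
    for split_name, samples in splits.items():
        shard_paths[split_name] = []
        # Chunking restarts for every split, so the last shard of a split
        # may be incomplete rather than being topped up from the next split.
        for start in range(0, len(samples), max_per_shard):
            path = os.path.join(out_dir, f"{shard_index:0{width}d}.tar")
            with tarfile.open(path, "w") as tar:
                for key, payload in samples[start:start + max_per_shard]:
                    info = tarfile.TarInfo(name=f"{key}.bin")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
            shard_paths[split_name].append(path)
            shard_index += 1
    return shard_paths
```

With 5 train, 3 val, and 2 test samples and `max_per_shard=2`, this writes three train shards, two val shards, and one test shard; the last train and val shards each hold a single sample.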