Closed: StarCycle closed this issue 3 months ago.
@StarCycle Wow interesting findings :O
What's the error message? 37k videos doesn't look like a big number to me. cc @lhoestq for visibility
As a workaround, you would need to encode several episodes in the same video. This might already be supported (but is untested). You would need to adjust the code snippet below to use the same filename for the grouped episodes and to have a continuous range of timestamps across the episodes (no restart to timestamp=0).
# store the reference to the video frame
ep_dict[img_key] = [
{"path": f"videos/{fname}", "timestamp": i / fps} for i in range(num_frames)
]
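A minimal sketch of what the grouped-episode variant of that snippet could look like. The image key, filename, fps, and per-episode frame counts below are illustrative assumptions, not lerobot's actual values; the point is that all episodes reference the same file and the timestamps keep increasing across episode boundaries.

```python
# Sketch: reference frames of several episodes stored in one shared video,
# keeping a continuous timestamp range across episodes (no restart to 0).
# `fname`, `fps`, `img_key`, and the frame counts are assumptions.
fps = 30
fname = "episodes_000-099.mp4"
img_key = "observation.image"
episode_frame_counts = [120, 95, 110]  # frames per episode in the shared file

frame_offset = 0
ep_dicts = []
for num_frames in episode_frame_counts:
    ep_dict = {
        img_key: [
            # same file for every episode; timestamp continues from the
            # previous episode instead of restarting at 0
            {"path": f"videos/{fname}", "timestamp": (frame_offset + i) / fps}
            for i in range(num_frames)
        ]
    }
    ep_dicts.append(ep_dict)
    frame_offset += num_frames
```

With these numbers, the first frame of the second episode would land at timestamp 120 / 30 = 4.0 seconds rather than 0.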
I would recommend working on a unit test first, to make sure this is supported ;)
Looking forward to your PR adding this unit test, if you have the time 🙏
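A hedged sketch of what such a unit test could assert: that frame references for episodes grouped in one video all point at the same file, and that their timestamps are strictly increasing with no restart to 0. The helper name and dict layout are assumptions for illustration, not part of the lerobot API.

```python
# Sketch of the suggested check (names are assumptions, not lerobot API).
def check_grouped_episodes(ep_dicts, img_key):
    """Validate frame references for episodes grouped in a single video."""
    frames = [frame for ep in ep_dicts for frame in ep[img_key]]
    # all grouped episodes must reference the same video file
    paths = {frame["path"] for frame in frames}
    assert len(paths) == 1, "grouped episodes must share one video file"
    # timestamps must keep increasing across episode boundaries
    timestamps = [frame["timestamp"] for frame in frames]
    assert all(a < b for a, b in zip(timestamps, timestamps[1:])), (
        "timestamps must be strictly increasing (no restart to 0)"
    )
```

In a real test this would run against the output of the conversion script for a small synthetic dataset.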
FYI the HF hub has a hard limit of 10k files per directory
@Cadene @lhoestq
Thanks! I am trying to upload a new dataset, which combines 100 episodes into a single mp4 file.
I will attach my conversion code in the HuggingFace repo. It converts an LMDB version of the CALVIN dataset to the lerobot format (not directly from the official dataset), so perhaps we should not merge it into lerobot. I will simply convert the ABC->D and ABCD->D datasets to the lerobot format, and I guess that's enough for CALVIN...
What I did in the LMDB version of the dataset:
If the conversion is successful, I may convert Droid to the lerobot format as well, and I will definitely propose PRs for Droid.
Best, StarCycle
@StarCycle If the number of files per directory is the limiting factor, then you can use sharding to group several videos in a common directory.
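A minimal sketch of such a sharding scheme: map each episode index to a subdirectory so that no single directory holds more than the HF Hub's 10k-files-per-directory limit. The shard size and path layout here are assumptions for illustration, not a lerobot convention.

```python
# Sketch: group videos into shard subdirectories to stay under the
# HF Hub's per-directory file limit. SHARD_SIZE and the path layout
# are assumptions, not lerobot conventions.
from pathlib import Path

SHARD_SIZE = 1000  # videos per shard directory (must be <= 10_000)

def shard_path(episode_index: int) -> Path:
    """Return the sharded path for a given episode's video."""
    shard = episode_index // SHARD_SIZE
    return Path(f"videos/shard_{shard:04d}/episode_{episode_index:06d}.mp4")
```

With SHARD_SIZE=1000, 37k videos would spread across 37 directories of at most 1000 files each, well under the 10k limit.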
The main reason is that the converted dataset contains 37k mp4 files. That number is too large for the current tool to handle...