huggingface / lerobot

🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
Apache License 2.0

Cannot upload dataset with too many episodes #291

Closed StarCycle closed 3 months ago

StarCycle commented 3 months ago

System Info

latest lerobot with standard conda environment from your readme

Reproduction

  1. Download the converted CALVIN dataset: https://drive.google.com/drive/folders/1oXsaFUXcW-8ykNMAu59jkkJVXVzcp5Nd?usp=sharing
  2. Upload it to HuggingFace. I tried (1) lerobot, (2) HfApi.upload_folder, and (3) HfApi.upload_file, but all three failed.

Expected behavior

The main reason is that the converted dataset contains 37k mp4 files. That number is too large for the current tools to handle...

Cadene commented 3 months ago

@StarCycle Wow interesting findings :O

What's the error message? 37k videos doesn't look like a big number to me. cc @lhoestq for visibility

As a workaround, you would need to encode several episodes in the same video. This might already be supported (but it is not tested). You would need to adjust the code snippet below so that the grouped episodes share the same filename and their timestamps run continuously across episodes (no reset to timestamp=0).

# store the reference to the video frame
ep_dict[img_key] = [
    {"path": f"videos/{fname}", "timestamp": i / fps} for i in range(num_frames)
]
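
A minimal sketch of that adjustment, assuming episodes are grouped in order and every frame of a group is encoded back to back in one file. The function name `grouped_frame_refs`, the `episodes_per_video` parameter, and the `group_XXXXXX.mp4` naming are hypothetical and not part of lerobot; only the `{"path", "timestamp"}` entry shape comes from the snippet above.

```python
from typing import Dict, List, Union

FrameRef = Dict[str, Union[str, float]]


def grouped_frame_refs(
    episode_lengths: List[int],
    fps: int,
    episodes_per_video: int,
) -> List[List[FrameRef]]:
    """Build per-episode frame references when several episodes share one video.

    Timestamps keep increasing across the episodes of a group and only
    restart at 0 when a new video file begins.
    """
    refs_per_episode: List[List[FrameRef]] = []
    frame_offset = 0  # frames already written to the current video file
    for ep_idx, num_frames in enumerate(episode_lengths):
        if ep_idx % episodes_per_video == 0:
            frame_offset = 0  # first episode of a new group, new file
        group_idx = ep_idx // episodes_per_video
        fname = f"group_{group_idx:06d}.mp4"  # hypothetical naming scheme
        refs_per_episode.append(
            [
                {"path": f"videos/{fname}", "timestamp": (frame_offset + i) / fps}
                for i in range(num_frames)
            ]
        )
        frame_offset += num_frames
    return refs_per_episode


# Three episodes of 3, 2 and 4 frames, two episodes per video file.
refs = grouped_frame_refs([3, 2, 4], fps=10, episodes_per_video=2)
```

Here the second episode continues in the first file at frame 3, while the third episode opens a new file and starts again at timestamp 0.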

I would recommend working on a unit test first, to make sure this is supported ;)

Looking forward to your PR adding this unit test, if you have the time 🙏

lhoestq commented 3 months ago

FYI the HF hub has a hard limit of 10k files per directory

StarCycle commented 3 months ago

@Cadene @lhoestq

Thanks! I am trying to upload a new dataset, which combines 100 episodes in a single mp4 file.

I will attach my conversion code in the HuggingFace repo. It converts an LMDB version of the CALVIN dataset to the lerobot format (not directly from the official dataset), so perhaps we should not merge it into lerobot. I will simply convert the ABC->D and ABCD->D datasets to the lerobot format, and I guess that's enough for CALVIN...

What I did in the LMDB version dataset:

If the conversion is successful, I may convert Droid to the lerobot dataset format as well, and I will definitely propose PRs for Droid.

Best, StarCycle

Cadene commented 3 months ago

@StarCycle If the number of files per directory is the limiting factor, then you can use sharding to group several videos in a common directory.
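
A rough sketch of the sharding idea, assuming the 10k files-per-directory limit mentioned above and a flat list of per-episode videos. The `shard_XXXX` directory layout, the `sharded_video_path` helper, and the default shard size are hypothetical; the dataset's video paths would have to point at the same sharded locations.

```python
import math

MAX_FILES_PER_DIR = 10_000  # limit quoted earlier in this thread


def sharded_video_path(file_index: int, files_per_shard: int = 1_000) -> str:
    """Return a path that places each video in a fixed-size shard directory."""
    if not 0 < files_per_shard <= MAX_FILES_PER_DIR:
        raise ValueError("files_per_shard must be in (0, MAX_FILES_PER_DIR]")
    shard = file_index // files_per_shard
    # Hypothetical layout: videos/shard_0012/episode_012345.mp4
    return f"videos/shard_{shard:04d}/episode_{file_index:06d}.mp4"


def num_shards(total_files: int, files_per_shard: int = 1_000) -> int:
    """Number of shard directories needed to hold total_files videos."""
    return math.ceil(total_files / files_per_shard)


example_path = sharded_video_path(12_345)
shards_for_37k = num_shards(37_000)
```

With 1,000 files per shard, 37k videos spread over 37 directories, so both the per-shard file count and the number of shard directories stay well under the limit.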

StarCycle commented 3 months ago

@Cadene Thanks! I successfully uploaded it here, combining every 100 episodes into 1 episode. I will try to use this dataset to train the net.