huggingface / lerobot

πŸ€— LeRobot: Making AI for Robotics more accessible with end-to-end learning

Cannot "load_dataset()" from huggingface repo, maybe data structure bug? #255

Closed. kandeng closed this issue 3 months ago

kandeng commented 3 months ago

System Info

- `lerobot` version: unknown
- Platform: macOS-12.2.1-arm64-arm-64bit
- Python version: 3.12.1
- Huggingface_hub version: 0.23.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (False)
- Cuda version: N/A
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no

Reproduction

  1. We wrote a simple Python program to download a dataset from the Hugging Face Hub and save it to our local disk, as follows:
from datasets import load_dataset

repo_id = "lerobot/aloha_sim_insertion_human"
print(f"\n LeRobot '{repo_id}'.\n\n")

# Download the dataset from the Hugging Face Hub.
lerobot_dataset = load_dataset(repo_id)
print(f"\n LeRobot '{repo_id}' dataset features: ")

# Save the downloaded dataset to local disk.
lerobot_dataset.save_to_disk(repo_id)
  2. When we ran this code, it raised the following error:
$ python3 load_lerobot_dataset.py

 LeRobot 'lerobot/aloha_sim_insertion_human'.

Traceback (most recent call last):
  File "/Users/dengkan/Projects/lerobot-main/datasets/load_lerobot_dataset.py", line 17, in <module>
    lerobot_dataset = load_dataset(repo_id)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/load.py", line 2587, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/load.py", line 2259, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/load.py", line 1910, in dataset_module_factory
    raise e1 from None
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/load.py", line 1892, in dataset_module_factory
    ).get_module()
      ^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/load.py", line 1237, in get_module
    dataset_infos = DatasetInfosDict.from_dataset_card_data(dataset_card_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/info.py", line 464, in from_dataset_card_data
    dataset_info = DatasetInfo._from_yaml_dict(dataset_card_data["dataset_info"])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/info.py", line 395, in _from_yaml_dict
    yaml_data["features"] = Features._from_yaml_list(yaml_data["features"])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/features/features.py", line 1910, in _from_yaml_list
    return cls.from_dict(from_yaml_inner(yaml_data))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/features/features.py", line 1750, in from_dict
    obj = generate_from_dict(dic)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/features/features.py", line 1392, in generate_from_dict
    return {key: generate_from_dict(value) for key, value in obj.items()}
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/datasets/features/features.py", line 1398, in generate_from_dict
    raise ValueError(f"Feature type '{_type}' not found. Available feature types: {list(_FEATURE_TYPES.keys())}")
ValueError: Feature type 'VideoFrame' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image']

Expected behavior

The downloaded dataset should be saved to a local file without errors.

Cadene commented 3 months ago

Try this:

In a terminal:

export HF_DATASETS_CACHE="/path/to/your/directory"

In Python:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
LeRobotDataset("lerobot/aloha_sim_insertion_human")

Then in a terminal:

ls /path/to/your/directory
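
A variant of the same idea, for reference: the cache location can also be set from inside Python, before anything from datasets or lerobot is imported. This is only a minimal sketch (the directory path is a placeholder, and it assumes lerobot is installed); the custom VideoFrame feature named in the traceback above is a LeRobot-specific feature type, which appears to be why the generic load_dataset() call fails while LeRobotDataset works.

import os

# Set the Hugging Face datasets cache directory (placeholder path) before
# importing datasets or lerobot, so the setting takes effect.
os.environ["HF_DATASETS_CACHE"] = "/path/to/your/directory"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Download (and cache) the dataset with LeRobot's own loader.
dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human")
print(dataset)
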
kandeng commented 3 months ago

Awesome, it works! @Cadene

$ python3 load_lerobot_dataset.py

 LeRobot 'lerobot/aloha_sim_insertion_human'.

Downloading readme: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 605/605 [00:00<00:00, 1.04MB/s]
Downloading data: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.08M/3.08M [00:02<00:00, 1.27MB/s]
Generating train split: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 25000/25000 [00:00<00:00, 828697.65 examples/s]
Fetching 56 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 56/56 [00:00<00:00, 11919.87it/s]
$ tree .
.
β”œβ”€β”€ _Users_dengkan_Projects_lerobot-main_datasets_lerobot___aloha_sim_insertion_human_default_0.0.0_4bb2ef91f2cc0a4ea458fd2876cf4092e4f9720b.lock
β”œβ”€β”€ aloha_sim_insertion_human.zip
β”œβ”€β”€ downloads
β”‚   β”œβ”€β”€ a53e2381f1f504dc8f402011129b9dda57584c1b23e2c6f1725bccaafb13e276
β”‚   β”œβ”€β”€ a53e2381f1f504dc8f402011129b9dda57584c1b23e2c6f1725bccaafb13e276.json
β”‚   β”œβ”€β”€ a53e2381f1f504dc8f402011129b9dda57584c1b23e2c6f1725bccaafb13e276.lock
β”‚   β”œβ”€β”€ e07d4da387d815a7a6ae1eca78f96218193e77783496b8f65ab14e4b78fcd467.df4eab1c0af39638a2e1f5bbe094f472e627e7f2196d14230568c46e02bc6af0
β”‚   β”œβ”€β”€ e07d4da387d815a7a6ae1eca78f96218193e77783496b8f65ab14e4b78fcd467.df4eab1c0af39638a2e1f5bbe094f472e627e7f2196d14230568c46e02bc6af0.json
β”‚   └── e07d4da387d815a7a6ae1eca78f96218193e77783496b8f65ab14e4b78fcd467.df4eab1c0af39638a2e1f5bbe094f472e627e7f2196d14230568c46e02bc6af0.lock
β”œβ”€β”€ lerobot___aloha_sim_insertion_human
β”‚   └── default
β”‚       └── 0.0.0
β”‚           β”œβ”€β”€ 4bb2ef91f2cc0a4ea458fd2876cf4092e4f9720b
β”‚           β”‚   β”œβ”€β”€ aloha_sim_insertion_human-train.arrow
β”‚           β”‚   └── dataset_info.json
β”‚           β”œβ”€β”€ 4bb2ef91f2cc0a4ea458fd2876cf4092e4f9720b.incomplete_info.lock
β”‚           └── 4bb2ef91f2cc0a4ea458fd2876cf4092e4f9720b_builder.lock
β”œβ”€β”€ load_lerobot_dataset.py
└── rotten_tomatoes
    β”œβ”€β”€ data-00000-of-00001.arrow
    β”œβ”€β”€ dataset_dict.json
    β”œβ”€β”€ dataset_info.json
    β”œβ”€β”€ state.json
    β”œβ”€β”€ test
    β”‚   β”œβ”€β”€ data-00000-of-00001.arrow
    β”‚   β”œβ”€β”€ dataset_info.json
    β”‚   └── state.json
    β”œβ”€β”€ train
    β”‚   β”œβ”€β”€ data-00000-of-00001.arrow
    β”‚   β”œβ”€β”€ dataset_info.json
    β”‚   └── state.json
    └── validation
        β”œβ”€β”€ data-00000-of-00001.arrow
        β”œβ”€β”€ dataset_info.json
        └── state.json

10 directories, 26 files

My code is very simple:

# load_lerobot_dataset.py

import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

repo_id = "lerobot/aloha_sim_insertion_human"
print(f"\n LeRobot '{repo_id}'.\n\n")

# Download (and cache) the dataset via LeRobot's own loader.
dataset = LeRobotDataset(repo_id)

# Serialize the dataset object to local disk.
dataset_file = "/Users/dengkan/Projects/lerobot-main/datasets/aloha_sim_insertion_human.zip"
torch.save(dataset, dataset_file)
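
For reference, a minimal sketch of reading the saved file back and inspecting one sample. This assumes the pickled LeRobotDataset object can be unpickled in the same environment (same lerobot and torch versions) and reuses the same path as above.

import torch

dataset_file = "/Users/dengkan/Projects/lerobot-main/datasets/aloha_sim_insertion_human.zip"

# torch.load unpickles the LeRobotDataset object saved with torch.save above.
dataset = torch.load(dataset_file)

# Each item is a dict of tensors (observations, actions, timestamps, ...).
print(len(dataset))
print(list(dataset[0].keys()))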
kandeng commented 3 months ago

Remi's solution works!

Many thanks for the help.