huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Convert polars DataFrame back to datasets #6984

Open ljw20180420 opened 1 week ago

ljw20180420 commented 1 week ago

Feature request

Converting a dataset to a polars DataFrame and back raises an error:

from datasets import Dataset

dsdf = Dataset.from_dict({"x": [[1, 2], [3, 4, 5]], "y": ["a", "b"]})
Dataset.from_polars(dsdf.to_polars())

ValueError: Arrow type large_list does not have a datasets dtype equivalent.

Motivation

When a dataset contains a Sequence feature, to_polars() converts it to the Arrow type large_list. However, the reverse conversion (from large_list back to Sequence) is not supported, so from_polars() fails.

Your contribution

No

lhoestq commented 3 days ago

Hi ! Thanks for reporting :)

We don't support large_list yet, though it should be added to Sequence IMO (maybe with a parameter large=True ?)