mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Feature request: loading a record set as a pandas dataframe #706

Open ogrisel opened 1 week ago

ogrisel commented 1 week ago

If the file object is a CSV, TSV or Parquet file, mlcroissant already uses pandas internally. However, I could not find any public API to fetch a record set as a pandas dataframe.

After a bit of tweaking, the closest thing I could achieve with the public API was:

import pandas as pd
import mlcroissant as mlc

dataset_url = "..."
record_set_name = "..."

dataset = mlc.Dataset(dataset_url)
df = pd.DataFrame.from_records(list(dataset.records(record_set_name)))

but it seems incredibly inefficient for many reasons:

All of those problems would vanish if there were a way to access the underlying internal pandas dataframe whenever a given record set is backed by only a single file object read by pandas.

marcenacp commented 1 week ago

@ogrisel Thanks for creating the issue! It's a great feature.

The API doesn't exist yet. I agree it could easily work for small datasets (i.e. backed by a single file) without joins.
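
For illustration only, here is a minimal sketch of what the single-file case could look like when done by hand with pandas. `read_single_file` is a hypothetical helper, not part of mlcroissant, and it assumes the file has already been downloaded locally and that its format can be inferred from the extension.

```python
from pathlib import Path

import pandas as pd


def read_single_file(path: str) -> pd.DataFrame:
    """Hypothetical helper: load one tabular file into a dataframe.

    The format is guessed from the file extension, which is an assumption
    made for this sketch; it is not how mlcroissant resolves file objects.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".tsv":
        # TSV is read as CSV with a tab separator.
        return pd.read_csv(path, sep="\t")
    if suffix == ".parquet":
        # Needs an optional Parquet engine such as pyarrow to be installed.
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Reading the file in one call like this avoids materializing every record as a Python object before building the dataframe, which is what the `from_records` workaround above does.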

ogrisel commented 1 week ago

Even for datasets with multiple record sets, it would be nice to let the user retrieve each of them as a dataframe and then use pandas to compute merges or aggregations as they see fit.
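
As a rough sketch of that workflow, assuming each record set had already been obtained as a dataframe: the two frames below are made-up stand-ins for two record sets, and the column names (`user_id`, `country`, `rating`) are hypothetical.

```python
import pandas as pd

# Made-up data standing in for two record sets of the same dataset.
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["FR", "US", "FR"]})
ratings = pd.DataFrame({"user_id": [1, 1, 2, 3], "rating": [4.0, 5.0, 3.0, 2.0]})

# Join the two record sets on their shared key.
merged = ratings.merge(users, on="user_id", how="left")

# Aggregate: mean rating per country.
mean_by_country = merged.groupby("country")["rating"].mean()
```

The join and the aggregation stay entirely in user code here, so the library would only need to hand back one dataframe per record set.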

ogrisel commented 1 week ago

I agree, though, that for record sets backed by multiple file objects this would be more challenging, or perhaps not possible, to achieve.