Feature request: loading a record set as a pandas dataframe

ogrisel commented 1 week ago

If the file object is a CSV, TSV or parquet file, mlcroissant is already using pandas in its internals. However I could not find any public API to fetch a record set as a pandas dataframe.

After a bit of tweaking, the closest thing I could achieve with the public API was:

```python
import pandas as pd
import mlcroissant as mlc

dataset_url = "..."
record_set_name = "..."

dataset = mlc.Dataset(dataset_url)
df = pd.DataFrame.from_records(list(dataset.records(record_set_name)))
```

but it seems incredibly inefficient for many reasons:

- we need to allocate a temporary list because the Records iterable has no __len__ method: this means that we allocate a lot of memory to temporarily store all those records as a list of dicts of Python objects before being able to load them efficiently into the pandas dataframe;
- the records iterable generates many temporary Python scalar objects (str, int, float, ...) that are garbage collected once they have been consumed by pd.DataFrame.from_records: this causes a lot of unnecessary overhead via Python GC housekeeping of many small objects for no good reason;
- the intermediate Python objects in the records do not preserve the original dtype information (int32 vs int64 vs uint8... or nominal or ordinal categorical dtypes), hence the resulting dataframe might lose important side information for the downstream tasks. Some of this information (e.g. categorical dtype info) might be present in the .metadata attribute of the dataset, but it takes extra effort to retype the dataframe columns from it, and that is yet another cause of inefficiency.
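To illustrate the dtype point, here is a minimal sketch that uses pandas only. The toy dataframe and its column names are made up for the example, and `to_dict("records")` merely stands in for iterating over the records of a record set; it is not what mlcroissant does internally.

```python
import pandas as pd

# Toy frame (made-up data) with compact integer and ordered categorical dtypes.
original = pd.DataFrame(
    {
        "pixel": pd.Series([0, 128, 255], dtype="uint8"),
        "count": pd.Series([1, 2, 3], dtype="int32"),
        "size": pd.Categorical(
            ["small", "large", "medium"],
            categories=["small", "medium", "large"],
            ordered=True,
        ),
    }
)

# Mimic the workaround: materialise every row as a dict of Python objects,
# then rebuild a dataframe from that temporary list.
records = original.to_dict("records")
roundtrip = pd.DataFrame.from_records(records)

# The two dtype listings differ: the compact and categorical dtypes are gone.
print(original.dtypes)
print(roundtrip.dtypes)

# Getting them back requires knowing the dtypes and doing an extra pass.
restored = roundtrip.astype(original.dtypes.to_dict())
```

The last line only works here because the original dtypes are at hand; in the real use case they would have to be reconstructed from the dataset metadata.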

All of those problems would vanish if there were a way to access the underlying internal pandas dataframe whenever a given record set is backed by a single file object read by pandas.
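For comparison, this is roughly what reading a single file directly with pandas looks like. It is only a sketch under assumptions: the CSV content, the column names and the dtypes are invented for the example, and nothing here is an existing mlcroissant API.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a single file object.
csv_text = "age,grade\n23,low\n41,high\n35,medium\n"

# Ordered categorical dtype that would ideally come from the dataset metadata.
grade_dtype = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)

# The parser builds typed columns directly, without a list of per-row dicts.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"age": "uint8", "grade": grade_dtype},
)
print(df.dtypes)
```

The columns get their final dtypes at parse time, so no retyping pass is needed afterwards.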

marcenacp commented 1 week ago

@ogrisel Thanks for creating the issue! It's a great feature.

The API doesn't exist yet, but I agree it could easily work for small datasets (i.e. backed by a single file) without joins.

ogrisel commented 1 week ago

Even for datasets with multiple record sets, it would be nice to allow users to retrieve each of them as a dataframe and let them use pandas to compute merges or aggregations as they want.
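As a rough sketch of that workflow, assume two record sets have already been loaded as dataframes. The names `users` and `ratings` and their contents are hypothetical; the join and aggregation are plain pandas.

```python
import pandas as pd

# Hypothetical dataframes, one per record set.
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["FR", "US", "FR"]})
ratings = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "rating": [4.0, 5.0, 3.0, 2.0]}
)

# Join the two record sets on their shared key, on the user side.
joined = ratings.merge(users, on="user_id", how="left")

# Aggregate however the downstream task requires.
mean_by_country = joined.groupby("country")["rating"].mean()
print(mean_by_country)
```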

ogrisel commented 1 week ago

I agree, though, that for record sets backed by multiple file objects this would be more challenging, or even impossible, to achieve.

mlcommons / croissant

Feature request: loading a record set as a pandas dataframe #706