huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Croissant refers to incomplete parquet branch in native parquet datasets #3101

Open fylux opened 2 weeks ago

fylux commented 2 weeks ago

The Croissant file exposed by Hugging Face seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet.

IIUC, the parquet branch is not complete for datasets larger than 5GB (not exactly that, since the 5GB limit applies per split), so the branch can often be incomplete for large datasets. There are exceptions, though: in some datasets the Parquet branch seems complete.

Instead, there should be a way to retrieve a Croissant file that refers to the main, native-parquet branch. For backward compatibility it might be better to expose both Croissant files (parquet branch and main branch), although exposing only the "complete" one could also be an option.
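
For reference, the current behaviour can be checked against the Croissant metadata served by the Hub API. Below is a minimal sketch; the exact JSON-LD layout may differ, so it just searches the serialized document for the revision string:

import json
import requests

# Fetch the Croissant JSON-LD for a dataset and check whether it references
# the refs/convert/parquet revision anywhere (a rough heuristic, not a parser).
dataset = "mlfoundations/dclm-baseline-1.0-parquet"
url = f"https://huggingface.co/api/datasets/{dataset}/croissant"
croissant = requests.get(url, timeout=30).json()
blob = json.dumps(croissant)
print(any(s in blob for s in ("refs%2Fconvert%2Fparquet", "refs/convert/parquet")))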

fylux commented 2 weeks ago

@lhoestq is there anything we can contribute to fix this, or does it need to be done on the Hugging Face server side?

lhoestq commented 1 week ago

Hi ! Yes, if a dataset is already in Parquet, the Croissant file doesn't need to point to the parquet branch (which may contain incomplete data). You can maybe check https://github.com/huggingface/dataset-viewer/blob/main/services/worker/src/worker/job_runners/dataset/croissant_crumbs.py and see if you can adapt the code for this case.

fylux commented 1 week ago

My impression is that the required change goes deeper than croissant_crumbs.py.

This file assumes a dataset info is already available (containing configs and splits), and IIUC the dataset_info is retrieved from the parquet branch (or, in general, there is a clear mapping dataset_info -> parquet branch folder hierarchy). The first step would be to specify how the configs and splits are distributed across the folders of the main branch, which, unlike the parquet branch, doesn't follow a predefined structure.

For example https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet/tree/refs%2Fconvert%2Fparquet :

default/partial-train/0000.parquet
filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet
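
To see the difference between the two hierarchies, here is a quick sketch using huggingface_hub to list the files on each revision (illustrative only; listing a large repo can be slow):

from huggingface_hub import list_repo_files

dataset = "mlfoundations/dclm-baseline-1.0-parquet"
# refs/convert/parquet follows the predictable <config>/<split>/NNNN.parquet layout
parquet_branch = list_repo_files(dataset, repo_type="dataset", revision="refs/convert/parquet")
# main keeps whatever folder hierarchy the dataset author chose
main_branch = list_repo_files(dataset, repo_type="dataset", revision="main")
print(parquet_branch[:3])
print(main_branch[:3])
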
lhoestq commented 1 week ago

you can use the datasets library to list the files of a given dataset:


>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
>>> builder.config.data_files
{
    NamedSplit('train'): [
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet',
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000001_processed.parquet',
        ...
    ]
}

fylux commented 1 week ago

Thanks @lhoestq, that's exactly what I was looking for. Combined with listing the configs, it should be possible to cover the mapping config/split -> paths in the main branch:

from datasets import load_dataset_builder, get_dataset_config_names

# list the configs, then inspect how one config's data files map to splits
get_dataset_config_names("ai4bharat/sangraha")  # ['verified', 'unverified', 'synthetic']
builder = load_dataset_builder("ai4bharat/sangraha", "synthetic")
builder.config.data_files
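
Putting the two calls together, a rough sketch of building that mapping (assuming a no-script dataset where builder.config.data_files is populated):

from datasets import load_dataset_builder, get_dataset_config_names

dataset = "ai4bharat/sangraha"
mapping = {}
for config in get_dataset_config_names(dataset):
    builder = load_dataset_builder(dataset, config)
    # data_files maps each split to the resolved hf:// file paths on the main branch
    mapping[config] = {str(split): list(files) for split, files in builder.config.data_files.items()}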

One challenge we will probably face is that data_files lists every file individually (no globs), so putting each file in the Croissant could lead to huge Croissant files.
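
One possible mitigation, purely illustrative and not something the viewer does today, would be to collapse each split's file list into a single glob over the deepest directory shared by all of its files:

import os

def collapse_to_glob(files):
    # Naive sketch: take the longest common string prefix, cut it back to a
    # directory boundary, and glob for parquet files below it.
    prefix = os.path.commonprefix(files)
    common_dir = prefix.rsplit("/", 1)[0]
    return f"{common_dir}/**/*.parquet"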

lhoestq commented 6 days ago

Alternatively we can rely on the dataset-compatible-libraries job in dataset-viewer, which creates code snippets, e.g. for Dask, using a glob pattern for the parquet files. The glob pattern is computed using heuristics.

For example, for dclm it obtains this code snippet and glob:

import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/mlfoundations/dclm-baseline-1.0-parquet/**/*_train/**")

So we could just reuse the glob, which is stored along with the generated code snippet (no need to parse the code).
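
A rough sketch of reusing that stored glob, assuming the dataset-compatible-libraries output is exposed on the dataset-viewer's public API (the endpoint name and response layout below are assumptions to verify):

import requests

dataset = "mlfoundations/dclm-baseline-1.0-parquet"
# Assumed endpoint for the dataset-compatible-libraries job output; check the
# dataset-viewer OpenAPI spec before relying on it.
resp = requests.get(
    "https://datasets-server.huggingface.co/compatible-libraries",
    params={"dataset": dataset},
    timeout=30,
)
print(resp.json())  # the loading code entries should contain the computed glob pattern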