Open fylux opened 2 weeks ago
@lhoestq is there anything we can contribute to fix this, or does it need to be done on the Hugging Face server side?
Hi! Yes, if a dataset is already in Parquet, the Croissant file doesn't need to point to the parquet branch (which may contain incomplete data). You could check https://github.com/huggingface/dataset-viewer/blob/main/services/worker/src/worker/job_runners/dataset/croissant_crumbs.py and see if you can adapt the code for this case.
My impression is that the required change goes deeper than croissant_crumbs.py.
That file assumes it already has a dataset_info (containing configs and splits), and IIUC the dataset_info is retrieved from the parquet branch (or, in general, there is a clear mapping dataset_info -> parquet branch folder hierarchy). The first step would be to specify how the configs and splits are distributed across the folders of the main branch, which, unlike the parquet branch, doesn't follow a predefined structure.
For example https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet/tree/refs%2Fconvert%2Fparquet :
default/partial-train/0000.parquet
filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet
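The main-branch layout can be inspected with huggingface_hub; a minimal sketch (the directory grouping at the end is only for illustration):

from huggingface_hub import HfApi

api = HfApi()
files = api.list_repo_files(
    "mlfoundations/dclm-baseline-1.0-parquet",
    repo_type="dataset",
    revision="main",
)
# Unique directories holding parquet files on the main branch
parquet_dirs = sorted({f.rsplit("/", 1)[0] for f in files if f.endswith(".parquet")})
print(len(parquet_dirs), parquet_dirs[:3])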
You can use the datasets library to list the files of a given dataset:
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
>>> builder.config.data_files
{
    NamedSplit('train'): [
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet',
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000001_processed.parquet',
        ...
    ]
}
Thanks @lhoestq, that's exactly what I was looking for. Combined with listing the configs, it should be possible to cover the mapping config/split -> paths in the main branch:
from datasets import load_dataset_builder, get_dataset_config_names
get_dataset_config_names("ai4bharat/sangraha") # ['verified', 'unverified', 'synthetic']
builder = load_dataset_builder("ai4bharat/sangraha", "synthetic")
builder.config.data_files
One challenge we may face is that the data_files are listed individually (without globs), so listing every file could lead to huge Croissant files.
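A minimal sketch of that mapping, with a naive way of collapsing the per-file lists into directory-level globs to keep the Croissant file small (the grouping heuristic is just an illustration, not something dataset-viewer does):

from datasets import get_dataset_config_names, load_dataset_builder

dataset = "ai4bharat/sangraha"
mapping = {}  # (config, split) -> directory-level globs on the main branch
for config in get_dataset_config_names(dataset):
    builder = load_dataset_builder(dataset, config)
    for split, files in builder.config.data_files.items():
        # Naive heuristic: one glob per parent directory instead of one entry per file
        dirs = {f.rsplit("/", 1)[0] for f in files}
        mapping[(config, str(split))] = sorted(d + "/*.parquet" for d in dirs)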
Alternatively, we can rely on the dataset-compatible-libraries job in dataset-viewer, which creates code snippets (e.g. for Dask) using a glob pattern for the parquet files. The glob pattern is computed using heuristics.
For example, for dclm it obtains this code snippet and glob:
import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/mlfoundations/dclm-baseline-1.0-parquet/**/*_train/**")
So we could just reuse the glob, which is stored along with the generated code snippet (no need to parse the code).
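If we went that route, the main-branch entry in the Croissant distribution could look roughly like this (sketched as a Python dict; the @id values are hypothetical, while includes, containedIn and encodingFormat come from the Croissant FileSet vocabulary):

# Sketch only: "repo" would be a FileObject pointing at the main branch of the
# dataset repository, and the glob is the one reused from the compatible-libraries job.
main_branch_fileset = {
    "@type": "cr:FileSet",
    "@id": "parquet-files-main",               # hypothetical id
    "containedIn": {"@id": "repo"},
    "encodingFormat": "application/x-parquet",
    "includes": "**/*_train/**",
}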
The Croissant file exposed by Hugging Face seems to correspond to the parquet branch of the dataset, even when the dataset is natively in Parquet:
https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet
https://huggingface.co/datasets/ai4bharat/sangraha
https://huggingface.co/datasets/BleachNick/UltraEdit_500k
IIUC, the parquet branch is not complete for datasets >5GB (not exactly like that, since the 5GB limit is per split), but overall the branch can often be incomplete for large datasets. There are exceptions though; in this dataset the Parquet branch seems complete:
Instead, there should be a way to retrieve a Croissant file that refers to the main branch with the native Parquet files. For backward compatibility it might be better to expose both Croissant files (parquet branch and main branch), although exposing only the "complete" one could also be an option.
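For reference, the current behaviour can be checked by fetching the Croissant metadata from the Hub API; a small sketch (the exact shape of the distribution entries may differ):

import requests

# The Hub serves Croissant metadata at /api/datasets/<id>/croissant
url = "https://huggingface.co/api/datasets/mlfoundations/dclm-baseline-1.0-parquet/croissant"
croissant = requests.get(url).json()
for entry in croissant.get("distribution", []):
    print(entry.get("@type"), entry.get("contentUrl") or entry.get("includes"))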