huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Improve performance of JSON loader #6867

Closed albertvillanova closed 2 weeks ago

albertvillanova commented 3 weeks ago

As reported by @natolambert, loading regular JSON files with datasets shows poor performance.

The cause is that we use the json module from the Python standard library instead of other, faster libraries. See my old comment: https://github.com/huggingface/datasets/pull/2638#pullrequestreview-706983714

There are benchmarks comparing different JSON packages, and the standard library one is among the worst performers.
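
A rough micro-benchmark sketch (not from the issue), assuming orjson is installed and "data.json" is a local JSON file; both names are illustrative:

import json
import time

import orjson

with open("data.json", "rb") as f:
    raw = f.read()

t0 = time.perf_counter()
parsed_std = json.loads(raw)     # standard library parser
t1 = time.perf_counter()
parsed_fast = orjson.loads(raw)  # third-party parser implemented in Rust
t2 = time.perf_counter()

print(f"json.loads:   {t1 - t0:.4f} s")
print(f"orjson.loads: {t2 - t1:.4f} s")
assert parsed_std == parsed_fast  # both should produce the same Python objects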

I remember having a discussion about this, and it was decided that it was better not to include an additional dependency on a third-party library.

However, given the performance issue reported here, that decision may be worth revisiting.

natolambert commented 3 weeks ago

Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I would like to help.

albertvillanova commented 3 weeks ago

Hi @natolambert, could you please give some examples of JSON files to benchmark?

Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:

{
  "chat_template": "tulu",
  "id": [30, 34, 35,...],
  "model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
  "model_type": "Seq. Classifier",
  "results": [1, 1, 1, ...],
  "scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
  "scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
  "subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...]
  "text_chosen": ["<s>[INST] How do I detail a...",...],
  "text_rejected": ["<s>[INST] How do I detail a...",...]
}

Note that "records" orient should be a list (not a dict) with each row as one item of the list:

[
  {"chat_template": "tulu", "id": 30,... },
  {"chat_template": "tulu", "id": 34,... },
  ...
]
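
As an illustration only, a minimal sketch (an assumption, not from the issue) of converting the column-oriented file above into "records" orient with pandas, which broadcasts the scalar fields ("chat_template", "model", "model_type") across rows; the local file name is illustrative:

import json

import pandas as pd

with open("reward-model-Mistral-7B-instruct-Unified-Feedback.json") as f:
    columns = json.load(f)

# pandas repeats scalar values on every row; list-valued fields become per-row values.
df = pd.DataFrame(columns)
records = df.to_dict(orient="records")
print(records[0]["chat_template"], records[0]["id"])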

natolambert commented 2 weeks ago

We use a mix (which is a mess); here's an example with the "records" orient: https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json

There are more in that folder, maybe ~40 MB in total?

natolambert commented 2 weeks ago

@albertvillanova here's a snippet so you don't need to click

{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        0
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.076171875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        1
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.87890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        2
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.287109375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        3
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 1.6337890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        4
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 5.27734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        5
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.0625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        6
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.29296875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        7
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 6.77734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        8
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.853515625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        9
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.86328125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        10
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        11
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.70703125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        12
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.45703125
}

albertvillanova commented 2 weeks ago

Thanks again for your feedback, @natolambert.

However, strictly speaking, the last file is not in JSON format but in a JSON Lines-like format (and not properly so either, because there are multiple newline characters within each object). Not even pandas can read that file format.

For JSON Lines files, I would expect datasets and pandas to have the same performance, as both use pyarrow under the hood...
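
A minimal comparison sketch (an assumption, not from the issue), with "data.jsonl" standing in for a local JSON Lines file:

import time

import pandas as pd
from datasets import load_dataset

t0 = time.perf_counter()
df = pd.read_json("data.jsonl", lines=True)
t1 = time.perf_counter()
ds = load_dataset("json", data_files="data.jsonl", split="train")
t2 = time.perf_counter()

print(f"pandas:   {t1 - t0:.2f} s ({len(df)} rows)")
print(f"datasets: {t2 - t1:.2f} s ({ds.num_rows} rows)")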

A proper JSON file in records orient should be a list (a JSON array): the first character should be [.

Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
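
For reference, a minimal sketch (an assumption, not the actual conversion script) that splits the concatenated, pretty-printed objects into individual records and writes them out as a single JSON array in "records" orient; file names are illustrative:

import json

decoder = json.JSONDecoder()
records = []

with open("oasst-rm-2.1-pythia-1.4b-epoch-2.5.json") as f:
    text = f.read()

pos = 0
while pos < len(text):
    # Skip whitespace between objects, then decode the next complete object.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        break
    obj, pos = decoder.raw_decode(text, pos)
    records.append(obj)

with open("records.json", "w") as f:
    json.dump(records, f)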