huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Improve performance of JSON loader #6867

Closed albertvillanova closed 2 weeks ago

albertvillanova commented 3 weeks ago

As reported by @natolambert, loading regular JSON files with datasets shows poor performance.

The cause is that we use the json module from the Python standard library instead of other, faster libraries. See my old comment: https://github.com/huggingface/datasets/pull/2638#pullrequestreview-706983714

There are benchmarks comparing different JSON packages, and the standard library one is among the worst performers.
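
A rough micro-benchmark sketch (not from the issue), assuming orjson is installed and "data.json" is a local JSON file; both names are illustrative:

import json
import time

import orjson

with open("data.json", "rb") as f:
    raw = f.read()

t0 = time.perf_counter()
parsed_std = json.loads(raw)     # standard library parser
t1 = time.perf_counter()
parsed_fast = orjson.loads(raw)  # third-party parser implemented in Rust
t2 = time.perf_counter()

print(f"json.loads:   {t1 - t0:.4f} s")
print(f"orjson.loads: {t2 - t1:.4f} s")
assert parsed_std == parsed_fast  # both should produce the same Python objects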

I remember having a discussion about this, and it was decided that it was better not to include an additional dependency on a third-party library.

However, given the performance issue reported here, that decision may be worth revisiting.

natolambert commented 3 weeks ago

Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I would like to help.

albertvillanova commented 3 weeks ago

Hi @natolambert, could you please give some examples of JSON files to benchmark?

Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:

{
  "chat_template": "tulu",
  "id": [30, 34, 35,...],
  "model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
  "model_type": "Seq. Classifier",
  "results": [1, 1, 1, ...],
  "scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
  "scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
  "subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...]
  "text_chosen": ["<s>[INST] How do I detail a...",...],
  "text_rejected": ["<s>[INST] How do I detail a...",...]
}

Note that "records" orient should be a list (not a dict) with each row as one item of the list:

[
  {"chat_template": "tulu", "id": 30,... },
  {"chat_template": "tulu", "id": 34,... },
  ...
]
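
As an illustration only, a minimal sketch (an assumption, not from the issue) of converting the column-oriented file above into "records" orient with pandas, which broadcasts the scalar fields ("chat_template", "model", "model_type") across rows; the local file name is illustrative:

import json

import pandas as pd

with open("reward-model-Mistral-7B-instruct-Unified-Feedback.json") as f:
    columns = json.load(f)

# pandas repeats scalar values on every row; list-valued fields become per-row values.
df = pd.DataFrame(columns)
records = df.to_dict(orient="records")
print(records[0]["chat_template"], records[0]["id"])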

natolambert commented 2 weeks ago

We use a mix (which is a mess); here's an example with the "records" orient: https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json

There are more in that folder, maybe ~40 MB in total?

natolambert commented 2 weeks ago

@albertvillanova here's a snippet so you don't need to click

{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        0
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.076171875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        1
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.87890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        2
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.287109375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        3
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 1.6337890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        4
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 5.27734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        5
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.0625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        6
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.29296875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        7
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 6.77734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        8
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.853515625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        9
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.86328125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        10
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        11
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.70703125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        12
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.45703125
}

albertvillanova commented 2 weeks ago

Thanks again for your feedback, @natolambert.

However, strictly speaking, the last file is not in JSON format but in a JSON Lines-like format (and not properly so either, because there are multiple newline characters within each object). Not even pandas can read that file format.

For JSON Lines files, I would expect datasets and pandas to have the same performance, as both use pyarrow under the hood...
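
A minimal comparison sketch (an assumption, not from the issue), with "data.jsonl" standing in for a local JSON Lines file:

import time

import pandas as pd
from datasets import load_dataset

t0 = time.perf_counter()
df = pd.read_json("data.jsonl", lines=True)
t1 = time.perf_counter()
ds = load_dataset("json", data_files="data.jsonl", split="train")
t2 = time.perf_counter()

print(f"pandas:   {t1 - t0:.2f} s ({len(df)} rows)")
print(f"datasets: {t2 - t1:.2f} s ({ds.num_rows} rows)")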

A proper JSON file in records orient should be a list (a JSON array): the first character should be [.

Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
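
For reference, a minimal sketch (an assumption, not the actual conversion script) that splits the concatenated, pretty-printed objects into individual records and writes them out as a single JSON array in "records" orient; file names are illustrative:

import json

decoder = json.JSONDecoder()
records = []

with open("oasst-rm-2.1-pythia-1.4b-epoch-2.5.json") as f:
    text = f.read()

pos = 0
while pos < len(text):
    # Skip whitespace between objects, then decode the next complete object.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        break
    obj, pos = decoder.raw_decode(text, pos)
    records.append(obj)

with open("records.json", "w") as f:
    json.dump(records, f)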