Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I'd like to help.
Hi @natolambert, could you please give some examples of JSON files to benchmark?
Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:
{
  "chat_template": "tulu",
  "id": [30, 34, 35, ...],
  "model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
  "model_type": "Seq. Classifier",
  "results": [1, 1, 1, ...],
  "scores_chosen": [4.421875, 1.8916015625, 3.8515625, ...],
  "scores_rejected": [-2.416015625, -1.47265625, -0.9912109375, ...],
  "subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy", ...],
  "text_chosen": ["<s>[INST] How do I detail a...", ...],
  "text_rejected": ["<s>[INST] How do I detail a...", ...]
}
Note that "records" orient should be a list (not a dict) with each row as one item of the list:
[
  {"chat_template": "tulu", "id": 30, ...},
  {"chat_template": "tulu", "id": 34, ...},
  ...
]
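For completeness, here is a minimal sketch (file names are hypothetical) of how such a column-oriented file could be reshaped into records orient; scalar fields such as "chat_template" and "model" are simply repeated on every row:

```python
import json

# Hypothetical input/output paths for illustration.
with open("eval-set-scores.json") as f:
    columns = json.load(f)

n_rows = len(columns["id"])

# Build one dict per row: list-valued fields are indexed,
# scalar fields are broadcast to every row.
records = [
    {key: (value[i] if isinstance(value, list) else value)
     for key, value in columns.items()}
    for i in range(n_rows)
]

with open("eval-set-scores.records.json", "w") as f:
    json.dump(records, f)
```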
We use a mix (which is a mess); here's an example with the records orient: https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json
There are more in that folder, maybe ~40 MB?
@albertvillanova here's a snippet so you don't need to click
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
0
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.076171875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
1
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.87890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
2
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.287109375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
3
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 1.6337890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
4
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 5.27734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
5
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.0625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
6
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.29296875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
7
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 6.77734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
8
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.853515625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
9
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.86328125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
10
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
11
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.70703125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
12
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.45703125
}
Thanks again for your feedback, @natolambert.
However, strictly speaking, the last file is not in JSON format but in a kind of JSON-Lines-like format (although not proper JSON Lines either, because there are multiple newline characters within each object). Not even pandas can read that file format.
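That said, such a file of concatenated, pretty-printed objects can still be parsed with the standard library's incremental decoder. A minimal sketch (the path is hypothetical, and this is not part of any library):

```python
import json

def iter_concatenated_json(path):
    """Yield each object from a file of concatenated,
    pretty-printed JSON objects (no separators between them)."""
    decoder = json.JSONDecoder()
    with open(path) as f:
        text = f.read()
    pos = 0
    while pos < len(text):
        # raw_decode does not tolerate leading whitespace,
        # so skip it before decoding the next object.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

records = list(iter_concatenated_json("oasst-rm-2.1-pythia-1.4b-epoch-2.5.json"))
```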
Anyway, for JSON Lines files, I would expect `datasets` and `pandas` to have the same performance, as both use `pyarrow` under the hood...

A proper JSON file in records orient should be a list (a JSON array): the first character should be `[`.
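A rough way to check that expectation would be a timing sketch along these lines (the file name is assumed; absolute numbers depend on the machine, and `datasets` caches the result, so repeated runs should be timed on a fresh cache):

```python
import time

import pandas as pd
from datasets import load_dataset

path = "data.jsonl"  # hypothetical JSON-Lines file

t0 = time.perf_counter()
ds = load_dataset("json", data_files=path, split="train")
print(f"datasets: {time.perf_counter() - t0:.2f}s ({len(ds)} rows)")

t0 = time.perf_counter()
df = pd.read_json(path, lines=True)
print(f"pandas:   {time.perf_counter() - t0:.2f}s ({len(df)} rows)")
```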
Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
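The conversion itself could look like this for a well-formed JSON-Lines input (paths hypothetical; for the concatenated-object file above, the `iter_concatenated_json` sketch from earlier would replace the line-by-line parsing):

```python
import json

# Read one object per line from a valid JSON-Lines file...
with open("data.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

# ...and re-serialize as a single JSON array, i.e. records orient.
with open("data.records.json", "w") as f:
    json.dump(records, f)
```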
As reported by @natolambert, loading regular JSON files with `datasets` shows poor performance.

The cause is that we use the `json` Python standard library instead of other, faster libraries. See my old comment: https://github.com/huggingface/datasets/pull/2638#pullrequestreview-706983714

I remember having a discussion about this, and it was decided that it was better not to include an additional dependency on a third-party library.
However:

- `datasets` depends on `pandas`, and `pandas` depends on `ujson`: so we already have an indirect dependency on `ujson`
- we could add `ujson` as an optional extra dependency, and check at runtime if it is installed to decide which library to use, either `json` or `ujson`
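That runtime check could be as simple as the following sketch (this is just the idea, not the actual `datasets` implementation):

```python
# Prefer ujson when it is installed; fall back to the standard library.
try:
    import ujson as json  # optional extra dependency
except ImportError:
    import json

def load_json_file(path):
    with open(path) as f:
        return json.load(f)
```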