huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

The API returns the wrong row number #2581

Closed. yuanyehome closed this issue 8 months ago

yuanyehome commented 8 months ago

As described in this post: https://discuss.huggingface.co/t/got-wrong-row-number-of-dataset-viewer/77132, the dataset repo page shows the wrong number of rows. The numbers are correct if I download the repo, but when I follow the instructions in https://huggingface.co/docs/datasets-server/size?code=curl, the row counts in the response are also wrong. I'd like to know whether I'm using the API incorrectly or whether there is a bug in it.

yuanyehome commented 8 months ago

Update: When I call

curl "https://datasets-server.huggingface.co/statistics?dataset=${dataset_path}&config=default&split=test" \
        -X GET \
        -H "Authorization: Bearer ${API_TOKEN}"

I get the correct num_examples.

But when I call

curl "https://datasets-server.huggingface.co/size?dataset=${dataset_path}" \
        -X GET \
        -H "Authorization: Bearer ${API_TOKEN}"

The num_rows attribute gives the wrong number.
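
For anyone who wants to reproduce the comparison without reading two JSON payloads by eye, here is a minimal Python sketch. It assumes that the /size response lists per-split entries under size.splits with config, split and num_rows keys, and that the /statistics response has a top-level num_examples key; those shapes are assumptions here and should be checked against a real response. The helper names (build_url, fetch_json, rows_from_size, counts_agree) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://datasets-server.huggingface.co"


def build_url(endpoint, **params):
    """Build an endpoint URL with URL-encoded query parameters."""
    return f"{BASE_URL}/{endpoint}?{urllib.parse.urlencode(params)}"


def fetch_json(url, token):
    """GET a URL with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows_from_size(size_response, config, split):
    """Return num_rows for one split of a /size response (assumed layout)."""
    for entry in size_response["size"]["splits"]:
        if entry["config"] == config and entry["split"] == split:
            return entry["num_rows"]
    return None  # split not present in the response


def counts_agree(size_response, statistics_response, config, split):
    """True when /size and /statistics report the same row count."""
    size_rows = rows_from_size(size_response, config, split)
    return size_rows == statistics_response["num_examples"]
```

With a token and a dataset path, fetch_json(build_url("size", dataset=dataset_path), token) and fetch_json(build_url("statistics", dataset=dataset_path, config="default", split="test"), token) give the two payloads to pass to counts_agree.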

yuanyehome commented 8 months ago

Update: I've checked the API source code and recent commits, and it seems that this function was changed yesterday: https://github.com/huggingface/datasets-server/blob/476c564b218c98eb9005ecd73c81d62d2d2f2563/services/worker/src/worker/job_runners/config/parquet_and_info.py#L666-L737.

The num_examples is miscalculated. I think that maybe #L684 should be in the for-loop? (I'm not sure about this because I'm not familiar with this code base).

Could anyone help check this? @severo @polinaeterna
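
To illustrate the kind of bug suspected here, the following is a hypothetical Python sketch, not the actual code from parquet_and_info.py. It assumes a per-split counter that is initialised once before the loop over splits, so each split inherits the rows counted for the previous ones; moving the initialisation inside the loop resets it per split. The function names and the input layout (a dict mapping split names to per-file row counts) are invented for the example.

```python
def count_rows_buggy(files_per_split):
    """Counter initialised outside the loop: counts leak across splits."""
    counts = {}
    num_examples = 0  # set once, never reset between splits
    for split, file_row_counts in files_per_split.items():
        for rows in file_row_counts:
            num_examples += rows
        counts[split] = num_examples
    return counts


def count_rows_fixed(files_per_split):
    """Counter initialised inside the loop: each split starts from zero."""
    counts = {}
    for split, file_row_counts in files_per_split.items():
        num_examples = 0  # reset for every split
        for rows in file_row_counts:
            num_examples += rows
        counts[split] = num_examples
    return counts


# Hypothetical per-file row counts for two splits.
files = {"train": [100, 50], "test": [30]}
print(count_rows_buggy(files))  # {'train': 150, 'test': 180}
print(count_rows_fixed(files))  # {'train': 150, 'test': 30}
```

In this sketch the first split is always correct and only later splits are inflated, which would be consistent with a single-split endpoint returning the right number while an aggregate one does not; whether that matches the real bug is not confirmed here.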

severo commented 8 months ago

wow, well spotted, thanks!

severo commented 8 months ago

Reports:

severo commented 8 months ago

related to https://github.com/huggingface/datasets-server/pull/2564

severo commented 8 months ago

Should be fixed by https://github.com/huggingface/datasets-server/pull/2582

severo commented 8 months ago

Fixed on https://huggingface.co/datasets/ab24g21/14to18

[Screenshot 2024-03-13 at 11:56:32] [Screenshot 2024-03-13 at 11:56:53]

severo commented 8 months ago

thanks a lot @yuanyehome for the report and for finding the root cause