The API returns the wrong row number

huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.

https://huggingface.co/docs/dataset-viewer

Apache License 2.0


The API returns the wrong row number #2581

Closed yuanyehome closed 8 months ago

yuanyehome commented 8 months ago

As described in this post: https://discuss.huggingface.co/t/got-wrong-row-number-of-dataset-viewer/77132, the dataset repo page shows the wrong number of rows. The numbers are correct if I download the repo, but when I follow the instructions in https://huggingface.co/docs/datasets-server/size?code=curl, the row counts in the response are also wrong. I'd like to know whether I am using the API incorrectly or whether there is a bug in the API.

yuanyehome commented 8 months ago

Update: When I call

curl https://datasets-server.huggingface.co/statistics\?dataset\=${dataset_path}\&config\=default\&split\=test \
        -X GET \
        -H "Authorization: Bearer ${API_TOKEN}"

I get the correct number_examples.

But when I call

curl https://datasets-server.huggingface.co/size\?dataset\=${dataset_path} \
        -X GET \
        -H "Authorization: Bearer ${API_TOKEN}"

The num_rows attribute gives the wrong number.
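For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the two calls above. The helper names are made up for illustration, and the response layout is an assumption on my part: I assume /size returns per-split entries under `size.splits` with a `num_rows` field, and that /statistics exposes the example count under a top-level key (assumed to be `num_examples`, configurable via `count_key`). Please adjust the keys if the actual responses differ.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def endpoint_url(endpoint, **params):
    """Build a query URL such as BASE_URL/size?dataset=..."""
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


def fetch_json(url, token):
    """GET the URL with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def split_rows_from_size(size_response, config, split):
    """Return num_rows for one split (response layout is an assumption)."""
    for entry in size_response["size"]["splits"]:
        if entry["config"] == config and entry["split"] == split:
            return entry["num_rows"]
    raise KeyError(f"{config}/{split} not found in /size response")


def rows_match(size_response, statistics_response, config, split,
               count_key="num_examples"):
    """Compare the /size row count with the /statistics example count.

    count_key is an assumed field name; change it if the API differs.
    """
    size_rows = split_rows_from_size(size_response, config, split)
    return size_rows == statistics_response[count_key]
```

To use it, fetch `endpoint_url("size", dataset=dataset_path)` and `endpoint_url("statistics", dataset=dataset_path, config="default", split="test")` with `fetch_json`, then pass both results to `rows_match`. A return value of `False` corresponds to the mismatch described in this comment.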

yuanyehome commented 8 months ago

Update: I've checked the API source code and recent commits, and it seems that this function was changed yesterday: https://github.com/huggingface/datasets-server/blob/476c564b218c98eb9005ecd73c81d62d2d2f2563/services/worker/src/worker/job_runners/config/parquet_and_info.py#L666-L737.

The num_examples value is miscalculated. I think #L684 should perhaps be inside the for-loop? (I'm not sure about this because I'm not familiar with this codebase.)
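To illustrate the kind of mistake I suspect, here is a purely hypothetical Python sketch. It is not the code from parquet_and_info.py; the function names and the `shard_sizes` input are invented. It only shows how a running total that is updated outside a loop ends up counting just the last item, while the same statement inside the loop counts everything.

```python
def count_rows_outside_loop(shard_sizes):
    """Hypothetical bug: the total is updated once, after the loop ends."""
    total = 0
    size = 0
    for size in shard_sizes:
        pass  # per-shard work would happen here
    total += size  # runs once, so only the last shard is counted
    return total


def count_rows_inside_loop(shard_sizes):
    """Same logic with the update moved inside the loop."""
    total = 0
    for size in shard_sizes:
        total += size  # runs for every shard
    return total
```

With shards of 100, 200, and 50 rows, the first version reports 50 and the second reports 350, which is the sort of undercount that would match a wrong num_rows.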

Could anyone help check this? @severo @polinaeterna