Closed yuanyehome closed 8 months ago
Update: When I call
curl https://datasets-server.huggingface.co/statistics\?dataset\=${dataset_path}\&config\=default\&split\=test \
-X GET \
-H "Authorization: Bearer ${API_TOKEN}"
I get the correct number_examples.
But when I call
curl https://datasets-server.huggingface.co/size\?dataset\=${dataset_path} \
-X GET \
-H "Authorization: Bearer ${API_TOKEN}"
the num_rows attribute gives the wrong number.
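For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the two calls above. The response layouts are assumptions, not taken from this thread: it assumes /size returns a "size" object with a "splits" list whose entries carry "config", "split" and "num_rows", and that /statistics reports the split's row count under "num_examples". The helper names (build_url, fetch_json, split_num_rows, compare_counts) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://datasets-server.huggingface.co"


def build_url(endpoint, **params):
    """Build a datasets-server URL such as /size?dataset=..."""
    return f"{BASE_URL}/{endpoint}?{urllib.parse.urlencode(params)}"


def fetch_json(url, token):
    """GET the URL with a bearer token and parse the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def split_num_rows(size_response, config, split):
    """Return num_rows for one split of a /size response (assumed layout)."""
    for entry in size_response["size"]["splits"]:
        if entry["config"] == config and entry["split"] == split:
            return entry["num_rows"]
    raise KeyError(f"{config}/{split} not found in /size response")


def compare_counts(dataset, token, config="default", split="test"):
    """Return (row count from /size, row count from /statistics)."""
    size = fetch_json(build_url("size", dataset=dataset), token)
    stats = fetch_json(
        build_url("statistics", dataset=dataset, config=config, split=split),
        token,
    )
    # Assumption: /statistics exposes the split's row count as "num_examples".
    return split_num_rows(size, config, split), stats["num_examples"]
```

Calling compare_counts(dataset_path, API_TOKEN) should return two equal numbers when the endpoints agree; a mismatch reproduces the problem described here.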
Update: I've checked the API source code and recent commits, and it seems that this function was changed yesterday: https://github.com/huggingface/datasets-server/blob/476c564b218c98eb9005ecd73c81d62d2d2f2563/services/worker/src/worker/job_runners/config/parquet_and_info.py#L666-L737.
The num_examples value is miscalculated. I think that maybe #L684 should be inside the for-loop? (I'm not sure about this because I'm not familiar with this code base.)
Could anyone help check this? @severo @polinaeterna
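To illustrate the kind of bug suspected here, the following is a hypothetical sketch and not the actual code in parquet_and_info.py. It assumes a list of per-file row counts and shows how an accumulation line placed after a for-loop, instead of inside it, counts only the last file. The function and field names are invented for illustration.

```python
def count_examples_buggy(parquet_files):
    """Accumulation sits outside the loop, so it runs only once."""
    num_examples = 0
    for parquet_file in parquet_files:
        rows = parquet_file["num_rows"]
    num_examples += rows  # executed after the loop: only the last file counts
    return num_examples


def count_examples_fixed(parquet_files):
    """Accumulation sits inside the loop, so every file is counted."""
    num_examples = 0
    for parquet_file in parquet_files:
        num_examples += parquet_file["num_rows"]
    return num_examples


# Hypothetical per-file row counts for one split.
files = [{"num_rows": 100}, {"num_rows": 250}, {"num_rows": 50}]
print(count_examples_buggy(files))  # 50
print(count_examples_fixed(files))  # 400
```

A symptom like this would match a row count that is too low whenever a split is stored in more than one file.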
wow, well spotted, thanks!
Should be fixed by https://github.com/huggingface/datasets-server/pull/2582
thanks a lot @yuanyehome for the report and for finding the root cause
As stated in this post: https://discuss.huggingface.co/t/got-wrong-row-number-of-dataset-viewer/77132, I get the wrong row count on the dataset repo page. The numbers are correct if I download the repo, but when I follow the instructions in https://huggingface.co/docs/datasets-server/size?code=curl, the row counts in the response are also wrong. I'd like to know whether there is something wrong with my usage or whether there is a bug in the API.