NonMatchingSplitsSizesError from huggingface dataset for POJ-104

bstee615 commented 1 year ago

Hello, I get this error when I try to load your POJ-104 dataset from huggingface.

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=18878686, num_examples=32000, dataset_name='code_x_glue_cc_clone_detection_poj104'), 'recorded': SplitInfo(name='train', num_bytes=20179075, num_examples=32500, dataset_name='code_x_glue_cc_clone_detection_poj104')}, {'expected': SplitInfo(name='validation', num_bytes=5765303, num_examples=8000, dataset_name='code_x_glue_cc_clone_detection_poj104'), 'recorded': SplitInfo(name='validation', num_bytes=6382433, num_examples=8500, dataset_name='code_x_glue_cc_clone_detection_poj104')}]

As far as I can tell, the dataset metadata expects 500 fewer examples per split than the downloaded files contain (32000 vs. 32500 for train, 8000 vs. 8500 for validation). I attached a notebook which reproduces the issue:

Notebook (run with Python 3.8): test_poj104.zip
Output: test_poj104.pdf
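
For a quick sanity check of the numbers, here is a minimal sketch that recomputes the per-split difference. The dictionary is hand-copied from the error message above, and `extra_examples` is just a hypothetical helper name, not part of any library:

```python
# Split sizes copied by hand from the NonMatchingSplitsSizesError message.
splits = {
    "train": {"expected": 32000, "recorded": 32500},
    "validation": {"expected": 8000, "recorded": 8500},
}


def extra_examples(sizes):
    """Return how many more examples were recorded than expected, per split."""
    return {name: s["recorded"] - s["expected"] for name, s in sizes.items()}


surplus = extra_examples(splits)
print(surplus)  # {'train': 500, 'validation': 500}
```

Both splits are off by the same amount, which suggests the metadata is stale rather than the download being corrupted, though that is only a guess.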

Could you fix the issue so that we can load the dataset without ignore_verifications=True?
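
For reference, the workaround mentioned above looks roughly like the sketch below. It assumes the `datasets` library is installed. To my understanding, `ignore_verifications` was deprecated in `datasets` 2.9.1 in favor of `verification_mode="no_checks"`, so the keyword is chosen from the installed version; `verification_kwargs` and `load_poj104_unverified` are hypothetical helper names introduced only for this example:

```python
import re


def verification_kwargs(datasets_version):
    """Pick the keyword that disables split-size checks (assumed cutoff: 2.9.1)."""
    parts = tuple(int(p) for p in re.findall(r"\d+", datasets_version)[:3])
    if parts < (2, 9, 1):
        return {"ignore_verifications": True}
    return {"verification_mode": "no_checks"}


def load_poj104_unverified():
    """Load POJ-104 while skipping verification; needs `datasets` and network access."""
    import datasets  # imported lazily so the helper above works without it

    kwargs = verification_kwargs(datasets.__version__)
    return datasets.load_dataset("code_x_glue_cc_clone_detection_poj104", **kwargs)
```

This only hides the mismatch; the expected split sizes in the dataset metadata would still need to be corrected.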

celbree commented 1 year ago

Hi, the copies of our datasets on huggingface are not maintained by us. We recommend following the instructions in this repository for each task instead.

bstee615 commented 1 year ago

Ok, thanks for the info. Can you refer me to who maintains the huggingface datasets?

microsoft / CodeXGLUE, issue #135