microsoft / CodeXGLUE


NonMatchingSplitsSizesError from huggingface dataset for POJ-104 #135

Open bstee615 opened 1 year ago

bstee615 commented 1 year ago

Hello, I get this error when I try to load your POJ-104 dataset from huggingface.

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=18878686, num_examples=32000, dataset_name='code_x_glue_cc_clone_detection_poj104'), 'recorded': SplitInfo(name='train', num_bytes=20179075, num_examples=32500, dataset_name='code_x_glue_cc_clone_detection_poj104')}, {'expected': SplitInfo(name='validation', num_bytes=5765303, num_examples=8000, dataset_name='code_x_glue_cc_clone_detection_poj104'), 'recorded': SplitInfo(name='validation', num_bytes=6382433, num_examples=8500, dataset_name='code_x_glue_cc_clone_detection_poj104')}]

As far as I can tell, the dataset metadata expects 500 fewer examples per split than the downloaded files contain (32,000 vs. 32,500 for train, 8,000 vs. 8,500 for validation). I attached a notebook which reproduces the issue.

Could you fix the issue so that we can load the dataset without ignore_verifications=True?
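
For reference, here is a minimal sketch of the mismatch and the workaround mentioned above. The split counts are copied from the error message. The helper names `split_size_mismatches` and `load_poj104_unverified` are hypothetical and only for illustration. The loader assumes an older `datasets` release that still accepts `ignore_verifications`; newer releases may expect `verification_mode="no_checks"` instead, so check the installed version.

```python
# Split sizes copied from the NonMatchingSplitsSizesError message.
EXPECTED = {"train": 32000, "validation": 8000}
RECORDED = {"train": 32500, "validation": 8500}


def split_size_mismatches(expected, recorded):
    """Return {split: recorded - expected} for splits whose counts differ."""
    return {
        name: recorded[name] - expected[name]
        for name in expected
        if name in recorded and recorded[name] != expected[name]
    }


def load_poj104_unverified():
    """Workaround sketch: load the dataset while skipping split-size checks.

    Assumption: the installed `datasets` version accepts
    `ignore_verifications`. This call downloads data, so it needs
    network access and is not run by default.
    """
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(
        "code_x_glue_cc_clone_detection_poj104",
        ignore_verifications=True,
    )


if __name__ == "__main__":
    # Both splits contain 500 more examples than the metadata expects.
    print(split_size_mismatches(EXPECTED, RECORDED))
```

Skipping verification only hides the discrepancy; it does not tell you which 500 examples per split are unexpected, so results may not be comparable with numbers reported on the original splits.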

celbree commented 1 year ago

Hi, the copies of our datasets on huggingface are not maintained by us. We recommend following the instructions in this repository for each task instead.

bstee615 commented 1 year ago

Ok, thanks for the info. Could you point me to whoever maintains the huggingface datasets?