huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load_dataset does not load all of the data in my input file #6432

Open demongolem-biz2 opened 10 months ago

demongolem-biz2 commented 10 months ago

Describe the bug

I have 127 elements in my input dataset. When I call len on the dataset after loading it, it reports only 124 elements.

Steps to reproduce the bug

import nlp  # data_args and logger are defined elsewhere in the reporter's script

train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.VALIDATION)
logger.info(len(train_dataset))
logger.info(len(valid_dataset))

Both the train and validation inputs contain 127 items, yet both load as 124. The input files are in JSON format. Ultimately, I am trying to create .pt files.
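A quick way to narrow this down is to count the records in the raw file and compare against what load_dataset returns; if the raw count is already 124, the loader is not at fault. A minimal sketch, assuming the input is a JSON Lines file at a hypothetical path data/train.json:

import json

# Hypothetical path; replace with the actual input file.
with open("data/train.json") as f:
    records = [json.loads(line) for line in f if line.strip()]  # JSON Lines
    # records = json.load(f)  # use this instead if the file is one JSON array
print(len(records))  # should be 127 if the file really contains 127 records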

Expected behavior

All 127 elements of my dataset are reported when calling len.

Environment info

Python 3.10. CentOS operating system. nlp==0.40, datasets==2.14.5, transformers==4.26.1

mariosasko commented 9 months ago

You should use datasets.load_dataset instead of nlp.load_dataset, as the nlp package is outdated.

If switching to datasets.load_dataset doesn't fix the issue, please share the JSON file (feel free to replace the data with dummy data) so that we can reproduce the problem ourselves.
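For reference, a minimal sketch of that switch, keeping the reporter's placeholder arguments (data_args.dataset_path and data_args.qg_format come from their script); with datasets, splits can be passed as plain strings:

from datasets import load_dataset

# Same arguments as before, but through the maintained datasets package.
train_dataset = load_dataset(data_args.dataset_path, name=data_args.qg_format, split="train")
valid_dataset = load_dataset(data_args.dataset_path, name=data_args.qg_format, split="validation")
print(len(train_dataset), len(valid_dataset))  # expected: 127 each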