huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load_dataset does not load all of the data in my input file #6432

Open demongolem-biz2 opened 10 months ago

demongolem-biz2 commented 10 months ago

Describe the bug

I have 127 elements in my input dataset. When I call len on the dataset after loading it, it reports only 124 elements.

Steps to reproduce the bug

import nlp  # data_args and logger are defined elsewhere in the reporter's script

train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.VALIDATION)
logger.info(len(train_dataset))
logger.info(len(valid_dataset))

Both the train and validation inputs contain 127 items, yet both load as 124. The input files are in JSON format. Ultimately, I am trying to create .pt files.
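A quick way to narrow this down is to count the records in the raw file and compare against what load_dataset returns; if the raw count is already 124, the loader is not at fault. A minimal sketch, assuming the input is a JSON Lines file at a hypothetical path data/train.json:

import json

# Hypothetical path; replace with the actual input file.
with open("data/train.json") as f:
    records = [json.loads(line) for line in f if line.strip()]  # JSON Lines
    # records = json.load(f)  # use this instead if the file is one JSON array
print(len(records))  # should be 127 if the file really contains 127 records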

Expected behavior

All 127 elements of my dataset are reported when calling len.

Environment info

Python 3.10. CentOS operating system. nlp==0.40, datasets==2.14.5, transformers==4.26.1

mariosasko commented 9 months ago

You should use datasets.load_dataset instead of nlp.load_dataset, as the nlp package is outdated.

If switching to datasets.load_dataset doesn't fix the issue, please share the JSON file (feel free to replace the data with dummy data) so that we can reproduce the problem ourselves.
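For reference, a minimal sketch of that switch, keeping the reporter's placeholder arguments (data_args.dataset_path and data_args.qg_format come from their script); with datasets, splits can be passed as plain strings:

from datasets import load_dataset

# Same arguments as before, but through the maintained datasets package.
train_dataset = load_dataset(data_args.dataset_path, name=data_args.qg_format, split="train")
valid_dataset = load_dataset(data_args.dataset_path, name=data_args.qg_format, split="validation")
print(len(train_dataset), len(valid_dataset))  # expected: 127 each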