huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[chinese wwm] load_datasets behavior not as expected when using run_mlm_wwm.py script #3411

hyusterr opened this issue 2 years ago (status: Open)

hyusterr commented 2 years ago

Describe the bug

Model I am using (Bert, XLNet ...): bert-base-chinese

The problem arises when using: the official example script `run_mlm_wwm.py`, adapted to my own data.

The task I am working on is: pretraining with whole word masking on my own dataset and a `ref.json` file.

I followed the `run_mlm_wwm.py` procedure to do whole word masking as a pretraining task. My file is in .txt form, with one sample per line, 9,264,784 Chinese lines in total. The `ref.json` file likewise contains 9,264,784 lines of whole-word-masking reference data for my Chinese corpus. But when I adapt the `run_mlm_wwm.py` script, somehow:

- after `datasets["train"] = load_dataset(...`, `len(datasets["train"])` returns 9,265,365
- then, after `tokenized_datasets = datasets.map(...`, `len(tokenized_datasets["train"])` returns 9,265,279

I'm really confused. I tried to trace the code myself, but after a week of trying I still can't tell what happened.

I want to know what happens inside `load_dataset()` and `datasets.map()` here, and how I ended up with more lines of data than I put in. So I'm here to ask.

To reproduce

Sorry that I can't provide my data here, since it doesn't belong to me, but I'm sure I removed the blank lines. What I ran is essentially the sketch below.
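To make the mismatch concrete, here is a minimal sketch of the count check, assuming a hypothetical `corpus.txt` path and the `"text"` builder that `run_mlm_wwm.py` uses:

```python
from datasets import load_dataset

TRAIN_FILE = "corpus.txt"  # hypothetical path: one Chinese sample per line

# Count the raw lines in the input file.
with open(TRAIN_FILE, encoding="utf-8") as f:
    n_raw = sum(1 for _ in f)

# Load the same file with the "text" builder, as run_mlm_wwm.py does.
raw_datasets = load_dataset("text", data_files={"train": TRAIN_FILE})

# In my case these two counts disagree: 9,264,784 raw lines vs
# 9,265,365 rows in the loaded split.
print(n_raw, len(raw_datasets["train"]))
```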

Expected behavior

I expect the code to run as it should, but the AssertionError at line 167 keeps being raised, because the number of lines in the reference JSON and the number of rows in `datasets['train']` differ.
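For context, the failing check requires a 1:1 alignment between dataset rows and reference lines. A paraphrased sketch of that part of `run_mlm_wwm.py` (not the exact script source; the real script builds the column slightly differently):

```python
import json

def add_chinese_references(dataset, ref_file):
    # Load one whole-word-masking reference per line, skipping blank lines.
    with open(ref_file, encoding="utf-8") as f:
        refs = [json.loads(line) for line in f if line.strip()]
    # The 1:1 alignment check that keeps failing for me, because the
    # dataset has more rows than the ref file has lines.
    assert len(dataset) == len(refs)
    return dataset.add_column("chinese_ref", refs)
```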

Thanks for your patient reading!

Environment info

hyusterr commented 2 years ago

@LysandreJik I'm not sure who to tag here. Could you help?

LysandreJik commented 2 years ago

Hi @hyusterr, I believe it is @wlhgtc from https://github.com/huggingface/transformers/pull/9887