huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[chinese wwm] load_datasets behavior not as expected when using run_mlm_wwm.py script #14706

Closed hyusterr closed 2 years ago

hyusterr commented 2 years ago

Environment info

Who can help

@LysandreJik @sgugger

Information

Model I am using (Bert, XLNet ...): bert-base-chinese

The problem arises when using:

The task I am working on is: pretraining with whole word masking on my own dataset and a ref.json file.

I tried to follow the run_mlm_wwm.py procedure to do whole word masking as a pretraining task. My file is in .txt form, where one line represents one sample, with 9,264,784 Chinese lines in total. The ref.json file also contains 9,264,784 lines of whole-word-masking reference data for my Chinese corpus.

However, when I run the run_mlm_wwm.py script, after datasets["train"] = load_dataset(... the call len(datasets["train"]) returns 9,265,365. Then, after tokenized_datasets = datasets.map(... the call len(tokenized_datasets["train"]) returns 9,265,279. I am really confused and tried to trace the code myself, but after a week of trying I still could not work out what happened.

I want to know what happens inside the load_dataset() function and datasets.map here, and how I ended up with more lines of data than I put in. So I'm here to ask.

To reproduce

Sorry that I can't provide my data here, since it does not belong to me, but I am sure I removed the blank lines.

Expected behavior

I expect the script to run as it should, but the AssertionError in line 167 keeps being raised because the number of lines in the reference JSON and in datasets['train'] differ.

Thanks for your patience in reading this!

sgugger commented 2 years ago

This script is not an actively maintained example, so you should ping the original contributor for any question on it :-)

hyusterr commented 2 years ago

@julien-c

wlhgtc commented 2 years ago

@hyusterr Sorry for the late reply; I ran into the same issue. The file loaded by load_dataset() is split into lines somewhat differently from the usual way, so just run the following check:

for line in data:  # data: the lines of your raw corpus file
    # each physical line should contain exactly one logical line
    assert len(line.splitlines()) == 1

It works for me; I hope it helps.
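To catch the mismatch before training, one can also compare the two files' line counts directly, using the same splitlines()-based counting that inflates the dataset. A minimal sketch — count_logical_lines is a hypothetical helper, and the file paths are placeholders for your own corpus and reference file:

```python
def count_logical_lines(path: str) -> int:
    # Count lines the way str.splitlines() would, which is how
    # hidden control characters such as \x1d inflate the count
    # beyond the number of \n-terminated lines.
    with open(path, encoding="utf-8") as f:
        return len(f.read().splitlines())

# placeholder paths; substitute your own corpus and ref file
# assert count_logical_lines("corpus.txt") == count_logical_lines("ref.json")
```

If the two counts differ, some corpus line contains a hidden separator and the per-line alignment with ref.json is already broken before tokenization.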

hyusterr commented 2 years ago

thanks for your information! I will try it.

hyusterr commented 2 years ago

I found that it's "^]" (the ASCII group-separator control character, 0x1D) that makes load_dataset split the file into more lines than expected, e.g.

7-ELEVEN/ 提供 分享 統一 企業 董事長 羅智先 28日 表示 , 統一 超 去年 開 無 人 商店 「 X-STORE 」 , 並 不 是 要 節省 人力 成 本 , 「 而是 預防 未來 台灣 找 不 到 服務 人員 」 ; 外界 關心 統一 超 有 很多 沒有 24 小時 經營 的 門市 , 他 說 , 「 我們 有 部 分 商店 沒 24 小時 營業 的 條件 , 其實 一 天 16 小時 都 很 夠 了 ^] 」 。
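The "^]" shown above is how terminals render U+001D, the ASCII group separator. Python's str.splitlines() treats it as a line boundary, which is presumably why the loaded dataset ends up with more rows than the file has \n-terminated lines. A minimal demonstration (the sample string is a shortened stand-in for the corpus line above):

```python
# "^]" is the ASCII group separator, code point 0x1D.
sample = "一 天 16 小時 都 很 夠 了 \x1d 」 。"

# splitlines() breaks on \x1d (and \x1c, \x1e, \u2028, ...),
print(len(sample.splitlines()))  # -> 2

# while splitting on "\n" alone keeps the sample intact.
print(len(sample.split("\n")))   # -> 1
```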

Thanks for your information!

renxingkai commented 2 years ago

I also ran into the same problem. Hoping you can fix the code sometime. @wlhgtc

hyusterr commented 2 years ago

Simply using splitlines() can handle the problem.
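One way to apply that fix is a cleanup pass over the corpus that rejoins any pieces produced by hidden separators, so each sample occupies exactly one logical line again. A sketch, with clean_line as a hypothetical helper name:

```python
def clean_line(line: str) -> str:
    # Rejoin pieces split off by hidden line separators
    # (\x1c, \x1d, \x1e, \u2028, ...), leaving one logical line.
    return " ".join(part.strip() for part in line.splitlines())

print(clean_line("一 天 16 小時 都 很 夠 了 \x1d 」 。"))
```

Running this over the corpus before generating ref.json keeps the two files aligned line by line.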

renxingkai commented 2 years ago

Thanks for your information.