Closed hyusterr closed 2 years ago
This script is not an actively maintained example, so you should ping the original contributor for any question on it :-)
@julien-c
@hyusterr Sorry for the late reply; I ran into the same problem.
The way the file is loaded by load_dataset()
differs from the normal line-by-line reading, so just run the following check over your data:
for line in data:
    assert len(line.splitlines()) == 1
It works for me; hope it helps.
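The check above can be fleshed out into a small scanner over the corpus file. This is only a sketch (the function name and file handling are my own, not from the thread): it reads the file split on "\n" and flags any line that splitlines() would break further, since splitlines() also treats control characters such as \x0b, \x0c, \x1c, \x1d ("^]"), \x1e, \x85, \u2028 and \u2029 as line boundaries.

```python
def find_bad_lines(path):
    """Return 1-based line numbers (split on \\n) that splitlines()
    would break into more than one piece."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            # A clean line yields exactly one piece (or zero, if empty).
            if len(line.splitlines()) > 1:
                bad.append(lineno)
    return bad
```

Running this over the corpus should point directly at the samples that will inflate the dataset line count.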
thanks for your information! I will try it.
I found that it's the "^]" character (U+001D, GROUP SEPARATOR) that makes load_dataset split into more lines than expected, e.g.
7-ELEVEN/ 提供 分享 統一 企業 董事長 羅智先 28日 表示 , 統一 超 去年 開 無 人 商店 「 X-STORE 」 , 並 不 是 要 節省 人力 成 本 , 「 而是 預防 未來 台灣 找 不 到 服務 人員 」 ; 外界 關心 統一 超 有 很多 沒有 24 小時 經營 的 門市 , 他 說 , 「 我們 有 部 分 商店 沒 24 小時 營業 的 條件 , 其實 一 天 16 小時 都 很 夠 了 ^] 」 。
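The difference is easy to demonstrate in isolation: Python's str.splitlines() treats \x1d as a line boundary, while split("\n") does not, which matches the behaviour reported above.

```python
# "^]" is the GROUP SEPARATOR control character, U+001D.
sample = "其實 一 天 16 小時 都 很 夠 了 \x1d 」 。"

# Splitting on "\n" sees a single line ...
print(len(sample.split("\n")))   # 1
# ... but splitlines() treats \x1d as a boundary and yields two pieces.
print(len(sample.splitlines()))  # 2
```

So any sample containing such a character counts as one line in the .txt file (and in ref.json) but becomes two records after loading.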
Thanks for your information!
I also ran into the same problem. Hoping you can fix the code sometime. @wlhgtc
Simply using splitlines()
can handle the problem.
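Another option is to sanitize the corpus before loading, so the only remaining line boundary is "\n". A minimal sketch (the character set is the extra boundaries recognised by str.splitlines(); the function name is my own):

```python
# Extra line boundaries that splitlines() recognises beyond "\n".
# ("\r" is left out here; normalise Windows newlines separately if needed.)
EXTRA_BREAKS = "\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"
TRANSLATION = {ord(c): " " for c in EXTRA_BREAKS}

def sanitize(text: str) -> str:
    """Replace splitlines-only boundary characters with a space."""
    return text.translate(TRANSLATION)
```

After this pass the number of loaded samples should match the number of "\n"-terminated lines in the file, and hence the line count of ref.json.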
Thanks for your information.
Environment info
transformers version: 4.7.0
Who can help
@LysandreJik @sgugger
Information
Model I am using (Bert, XLNet ...): bert-base-chinese
The problem arises when using:
run_mlm_wwm.py
The tasks I am working on is: pretraining with whole word masking on my own dataset and a ref.json file.
I tried to follow the run_mlm_wwm.py procedure to do whole word masking as a pretraining task. My file is in .txt form, where one line represents one sample, with 9,264,784 Chinese lines in total. The ref.json file also contains 9,264,784 lines of whole word masking reference data for my Chinese corpus. But when I try to adapt the run_mlm_wwm.py script, somehow after datasets["train"] = load_dataset(...
len(datasets["train"])
returns 9,265,365.
Then, after tokenized_datasets = datasets.map(...
len(tokenized_datasets["train"])
returns 9,265,279.
I'm really confused; I tried to trace the code myself but still can't tell what happened after a week of trying. I want to know what happens in the
load_dataset()
function and datasets.map
here, and how I ended up with more lines of data than I put in. So I'm here to ask.
To reproduce
Sorry that I can't provide my data here, since it doesn't belong to me, but I'm sure I removed the blank lines.
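In lieu of the real data, the count mismatch can be reproduced synthetically with the standard library alone, assuming the loader splits records in a splitlines()-style way (which matches the symptoms above): a file whose lines contain \x1d yields more records than it has "\n"-terminated lines.

```python
# Synthetic stand-in for a corpus where one sample contains "^]" (\x1d).
corpus = "line one\nline two with \x1d inside\nline three\n"

n_newline_lines = corpus.count("\n")        # what the ref.json line count assumes
n_splitlines = len(corpus.splitlines())     # what splitlines()-style loading yields

print(n_newline_lines)  # 3
print(n_splitlines)     # 4
```

The one-record discrepancy per offending character scales up to the 581-line gap seen between the .txt file and datasets["train"].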
Expected behavior
I expect the code to run as it should, but the AssertionError at line 167 keeps being raised because the number of lines in the reference json and in datasets['train'] differ.
Thanks for your patient reading!