thu-coai / PICL

Code for ACL2023 paper: Pre-Training to Learn in Context
MIT License

About corpus processing #5

Closed SparkJiao closed 1 year ago

SparkJiao commented 1 year ago

Wonderful work and thanks very much for your contribution!

I'm running the step 3.1 of corpus processing through the following command:

bash scripts/tools/process_full_doc_data_gpt2.sh ${BASE_PATH}

I find that the process stops emitting any log output after around twenty minutes:

19709610it [19:42, 18230.21it/s]Processed 19710000 documents. 9454301 instances. (16597.94547639502 docs/s, 34.033233214932274 MB/s).
19718875it [19:43, 18330.23it/s]Processed 19720000 documents. 9459066 instances. (16598.907395423554 docs/s, 34.03512950803848 MB/s).
19728461it [19:43, 18376.47it/s]Processed 19730000 documents. 9463854 instances. (16599.64901102309 docs/s, 34.03669121565245 MB/s).
19739462it [19:44, 17707.10it/s]Processed 19740000 documents. 9468848 instances. (16600.24219375737 docs/s, 34.03864547491934 MB/s).
19748973it [19:44, 18699.86it/s]Processed 19750000 documents. 9473540 instances. (16601.189474271574 docs/s, 34.04018463198436 MB/s).
19758431it [19:45, 18651.21it/s]Processed 19760000 documents. 9478350 instances. (16602.039709078395 docs/s, 34.04187178813378 MB/s).
19769395it [19:45, 17421.72it/s]Processed 19770000 documents. 9483334 instances. (16602.404584414402 docs/s, 34.043298951159414 MB/s).
19778868it [19:46, 18743.90it/s]Processed 19780000 documents. 9487918 instances. (16603.415924520774 docs/s, 34.04454555746209 MB/s).
19788423it [19:46, 18712.33it/s]Processed 19790000 documents. 9492556 instances. (16604.34182568274 docs/s, 34.045864411408964 MB/s).
19798152it [19:47, 18368.46it/s]Processed 19800000 documents. 9497198 instances. (16605.313467538712 docs/s, 34.04731825175269 MB/s).
19809461it [19:47, 18160.84it/s]Processed 19810000 documents. 9501964 instances. (16605.904201172 docs/s, 34.0485105334241 MB/s).
19818622it [19:48, 17941.52it/s]Processed 19820000 documents. 9506844 instances. (16606.58279869479 docs/s, 34.05018079147349 MB/s).
19828137it [19:48, 18782.17it/s]Processed 19830000 documents. 9511495 instances. (16607.542067494815 docs/s, 34.05161481942658 MB/s).
19839321it [19:49, 18386.12it/s]Processed 19840000 documents. 9516243 instances. (16608.263384575705 docs/s, 34.05300482773716 MB/s).
19848463it [19:50, 18003.45it/s]Processed 19850000 documents. 9521121 instances. (16608.87504260874 docs/s, 34.05464369558365 MB/s).

The program has now been running for over 10 hours. It seems to be saving the index file. When I check the output files with ls -l, I find they are still being written:

(base) fangkai@scsehg:~/PICL/pretrain_data/full_doc/gpt2$ ls -l
total 19070032
-rw-rw-r-- 1 fangkai fangkai 19507249152 Jul 26 12:27 train_lm_0.bin
-rw-rw-r-- 1 fangkai fangkai    20447232 Jul 26 00:29 valid_lm_0.bin

BTW, the program was started last night. May I know how long the processing procedure usually takes? And would you mind releasing the processed data on Huggingface?

Thanks very much!

Best, Fangkai

t1101675 commented 1 year ago

Hi, thanks for your attention to our work!

For data processing, the process may get stuck for many possible reasons, such as running out of memory or being blocked by disk write speed.

Therefore, we usually split the large corpus file into smaller ones (e.g., < 1GB) and process them one by one, which lowers the risk of running into problems, and finally merge the .bin and .idx files. We have added the related instructions to this section and hope they are helpful.
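For anyone hitting the same issue, the splitting step might look like the sketch below: a minimal Python helper (not from the PICL repo; the function name, paths, and 1 GB cap are all assumptions) that splits a one-document-per-line corpus into size-bounded chunks, each of which can then be processed into its own .bin/.idx pair before merging.

```python
# Hypothetical sketch: split a large one-document-per-line corpus file
# into chunks of at most ~1 GB each, so each chunk can be processed
# independently and the resulting .bin/.idx files merged afterwards.
import os


def split_corpus(path, out_dir, max_bytes=1 << 30):
    """Write the lines of `path` into out_dir/part_0000.txt, part_0001.txt, ...
    starting a new part whenever the current one would exceed max_bytes.
    Returns the number of parts written."""
    os.makedirs(out_dir, exist_ok=True)
    part, written, out = 0, 0, None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            size = len(line.encode("utf-8"))
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                out = open(os.path.join(out_dir, f"part_{part:04d}.txt"),
                           "w", encoding="utf-8")
                part, written = part + 1, 0
            out.write(line)
            written += size
    if out is not None:
        out.close()
    return part
```

Documents are never split across chunk boundaries, so each part stays a valid corpus file on its own.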

By the way, it is a good idea to release the processed data in HuggingFace. We are working on it.

t1101675 commented 1 year ago

We have released the processed .bin and .idx data on HuggingFace. More information can be found in the README.

SparkJiao commented 1 year ago

Appreciate your help very much!

It seems that the HuggingFace dataset repo cannot be accessed right now. Is it still private? I didn't see any similar repo on your HF profile page.

Anyway, I managed to finish the preprocessing steps by splitting the data into smaller files. It seems that the previous problem was caused by some extremely long documents that triggered errors during tokenization. I think this may be because the first-step preprocessing merged multiple documents into a single one (the document-splitting logic, i.e., whether to split on \n\n or \n, differs between wikipedia and bookcorpus). Currently I just drop these documents.
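The "drop over-long documents" workaround can be sketched as a one-line filter applied before tokenization; this is a hypothetical helper (the function name and the character cap are assumptions, not code from the repo), since a document far beyond a sane length is almost certainly a bad merge anyway.

```python
# Hypothetical sketch: skip documents whose raw text exceeds a length cap
# before tokenization, since merged Wikipedia/BookCorpus entries can yield
# extremely long "documents" that stall or crash the tokenizer.
MAX_DOC_CHARS = 1_000_000  # assumed cap; tune for your tokenizer and memory


def iter_tokenizable(docs, max_chars=MAX_DOC_CHARS):
    """Yield documents short enough to tokenize safely; silently skip the rest."""
    for doc in docs:
        if len(doc) <= max_chars:
            yield doc
```

Logging how many documents get skipped (rather than dropping them silently) would make it easier to spot a bad merge upstream.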

Thanks very much!

t1101675 commented 1 year ago

Oh, that's my mistake. It is public now.

SparkJiao commented 1 year ago

Got it. Thanks!