SparkJiao closed this issue 1 year ago.
Hi, thanks for your attention to our work!
For data processing, the pipeline can get stuck for many possible reasons, such as running out of memory or being blocked by the disk writing speed.
Therefore, we usually split the large corpus file into smaller ones (e.g., < 1 GB each), process them one by one, which lowers the risk of running into problems, and finally merge the resulting `.bin` and `.idx` files. We have added the related instructions to this section and hope they are helpful.
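For reference, a minimal sketch of the splitting step is shown below. This is only an illustration under assumed conditions (one document per line, a ~1 GB chunk size, and hypothetical file names), not the repository's actual script; each chunk would then be preprocessed separately and the per-chunk `.bin`/`.idx` outputs merged afterwards.

```python
# Minimal sketch (not the repo's own script): split a large line-based corpus
# into roughly 1 GB chunks so each can be preprocessed independently.
# File names and the 1 GB threshold are assumptions for illustration.
CHUNK_BYTES = 1 << 30  # ~1 GB per chunk (assumed)

def split_corpus(path: str, out_prefix: str) -> list[str]:
    chunk_paths, idx, written = [], 0, 0
    out = open(f"{out_prefix}.{idx:04d}.txt", "w", encoding="utf-8")
    chunk_paths.append(out.name)
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # assumes one document (or record) per line
            if written >= CHUNK_BYTES:
                out.close()
                idx, written = idx + 1, 0
                out = open(f"{out_prefix}.{idx:04d}.txt", "w", encoding="utf-8")
                chunk_paths.append(out.name)
            out.write(line)
            written += len(line.encode("utf-8"))
    out.close()
    return chunk_paths

if __name__ == "__main__":
    for chunk in split_corpus("corpus.txt", "corpus_chunk"):
        print("preprocess this chunk separately:", chunk)
```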
By the way, it is a good idea to release the processed data on HuggingFace. We are working on it.
We have released the processed `.bin` and `.idx` data on HuggingFace. More information can be found in the README.
Appreciate your help very much!
It seems that the HuggingFace dataset repo cannot be accessed right now. Is it still private? I didn't see a similar repo on your HF profile page.
Anyway, I was able to finish the preprocessing steps by splitting the data into smaller files. It seems that the previous problem was caused by some extremely long documents that raised errors during tokenization. I think this may be because the first-step preprocessing merged multiple documents into a single one (the splitting logic, i.e., whether to split on `\n\n` or `\n`, differs between Wikipedia and BookCorpus). For now I have simply skipped these documents.
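For anyone hitting the same issue, a minimal sketch of that workaround is below. The JSONL layout (one `{"text": ...}` record per line) and the character cutoff are illustrative assumptions, not the repository's actual format.

```python
# Minimal sketch of the workaround described above: drop documents whose raw
# text is extremely long before tokenization. The 1,000,000-character cutoff
# and the JSONL layout are assumptions, not the repo's actual format.
import json

MAX_CHARS = 1_000_000  # assumed cutoff; tune to what your tokenizer tolerates

def filter_long_documents(in_path: str, out_path: str) -> None:
    kept = dropped = 0
    with open(in_path, "r", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            if len(doc.get("text", "")) > MAX_CHARS:
                dropped += 1  # skip documents that would stall tokenization
                continue
            fout.write(line)
            kept += 1
    print(f"kept {kept} documents, dropped {dropped} overly long ones")

if __name__ == "__main__":
    filter_long_documents("corpus_chunk.0000.jsonl",
                          "corpus_chunk.0000.filtered.jsonl")
```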
Thanks very much!
Oh, that's my mistake. It is public now.
Got it. Thanks!
Wonderful work and thanks very much for your contribution!
I'm running step 3.1 of the corpus processing with the following command:
I find that the process stops printing new log information after around twenty minutes:
The program has now been running for over 10 hours. It seems to be saving the index file. When I check the output file with the
`ls -l`
command, I can see that it is still being written to. BTW, the program was started last night. May I ask how long the processing procedure usually takes? And would you mind releasing the processed data on HuggingFace?
Thanks very much!
Best, Fangkai