I have been trying to load about 300 MB of text using `BertLMDataBunch.from_raw_corpus()`. After more than 10 hours at 100% CPU usage, I was still stuck with a progress bar at 100%, so I thought there was a bug.
After digging into the code, I enabled INFO logging and found where the code got stuck: tokenization of the text.
I simply decided to add a progress bar so that I would at least get some feedback during all those hours. It turned out that not only do I now have a progress bar, but the tokenization now takes only a few minutes.
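For illustration, a minimal sketch of that kind of change, assuming the tokenization happens in a simple per-line loop wrapped with `tqdm` (the function and names here are illustrative, not the actual fast-bert code):

```python
from tqdm import tqdm

def tokenize_lines(lines, tokenizer):
    # Wrapping the iterable in tqdm prints a live progress bar to stderr,
    # giving feedback during a long-running tokenization pass.
    return [tokenizer.tokenize(line) for line in tqdm(lines, desc="Tokenizing")]
```

The loop body is unchanged; `tqdm` only instruments the iteration, so the bar adds negligible overhead.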