utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0
1.85k stars, 342 forks

when tokenizing text, added a progress bar and improved speed #255

Closed godefv closed 3 years ago

godefv commented 3 years ago

I have been trying to load about 300M of text using BertLMDataBunch.from_raw_corpus(). After more than 10 hours at 100% CPU usage, the call had still not returned and the progress bar was stuck at 100%.

So I assumed there was a bug.

After digging into the code, I enabled INFO logging and found where it got stuck: tokenization of the text. I decided to simply add a progress bar, so that I would at least get some feedback during all those hours. It turned out that not only do I now have a progress bar, but tokenization now takes only a few minutes.
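For readers who want to see the general idea, here is a minimal sketch of tokenizing a large text in batches of lines with a progress bar. It is not the code from this pull request: the function name tokenize_with_progress, the lines_per_batch parameter, and the whitespace tokenizer used in the usage line are all hypothetical stand-ins, and tqdm is treated as optional. The thread does not say why the change was so much faster. One plausible explanation, stated here only as an assumption, is that tokenizing one huge string in a single call can cost more than linear time in some tokenizer implementations, whereas many small calls do not.

```python
from typing import Callable, List

try:
    from tqdm import tqdm  # optional third-party progress bar
except ImportError:
    def tqdm(iterable, **kwargs):
        # Fallback when tqdm is not installed: iterate without a progress display.
        return iterable


def tokenize_with_progress(
    text: str,
    tokenize: Callable[[str], List[str]],
    lines_per_batch: int = 1000,
) -> List[str]:
    """Tokenize `text` in batches of lines, reporting progress per batch.

    `tokenize` is any callable mapping a string to a list of tokens
    (hypothetical stand-in for a real tokenizer's tokenize method).
    """
    lines = text.splitlines()
    tokens: List[str] = []
    for start in tqdm(range(0, len(lines), lines_per_batch), desc="Tokenizing"):
        # Join a bounded number of lines so each tokenizer call stays small.
        batch = " ".join(lines[start:start + lines_per_batch])
        tokens.extend(tokenize(batch))
    return tokens


# Usage with a plain whitespace tokenizer as a stand-in:
sample_tokens = tokenize_with_progress("a b\nc d\ne", str.split, lines_per_batch=2)
```

Batching by lines keeps every call to the tokenizer bounded in size and gives the progress bar something meaningful to count. It assumes that no token spans a line break.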

godefv commented 3 years ago

I had forgotten to remove the last incomplete batch. I have fixed it.
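As an illustration only, and assuming "incomplete batch" means a trailing group of tokens shorter than the fixed block size, dropping it could look like the sketch below. The function name split_into_blocks and its parameters are hypothetical and are not taken from the fast-bert code.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split token ids into full blocks of `block_size`, dropping any shorter tail."""
    # Stopping the range at len - block_size + 1 means only start offsets
    # that leave room for a complete block are visited.
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Ten ids with a block size of four: the last two ids do not fill a block.
blocks = split_into_blocks(list(range(10)), 4)
```

Dropping the tail keeps every training example the same length, at the cost of discarding fewer than block_size tokens at the end of the corpus.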

kaushaltrivedi commented 3 years ago

merged