Closed PhilipMay closed 3 years ago
This is because Hugging Face's tokenizer internally records the maximum length that its corresponding model (not the tokenizer itself) can handle, and checks the length during tokenization. So this warning can be ignored.
As for how to avoid it, you can ask on Hugging Face's forum.
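If it helps, here is a minimal sketch using the plain `transformers` API (the `google/electra-small-discriminator` checkpoint is just an illustrative example, not necessarily what your pipeline loads) of where the warning comes from and how explicit truncation silences it:

```python
# A minimal sketch, not the repo's actual code. The warning is driven by
# tokenizer.model_max_length, which describes the model's limit, not the
# tokenizer's.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
print(tokenizer.model_max_length)  # 512 for this checkpoint

long_text = "word " * 1000

# Tokenizing without truncation triggers the length warning, but all
# tokens are still returned; nothing breaks at this stage.
ids = tokenizer(long_text)["input_ids"]

# Passing an explicit max_length with truncation avoids the warning,
# because the output is guaranteed to fit the model.
ids_truncated = tokenizer(long_text, truncation=True, max_length=128)["input_ids"]
```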
So the batches you create are no more than 512 tokens in length, right?
Yes, no more than the max length, which is 128 at the small scale.
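For illustration, a minimal sketch of why the warning is harmless in this setup: even if a raw corpus line exceeds the model's max length at tokenization time, the token ids are split into fixed-size chunks (128 at the small scale) before batching, so the model never sees an over-long sequence. The function below is a hypothetical stand-in, not the repo's actual code:

```python
# Hypothetical chunking step, assuming a small-scale max length of 128.
def chunk_token_ids(token_ids, max_len=128):
    """Split one long list of token ids into pieces of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# Every chunk fed to the model fits within the limit, so the tokenizer's
# length warning about the original long sequence can be ignored.
chunks = chunk_token_ids(list(range(1000)))
assert all(len(c) <= 128 for c in chunks)
```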
I am using ELECTRADataProcessor to tokenize my corpus for pretraining (like your example says). I am getting the following message:
My question: Can this be ignored because the tokenizer cuts off the text, or will it cause a crash during training? How can that be avoided?
Thanks again Philip