richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Sequence length too long for `ELECTRADataProcessor`. #22

Closed. PhilipMay closed this issue 3 years ago.

PhilipMay commented 3 years ago

I am using ELECTRADataProcessor to tokenize my corpus for pretraining (like your example says).

I am getting the following message:

Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors

My question: can this be ignored (because the tokenizer cuts off the text), or will it cause a crash during training? How can it be avoided?

Thanks again Philip

richarddwang commented 3 years ago

This is because Hugging Face's tokenizer internally records the maximum length that its corresponding model (not the tokenizer itself) can handle, and checks that length when tokenizing. So the warning can safely be ignored.
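For example (the `google/electra-small-generator` checkpoint here is just an illustration; any BERT-style tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
print(tokenizer.model_max_length)   # the recorded limit of the *model*, typically 512

long_text = "hello world " * 400    # long enough to exceed that limit
ids = tokenizer.encode(long_text)   # logs the warning you saw, but does not raise
print(len(ids))                     # > 512; the token ids are still returned in full
```

The warning only matters if you later feed such a full-length sequence into the model unchanged.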

As for how to avoid it, you can ask on Hugging Face's forum.
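That said, if you only want the message to disappear, here is a rough sketch of two common options (neither changes what `ELECTRADataProcessor` produces):

```python
from transformers import AutoTokenizer, logging as hf_logging

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
long_text = "hello world " * 400

# Option 1: truncate explicitly at tokenization time, so the output never
# exceeds the limit and the check never fires.
ids = tokenizer(long_text, truncation=True, max_length=128)["input_ids"]
assert len(ids) <= 128

# Option 2: keep the full output but lower the transformers logging level,
# so the warning is simply not printed.
hf_logging.set_verbosity_error()
ids = tokenizer(long_text)["input_ids"]
```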

PhilipMay commented 3 years ago

> So the warning can safely be ignored.

So the batches you create are no more than 512 tokens in length, right?

richarddwang commented 3 years ago

Yes, they are no longer than the max length, which is 128 for the small model size.
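If you want to verify this on your own processed data, something like the sketch below should do. Note that the `input_ids` column name is only an assumption for illustration; use whatever column your processed dataset actually contains.

```python
from datasets import Dataset

def check_lengths(processed: Dataset, max_length: int = 128) -> None:
    """Sanity-check that no processed example exceeds `max_length` tokens.

    `processed` stands for the dataset produced by ELECTRADataProcessor;
    the "input_ids" column name is an assumption -- adjust it if needed.
    """
    longest = max(len(ids) for ids in processed["input_ids"])
    print(f"longest example: {longest} tokens (limit: {max_length})")
    assert longest <= max_length
```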