richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Is it right that your input data is not sentence-split? #15

Closed. PhilipMay closed this issue 3 years ago

PhilipMay commented 3 years ago

As far as I can see, you do not split your input data into sentences for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached. But how do you continue after that for the next "sentence"? With the rest of the cut sentence?

Thanks Philip

richarddwang commented 3 years ago

Hi @PhilipMay

https://github.com/richarddwang/electra_pytorch/blob/f4940c73359841d231f3fac2f2f400247afcd04d/_utils/utils.py#L137-L140

Feel free to tag me if you still have some questions.
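
The linked lines are not reproduced in this thread. For readers following the question above, here is a minimal sketch of one common way to turn a document into fixed-length pretraining examples: split it into lines, accumulate tokens in a buffer, and emit an example once the buffer reaches the maximum length. It is not taken from the linked file. The function name `pack_lines`, the whitespace "tokenizer", and the choice to drop overflow tokens rather than carry them into the next example are all assumptions made for illustration.

```python
from typing import List


def pack_lines(document: str, max_len: int) -> List[List[str]]:
    """Pack the lines of one document into examples of at most max_len tokens.

    Hypothetical helper for illustration only, not the repository's code.
    """
    examples: List[List[str]] = []
    buffer: List[str] = []
    for line in document.split("\n"):
        # Whitespace splitting stands in for a real subword tokenizer.
        tokens = line.split()
        if not tokens:
            continue  # skip empty or whitespace-only lines
        buffer.extend(tokens)
        if len(buffer) >= max_len:
            # Assumption: tokens beyond max_len are dropped, so the next
            # example starts with the next line, not the cut remainder.
            examples.append(buffer[:max_len])
            buffer = []
    if buffer:
        examples.append(buffer)  # last, possibly shorter, example
    return examples


if __name__ == "__main__":
    doc = "a b c\nd e\n\nf g h i\nj"
    for example in pack_lines(doc, max_len=4):
        print(example)
```

Under these assumptions an example can span several lines of the same document, and the tail of a cut line is discarded. Carrying the remainder into the next example is an equally valid design; which one the repository uses should be checked against the linked code.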

PhilipMay commented 3 years ago

Ahh I see. Many thanks.