google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

Probable error on line 306 in `create_pretraining_data.py` for albert #256

Open wjdghks950 opened 2 years ago

wjdghks950 commented 2 years ago

https://github.com/google-research/albert/blob/932b41f0319fbef7efd069d5ff545e3358574e19/create_pretraining_data.py#L306

There appears to be an issue on line 306.

`random.randint(start, end)` is end-inclusive: the returned value can be `end` itself.

So, when `len(current_chunk) == 2`, the loop on line 309 stops after a single iteration.
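A minimal sketch of the behaviour described above, not the repository's code. It assumes the call on line 306 has the form `rng.randint(1, len(current_chunk) - 1)` and that line 309 loops over `range(a_end)`; the names `rng`, `a_end`, `tokens_a` and the sample segments are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable

# Two hypothetical segments standing in for `current_chunk`.
current_chunk = [["first", "segment"], ["second", "segment"]]

# random.randint(a, b) returns N with a <= N <= b (both ends included).
# With two segments the bounds collapse to randint(1, 1).
draws = {rng.randint(1, len(current_chunk) - 1) for _ in range(1000)}

# Assumed shape of lines 306 and 309: pick a split point, then copy
# the first `a_end` segments.
a_end = rng.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
    tokens_a.extend(current_chunk[j])

print(draws)     # {1}
print(tokens_a)  # ['first', 'segment']
```

With two segments the split point is therefore always 1, so the loop copies only the first segment.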

While this may allow the model to incorporate the single leftover chunk (if execution enters the first `elif` branch on line 339), it will leave that single chunk out of the training instances.

Please address this issue.