google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

[Implementation] Sentence order prediction (SOP) label for a single-chunk-document in create_pretraining_data.py #234

Open joongbo opened 3 years ago

joongbo commented 3 years ago

Thanks for the great work.

I have a question about a gap between what the paper reports and what the released code does for the sentence order prediction (SOP) task. As far as I can tell, the SOP code still contains a form of NSP.

Section 3.1 of the ALBERT paper says that SOP can solve NSP (next sentence prediction) to a reasonable degree (see Table 5, Section 4.6). However, while the paper says SOP uses only consecutive sentences, the released code includes a random document selection procedure.

I think the problem is how sentence_order_label is set in create_pretraining_data.py for a document with a single chunk. In lines 315-317, the code handles len(current_chunk) == 1 by randomly selecting another document and setting is_random_next = True (which means sentence_order_label = 1). In this case the label does not mark a truly reversed order of consecutive sentences (as in SOP); it marks an NSP-style random pair.
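
To make the concern concrete, here is a minimal sketch of the behavior described above. It is not the released code: the function name build_segment_pair, the other_documents argument, and the extra source field are hypothetical and exist only to show where the two kinds of negative example come from.

```python
import random


def build_segment_pair(current_chunk, other_documents, rng):
    """Hypothetical, simplified illustration of the labeling logic.

    current_chunk: list of sentences, each a list of tokens.
    other_documents: list of documents, each a list of sentences.
    Returns (tokens_a, tokens_b, sentence_order_label, source).
    """
    if len(current_chunk) == 1:
        # Only one sentence, so there is nothing to swap. Segment B is
        # drawn from a different document, yet the label is still 1.
        # This is an NSP-style negative, not a reversed-order pair.
        tokens_a = list(current_chunk[0])
        random_doc = rng.choice(other_documents)
        tokens_b = list(rng.choice(random_doc))
        return tokens_a, tokens_b, 1, "random_document"

    # Split the chunk into two consecutive segments.
    a_end = rng.randint(1, len(current_chunk) - 1)
    tokens_a = [tok for sent in current_chunk[:a_end] for tok in sent]
    tokens_b = [tok for sent in current_chunk[a_end:] for tok in sent]

    if rng.random() < 0.5:
        # True SOP negative: same sentences, order swapped.
        return tokens_b, tokens_a, 1, "swapped"
    # SOP positive: consecutive segments in their original order.
    return tokens_a, tokens_b, 0, "in_order"
```

In this sketch, sentence_order_label = 1 covers both the "swapped" and the "random_document" cases, so the model cannot distinguish a genuine order reversal from a cross-document pair.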

Am I misunderstanding something? If not, does the released code differ from the version used in the paper?

Or is this simply the intended way to handle single-chunk documents?

Thanks.