Is it right that your input data is not sentence-split?

PhilipMay commented 3 years ago

As far as I can see you do not sentence split your input data for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached. But how do you continue after that for the next "sentence"? With the rest of the cut sentence?

Thanks Philip

richarddwang commented 3 years ago

Hi @PhilipMay

https://github.com/richarddwang/electra_pytorch/blob/f4940c73359841d231f3fac2f2f400247afcd04d/_utils/utils.py#L137-L140

Every "row" in the original (raw) Hugging Face dataset has a 'text' column, which holds one document (a very long Python string).
ELECTRADataProcessor reads that long string, splits it on \n into sentences, and by default drops empty sentences.
ELECTRADataProcessor then sequentially concatenates all sentences from the same document into many "samples", in the same way the official implementation does.
In conclusion, ELECTRADataProcessor.map takes a raw Hugging Face dataset with one document per row and outputs a preprocessed Hugging Face dataset with one sample per row, where each row has the columns "input_ids", "attention_mask", ...
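
To illustrate the idea only, here is a minimal sketch of the split-then-concatenate flow described above. It is not the code from `_utils/utils.py`: the function names `split_document` and `pack_sentences` are hypothetical, whitespace splitting stands in for the real subword tokenizer, and special tokens, segment pairs, and the exact truncation rules of the official preprocessing are left out.

```python
from typing import Iterable, List


def split_document(text: str) -> List[str]:
    """Split one document on newlines and drop empty lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def pack_sentences(sentences: Iterable[str], max_tokens: int) -> List[List[str]]:
    """Greedily concatenate consecutive sentences into samples.

    Assumption for this sketch: a sample never exceeds max_tokens, and a
    single sentence longer than max_tokens is simply truncated.
    """
    samples: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        tokens = sentence.split()  # stand-in for a real tokenizer
        # Start a new sample when this sentence would overflow the current one.
        if current and len(current) + len(tokens) > max_tokens:
            samples.append(current)
            current = []
        current.extend(tokens[:max_tokens])
    # Flush whatever is left at the end of the document.
    if current:
        samples.append(current)
    return samples


if __name__ == "__main__":
    document = "a b c\n\nd e\nf g h i\n"
    for sample in pack_sentences(split_document(document), max_tokens=5):
        print(sample)
```

In this sketch, a sentence that does not fit starts the next sample whole rather than being cut in the middle. Whether the repository does exactly that is something to check in the linked lines, not something this example establishes.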

Feel free to tag me if you still have some questions.

PhilipMay commented 3 years ago

Ahh I see. Many thanks.
