Closed by PhilipMay 3 years ago
Hi @PhilipMay
ELECTRADataProcessor reads that long string, splits it on \n into sentences, and drops empty sentences by default.
ELECTRADataProcessor then sequentially concatenates all sentences from the same document into many "samples", the same way the official implementation does.
ELECTRADataProcessor.map takes a raw huggingface dataset with one document per row and outputs a preprocessed huggingface dataset with one sample per row, where each row has the columns "input_ids", "attention_mask", ....
Feel free to tag me if you still have questions.
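To make the steps above concrete, here is a minimal sketch of the idea in plain Python. It is not the actual ELECTRADataProcessor code: the function names split_sentences and pack_sentences are hypothetical, whitespace splitting stands in for the real tokenizer, and the greedy packing rule (including dropping the tail of an over-long sentence) is an assumption made for illustration.

```python
from typing import Callable, List


def split_sentences(document: str) -> List[str]:
    """Split a document on newlines and drop empty lines."""
    return [line.strip() for line in document.split("\n") if line.strip()]


def pack_sentences(
    sentences: List[str],
    max_length: int,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
) -> List[List[str]]:
    """Greedily concatenate consecutive sentences into samples of at most max_length tokens."""
    samples: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        # A single sentence longer than max_length is truncated (assumption of this sketch).
        tokens = tokenize(sentence)[:max_length]
        # Close the current sample if the next sentence would not fit.
        if current and len(current) + len(tokens) > max_length:
            samples.append(current)
            current = []
        current.extend(tokens)
    if current:
        samples.append(current)
    return samples


document = "The cat sat.\n\nIt purred loudly.\nThen it slept all day long.\n"
sentences = split_sentences(document)
samples = pack_sentences(sentences, max_length=6)
```

In this sketch a new sample always starts at a sentence boundary, so nothing from the previous sample is carried over. Whether the real processor keeps or discards the cut-off remainder is best checked in its source.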
Ahh I see. Many thanks.
As far as I can see you do not sentence split your input data for pretraining. Is that correct?
You have one document per "row" and just cut it when the sequence length of the model is reached. But how do you continue after that for the next "sentence"? With the rest of the cut sentence?
Thanks Philip