kristjanArumae closed this issue 5 years ago
Yes. We didn't keep track of PMIDs.
Just to clarify, this isn't for tracking the data. If you use create_pretraining_data.py to generate your pre-training data, the blank lines separating documents in the input file are used to generate next sentence prediction data.
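To illustrate why the document boundaries matter, here is a minimal sketch, assuming the input has one sentence per line and a blank line between documents. The names `split_documents` and `make_nsp_pairs` are hypothetical, and this is a simplification of the idea rather than the actual logic in create_pretraining_data.py (which works on token segments, not single sentences).

```python
import random


def split_documents(lines):
    """Group one-sentence-per-line text into documents.

    A blank line is treated as the document delimiter (assumed input format).
    """
    documents, current = [], []
    for line in lines:
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            # Blank line: close the current document.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_random_next) triples.

    Roughly half of the pairs use a random sentence from a *different*
    document as the negative example; the rest use the true next sentence.
    """
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if len(documents) > 1 and rng.random() < 0.5:
                # Negative example: sample from some other document.
                other_indices = [j for j in range(len(documents)) if j != doc_index]
                other_doc = documents[rng.choice(other_indices)]
                pairs.append((doc[i], rng.choice(other_doc), True))
            else:
                # Positive example: the sentence that actually follows.
                pairs.append((doc[i], doc[i + 1], False))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    with_breaks = ["a1", "a2", "", "b1", "b2", "b3"]
    without_breaks = ["a1", "a2", "b1", "b2", "b3"]
    print(len(split_documents(with_breaks)))     # number of documents found
    print(len(split_documents(without_breaks)))  # everything lands in one document
    print(make_nsp_pairs(split_documents(with_breaks), rng))
```

In this sketch, a file with no blank lines is read as a single document, so there is no other document to draw negative examples from, and sentences on either side of a real document boundary get paired as if one followed the other.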
Thank you. We didn't insert blank lines between documents.
The pre-training data provided in July appears to be sentences concatenated together from all data, with no blank lines separating documents. Was this intended?