BERT pre-training corpus document delimiters.

ncbi-nlp / BLUE_Benchmark

The BLUE benchmark consists of five biomedical text-mining tasks with ten corpora.

https://arxiv.org/abs/1906.05474

BERT pre-training corpus document delimiters. #7

kristjanArumae closed this issue 5 years ago

kristjanArumae commented 5 years ago

The pre-training data provided in July appears to be sentences concatenated together from all data, with no blank lines separating documents. Was this intended?

yfpeng commented 5 years ago

Yes. We didn't keep track of PMIDs.

kristjanArumae commented 5 years ago

Just to clarify, this isn't about tracking the data. If you use create_pretraining_data.py to generate your pre-training data, it relies on the blank lines between documents to build the next sentence prediction examples, so the document boundaries matter.
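
To illustrate the point about document separation, here is a minimal sketch of how a one-sentence-per-line corpus can be grouped into documents using blank lines as delimiters. It is a simplified illustration, not the actual code from create_pretraining_data.py; the function name `split_documents` and the sample sentences are hypothetical.

```python
def split_documents(lines):
    """Group sentence lines into documents, treating blank lines as delimiters.

    Hypothetical helper for illustration only; it mimics the general idea of
    the input format (one sentence per line, blank line between documents).
    """
    documents = [[]]
    for line in lines:
        sentence = line.strip()
        if not sentence:
            # A blank line closes the current document and starts a new one.
            if documents[-1]:
                documents.append([])
            continue
        documents[-1].append(sentence)
    # Drop a trailing empty document, if any.
    return [doc for doc in documents if doc]


# Same four sentences, with and without a blank line between documents.
with_delimiters = ["Sentence A1.", "Sentence A2.", "", "Sentence B1.", "Sentence B2."]
without_delimiters = ["Sentence A1.", "Sentence A2.", "Sentence B1.", "Sentence B2."]

print(len(split_documents(with_delimiters)))     # 2
print(len(split_documents(without_delimiters)))  # 1
```

Without blank lines, everything collapses into a single document, so there is no "other document" from which to draw a random next sentence.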

yfpeng commented 5 years ago

Thank you. We didn't insert blank lines between documents.