koursaros-ai / nboost

NBoost is a scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different platforms (e.g., Elasticsearch).
Apache License 2.0

BioBERT MS MARCO: trained on full dataset, or medical subset? #76

Closed Santosh-Gupta closed 4 years ago

Santosh-Gupta commented 4 years ago

MS MARCO has a medical subset (here: https://github.com/Georgetown-IR-Lab/covid-neural-ir/blob/master/med-msmarco-train.txt).

I was wondering whether the BioBERT version was trained on the full MS MARCO dataset or only on the medical subset.

pertschuk commented 4 years ago

I built my own medical subset before the one you linked was released:

https://github.com/koursaros-ai/MSMarco-bio

But I found that the BioBERT model trained on a large portion of the general dataset did just as well.

Santosh-Gupta commented 4 years ago

Will you be releasing the BioBERT model trained on the general dataset as well?