greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Added new stratification strategy #41

Closed danich1 closed 6 years ago

danich1 commented 6 years ago

This PR adds code for our new stratified sampling strategy. The approach ranks each disease-gene pair within groups defined by the pair's presence in Hetionet and PubMed. From this ranking, pairs can be selected to match the desired split sizes for the training, dev, and test sets. The bottleneck is updating the rows of the candidate table; it takes ~3 hrs.
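
For readers who want a concrete picture of the idea, here is a minimal sketch of a stratified split. It is not the code in this PR. The field names (`disease_id`, `gene_id`, `in_hetionet`, `in_pubmed`), the function name `stratified_split`, the 70/20/10 fractions, and the use of a seeded shuffle as the within-group ranking are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(pairs, fractions=(0.7, 0.2, 0.1), seed=100):
    """Assign each disease-gene pair to train/dev/test.

    Pairs are grouped by (in_hetionet, in_pubmed), ranked within each
    group, and cut by rank so every group keeps roughly the same
    train/dev/test proportions. The ranking here is a seeded shuffle,
    which is an assumption and may differ from the PR's ranking.
    """
    rng = random.Random(seed)

    # Build the strata: one bucket per presence combination.
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["in_hetionet"], pair["in_pubmed"])].append(pair)

    assignment = {}
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)  # position after shuffling acts as the rank
        n = len(members)
        train_end = int(n * fractions[0])
        dev_end = train_end + int(n * fractions[1])
        for rank, pair in enumerate(members):
            if rank < train_end:
                split = "train"
            elif rank < dev_end:
                split = "dev"
            else:
                split = "test"  # remainder of the group goes to test
            assignment[(pair["disease_id"], pair["gene_id"])] = split
    return assignment


# Toy input: 10 hypothetical pairs for each presence combination.
example_pairs = [
    {
        "disease_id": f"D{h}{p}{i}",
        "gene_id": f"G{i}",
        "in_hetionet": bool(h),
        "in_pubmed": bool(p),
    }
    for h in (0, 1)
    for p in (0, 1)
    for i in range(10)
]
example_splits = stratified_split(example_pairs)
```

Cutting by rank inside each group, rather than over the whole table, is what keeps the Hetionet/PubMed proportions comparable across the three sets.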