This PR adds code for our new stratified sampling strategy. The approach computes a ranking for each disease-gene pair within groups defined by the pair's presence in Hetionet and PubMed. From this ranking, pairs can be selected according to the desired split sizes for the training, dev, and test sets. The bottleneck is updating the rows of the candidate table, which takes ~3 hrs.
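For reviewers, here is a minimal sketch of the rank-then-split idea. It is not the code in this PR. The function name `assign_splits`, the field names `in_hetionet` and `in_pubmed`, the 70/20/10 fractions, and the seeded shuffle used to produce the ranking are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_splits(pairs, fractions=(0.7, 0.2, 0.1), seed=100):
    """Rank pairs within each stratum and label each with a split.

    pairs: list of dicts with hypothetical keys
           disease_id, gene_id, in_hetionet, in_pubmed.
    fractions: (train, dev, test) proportions; assumed to sum to 1.
    """
    rng = random.Random(seed)

    # One stratum per (in_hetionet, in_pubmed) combination.
    strata = defaultdict(list)
    for pair in pairs:
        strata[(pair["in_hetionet"], pair["in_pubmed"])].append(pair)

    train_frac, dev_frac, _ = fractions
    labeled = []
    for key in sorted(strata):
        members = strata[key]
        # Assumption: the ranking is a seeded random order within the stratum.
        rng.shuffle(members)
        n = len(members)
        # Integer cutoffs avoid comparing floating-point percentiles.
        train_end = round(n * train_frac)
        dev_end = round(n * (train_frac + dev_frac))
        for rank, pair in enumerate(members):
            if rank < train_end:
                split = "train"
            elif rank < dev_end:
                split = "dev"
            else:
                split = "test"
            labeled.append({**pair, "rank": rank, "split": split})
    return labeled
```

Because ranks are assigned per stratum, each split keeps roughly the same mix of Hetionet/PubMed presence as the full candidate set, and changing the split sizes only moves the cutoffs without recomputing the ranking.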