The design of your adjacency matrices in `adj_mats_orig`, together with the way you split the train/test sets, causes a serious data leakage problem in training. The validation and test sets are created independently for `gene_adj` and `gene_adj.transpose(copy=True)`, so edges in the validation/test sets of `gene_adj` are actually included in the training set of `gene_adj.transpose(copy=True)`.
The same problem applies to the train/validation split between `gene_drug_adj` and `drug_gene_adj`: validation edges from `gene_drug_adj` are used for training in `drug_gene_adj`, and vice versa.
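To make the concern concrete, here is a minimal, library-free sketch of the effect. It is not the repository's actual splitting code: the helpers `split_edges`, `reverse`, and `leaked`, the toy edge set, and the 20% validation fraction are all hypothetical and chosen only for illustration. It compares splitting the two directions independently with splitting once and mirroring the result.

```python
import random


def split_edges(edges, val_frac, rng):
    """Randomly hold out a fraction of edges; return (train, val) as sets."""
    ordered = sorted(edges)
    rng.shuffle(ordered)
    n_val = int(len(ordered) * val_frac)
    return set(ordered[n_val:]), set(ordered[:n_val])


def reverse(edges):
    """Transpose an edge set: (i, j) becomes (j, i)."""
    return {(j, i) for i, j in edges}


def leaked(train_other_direction, val):
    """Held-out edges (i, j) whose reverse (j, i) is trained on in the other direction."""
    return {(i, j) for (i, j) in val if (j, i) in train_other_direction}


# Toy gene-drug edges (hypothetical data), stored as (gene, drug) pairs.
gene_drug = {(g, d) for g in range(20) for d in range(10) if (g + d) % 3 == 0}
drug_gene = reverse(gene_drug)

# Independent splits: each direction draws its own validation set.
train_gd, val_gd = split_edges(gene_drug, 0.2, random.Random(0))
train_dg, val_dg = split_edges(drug_gene, 0.2, random.Random(1))
n_leaked_independent = len(leaked(train_dg, val_gd))

# Shared split: split one direction once, then mirror it for the other.
train_dg_shared = reverse(train_gd)
val_dg_shared = reverse(val_gd)
n_leaked_shared = len(leaked(train_dg_shared, val_gd))

print("leaked with independent splits:", n_leaked_independent)
print("leaked with a shared split:", n_leaked_shared)
```

With a shared split the leak count is zero by construction, because the mirrored training set can never contain the reverse of a held-out edge. The same reasoning would apply to `gene_adj` and its transpose if the split were done once on the undirected edges.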
Hello @hurleyLi, I had the same concern at first, but I now think it is not a big problem: what we want to predict are the edges between drug nodes, so leakage in the p-p and p-d edges doesn't matter.
Could you please clarify? Thanks!
Originally posted by @hurleyLi in https://github.com/marinkaz/decagon/issues/7#issuecomment-519645774