Open melmasri opened 4 years ago
@maxfarrell
@maxfarrell I don't think there is anything wrong with this issue. The topPairs are not in the database, see the following code, and please confirm
data(gmpd)
data(mammal_supertree)
gmpd <- gmpd[grep("sp[.]",gmpd$ParasiteCorrectedName, invert=TRUE),]
gmpd <- gmpd[grep("not identified",gmpd$ParasiteCorrectedName, invert=TRUE),]
# Subsetting host family Bovidae
gmpd <- gmpd[gmpd$HostFamily=="Bovidae",]
# Formatting host names to match phylogeny
gmpd$HostCorrectedName <- gsub(" ","_", gmpd$HostCorrectedName)
# Creating binary interaction matrix
com <- table(gmpd$HostCorrectedName, gmpd$ParasiteCorrectedName)
com[com>1] <- 1
com <- as.matrix(unclass(com), nrow=nrow(com), ncol=ncol(com))
# loading phylogeny and pruning to hosts in the interaction matrix
mammal_supertree <- drop.tip(mammal_supertree, mammal_supertree$tip.label[!mammal_supertree$tip.label%in%rownames(com)])
out <- network_est(Z = com, tree = mammal_supertree, slices = 1000, model.type = 'distance')
P <- sample_parameter(param=out$param, MODEL="distance", Z=out$Z, tree=out$tree, size = 500)
a = topPairs(P, out$Z)
a
for(i in 1:nrow(a))
print(paste(a[i,1], a[i,2], 'interaction exist', out$Z[a[i,1], a[i,2]]==1))
> > > Hosts Parasites p
9555 Nanger_granti Haemonchus contortus 0.1351618
9561 Cephalophus_nigrifrons Haemonchus contortus 0.1300700
9527 Connochaetes_gnou Haemonchus contortus 0.1258669
9542 Rupicapra_pyrenaica Haemonchus contortus 0.1255974
9536 Capra_hircus Haemonchus contortus 0.1253233
9538 Capra_nubiana Haemonchus contortus 0.1253233
9540 Capra_sibirica Haemonchus contortus 0.1253233
9558 Cephalophus_dorsalis Haemonchus contortus 0.1250568
9559 Cephalophus_silvicultor Haemonchus contortus 0.1250568
9552 Gazella_gazella Haemonchus contortus 0.1248039
> + [1] "Nanger_granti Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_nigrifrons Haemonchus contortus interaction exist FALSE"
[1] "Connochaetes_gnou Haemonchus contortus interaction exist FALSE"
[1] "Rupicapra_pyrenaica Haemonchus contortus interaction exist FALSE"
[1] "Capra_hircus Haemonchus contortus interaction exist FALSE"
[1] "Capra_nubiana Haemonchus contortus interaction exist FALSE"
[1] "Capra_sibirica Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_dorsalis Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_silvicultor Haemonchus contortus interaction exist FALSE"
[1] "Gazella_gazella Haemonchus contortus interaction exist FALSE"
>
@maxfarrell have you had the chance to look into this? or should I close it?
Here is a reproducible example - you can see the two single host parasites are given a probability of 1 in the output. The same behaviour occurs when the model is run with cv and min.per.col is greater than 1.
# simulated SHP and phylogeny model problem
set.seed(1234)
n_host <- 10
n_para <- 20
tree <- rcoal(n_host)
Z <- matrix(rbinom(n_host*n_para, 1, 0.4), nrow=n_host, ncol=n_para)
Z[,1] <- c(1,rep(0,n_host-1))
colSums(Z)
rownames(Z) <- tree$tip.label
out <- network_est(Z = Z, tree = tree, slices = 1000, model.type = 'distance')
P <- sample_parameter(param=out$param, MODEL="distance", Z=out$Z, tree=out$tree)
aux = melt(P)
aux = aux[order(aux$value,decreasing=TRUE),]
colnames(aux)<-c('Host', 'Parasite', 'p(interaction)')
head(aux)
OK I see the problem. I think it is coming from this line https://github.com/melmasri/HPprediction/blob/5444d4d23e3a146d27932c75a831c254ea0f7bb1/network_analysis.R#L290
The idea is that the probability function is 1-exp( - delta_jk)
. So for the single-host parasites delta
is NA
. When I replace it with Inf
which is the case. The probability becomes 1-exp(-Inf)=1
. If I replace it with zero it becomes 1-1 =0
. This is a choice I made, since it doesn't affect the ranking of pairs since we remove all documented interactions.
In the MCMC method, such issue is not a problem, since I do not sample probabilities for single-host parasites, since I do no use them the likelihood calculation.
Is there another way you are thinking about?
This makes sense. The model is unable to re-predict single-host parasites, and you replace these with 1s in the posterior predictions. When we are only interested in looking at the top-ranked and previously undocumented interactions, this won't make any difference. However, for model comparison and to look at the relative ranks of observed links among the predicted links, these 1s for the single-host parasites become problematic because it appears that the model has predicted them to be 1 when in fact the model did not make any predictons about them.
Instead, could we set these single-host parasites to NAs and have a warning raised, something like "The phylogeny-only model is unable to re-predict sinlge host parasites. These interactions are set to NA in the posterior predictions." Further, we could have an option for whether to set these to NA or 1 in the posterior (but use NA with the warning as default).
https://github.com/melmasri/HP-prediction/blob/5444d4d23e3a146d27932c75a831c254ea0f7bb1/network_analysis.R#L290