melmasri / HPprediction

R code for a hierarchical Bayesian model for predicting ecological interactions using evolutionary relationships
1 stars 0 forks source link

rank phylogeny model for single-host parasites #7

Open melmasri opened 4 years ago

melmasri commented 4 years ago

https://github.com/melmasri/HP-prediction/blob/5444d4d23e3a146d27932c75a831c254ea0f7bb1/network_analysis.R#L290

melmasri commented 4 years ago

@maxfarrell

melmasri commented 4 years ago

@maxfarrell I don't think there is anything wrong with this issue. The topPairs are not in the database, see the following code, and please confirm

data(gmpd)
data(mammal_supertree)

gmpd <- gmpd[grep("sp[.]",gmpd$ParasiteCorrectedName, invert=TRUE),]
gmpd <- gmpd[grep("not identified",gmpd$ParasiteCorrectedName, invert=TRUE),]

# Subsetting host family Bovidae
gmpd <- gmpd[gmpd$HostFamily=="Bovidae",]

# Formatting host names to match phylogeny
gmpd$HostCorrectedName <- gsub(" ","_", gmpd$HostCorrectedName)

# Creating binary interaction matrix
com <- table(gmpd$HostCorrectedName, gmpd$ParasiteCorrectedName)
com[com>1] <- 1
com <- as.matrix(unclass(com), nrow=nrow(com), ncol=ncol(com))

# loading phylogeny and pruning to hosts in the interaction matrix
mammal_supertree <- drop.tip(mammal_supertree, mammal_supertree$tip.label[!mammal_supertree$tip.label%in%rownames(com)])

out <- network_est(Z = com, tree = mammal_supertree, slices = 1000, model.type = 'distance')
P <- sample_parameter(param=out$param, MODEL="distance", Z=out$Z, tree=out$tree, size = 500)

a = topPairs(P, out$Z)
a
for(i in 1:nrow(a))
    print(paste(a[i,1], a[i,2], 'interaction exist', out$Z[a[i,1], a[i,2]]==1))

> > >                        Hosts            Parasites         p
9555           Nanger_granti Haemonchus contortus 0.1351618
9561  Cephalophus_nigrifrons Haemonchus contortus 0.1300700
9527       Connochaetes_gnou Haemonchus contortus 0.1258669
9542     Rupicapra_pyrenaica Haemonchus contortus 0.1255974
9536            Capra_hircus Haemonchus contortus 0.1253233
9538           Capra_nubiana Haemonchus contortus 0.1253233
9540          Capra_sibirica Haemonchus contortus 0.1253233
9558    Cephalophus_dorsalis Haemonchus contortus 0.1250568
9559 Cephalophus_silvicultor Haemonchus contortus 0.1250568
9552         Gazella_gazella Haemonchus contortus 0.1248039
> + [1] "Nanger_granti Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_nigrifrons Haemonchus contortus interaction exist FALSE"
[1] "Connochaetes_gnou Haemonchus contortus interaction exist FALSE"
[1] "Rupicapra_pyrenaica Haemonchus contortus interaction exist FALSE"
[1] "Capra_hircus Haemonchus contortus interaction exist FALSE"
[1] "Capra_nubiana Haemonchus contortus interaction exist FALSE"
[1] "Capra_sibirica Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_dorsalis Haemonchus contortus interaction exist FALSE"
[1] "Cephalophus_silvicultor Haemonchus contortus interaction exist FALSE"
[1] "Gazella_gazella Haemonchus contortus interaction exist FALSE"
> 
melmasri commented 4 years ago

@maxfarrell have you had the chance to look into this? or should I close it?

maxfarrell commented 4 years ago

Here is a reproducible example - you can see the two single host parasites are given a probability of 1 in the output. The same behaviour occurs when the model is run with cv and min.per.col is greater than 1.

# simulated SHP and phylogeny model problem

set.seed(1234)

n_host <- 10
n_para <- 20
tree <- rcoal(n_host)
Z <- matrix(rbinom(n_host*n_para, 1, 0.4), nrow=n_host, ncol=n_para)
Z[,1] <- c(1,rep(0,n_host-1))
colSums(Z)
rownames(Z) <- tree$tip.label

out <- network_est(Z = Z, tree = tree, slices = 1000, model.type = 'distance')

P <- sample_parameter(param=out$param, MODEL="distance", Z=out$Z, tree=out$tree)

aux = melt(P)
aux = aux[order(aux$value,decreasing=TRUE),]
colnames(aux)<-c('Host', 'Parasite', 'p(interaction)')
head(aux)
melmasri commented 4 years ago

OK I see the problem. I think it is coming from this line https://github.com/melmasri/HPprediction/blob/5444d4d23e3a146d27932c75a831c254ea0f7bb1/network_analysis.R#L290

The idea is that the probability function is 1-exp( - delta_jk). So for the single-host parasites delta is NA. When I replace it with Inf which is the case. The probability becomes 1-exp(-Inf)=1. If I replace it with zero it becomes 1-1 =0. This is a choice I made, since it doesn't affect the ranking of pairs since we remove all documented interactions.

In the MCMC method, such issue is not a problem, since I do not sample probabilities for single-host parasites, since I do no use them the likelihood calculation.

Is there another way you are thinking about?

maxfarrell commented 4 years ago

This makes sense. The model is unable to re-predict single-host parasites, and you replace these with 1s in the posterior predictions. When we are only interested in looking at the top-ranked and previously undocumented interactions, this won't make any difference. However, for model comparison and to look at the relative ranks of observed links among the predicted links, these 1s for the single-host parasites become problematic because it appears that the model has predicted them to be 1 when in fact the model did not make any predictons about them.

Instead, could we set these single-host parasites to NAs and have a warning raised, something like "The phylogeny-only model is unable to re-predict sinlge host parasites. These interactions are set to NA in the posterior predictions." Further, we could have an option for whether to set these to NA or 1 in the posterior (but use NA with the warning as default).