katiesaund / hogwash

Three bacterial GWAS methods all rolled into one easy-to-use R package
Other
19 stars 5 forks source link

ERROR: Tree edge lengths are unreasonably long #88

Closed nicola-palmieri closed 2 years ago

nicola-palmieri commented 2 years ago

Dear Author

I am trying to apply hogwash to a dataset from Gallibacterium (12 strains) to find associations between SNPs and antibiotic resistance profiles. I have produced the snps, phenotype matrix and tree as described. After running the hogwash command using a subset of the snps matrix (first 100 snps), I get the following error:

> hw = hogwash(pheno = pheno[tree$tip.label,"AMP",drop=F], geno = snps[tree$tip.label,1:100], tree = tree)

[1] "Ancestral reconstruction 25% complete: 2022-01-13 14:25:10"
[1] "Ancestral reconstruction 50% complete: 2022-01-13 14:25:11"
[1] "Ancestral reconstruction 75% complete: 2022-01-13 14:25:12"
[1] "Ancestral reconstruction complete: 2022-01-13 14:25:13"
Error in identify_short_edges(tr) : 
  Tree edge lengths are unreasonably long compared to the other
           edges.

I attach a file of the tree I am using (the attached tree includes also a reference genome), which I removed before providing the tree to hogwash.

parsnp-tree

Thank you! Nicola

katiesaund commented 2 years ago

Hi Nicola,

This issue comes up because the tree you're working with is quite small (very few samples).

In the manuscript I describe in the methods section that we apply the following quality control metrics on the tree:
"Low-confidence edges are defined as those edges with low bootstrap support (default <70 %), those that are more than 10 % of the total tree length, or those with low genotype or phenotype ancestral reconstruction support (maximum likelihood <0.875). Low-confidence edges are ignored during permutation testing."

We do filtering to remove long branches because ancestral reconstruction performed on very long edges is not credible. It's likely that the trait has taken on more than one value over that branch and hogwash is not nuanced enough to incorporate a pool of reconstructed trait values. Unfortunately, the criteria that we exclude any branches that are more than 10% of the total tree length can be difficult to clear if you have very few samples. In fact, because of this hogwash won't work for trees with 10 or fewer branches which works out to <7 samples in a fully bifurcating tree.

Thank you for including the tree image: It shows that the branch to UMN179 is very long. I expect that that some of the other edges are very long as well, just hard to see because to the very long branch to UMN179 on this scale. I assume that hogwash is filtering out to down to <7 samples.

If you're interested in running hogwash despite the small numbers of samples I've implemented a branch of the code called: "few_samples" that you can try out on your tree. If it works and you find a useful result, please let me know and I'll consider pushing this change to accept small trees into the main function so others can also run hogwash if they have very few samples. To access this version of the code please run the following code to access this specific branch:

devtools::install_github("katiesaund/hogwash@few_samples" )
library(hogwash) 

Given the small sample size please keep in mind the power your current study has to detect associations.

Please let me know if this code change does not fix the issue or if you have other issues and questions.

All the best, Katie

nicola-palmieri commented 2 years ago

 Dear Katie

thank you for the quick response. I have just started playing for the first time with GWAS analysis, as I am planning a large experiment with many bacterial sequences (>100) and corresponding antibiotic profiles. I understand that one needs lot of not so closely related samples to do a good GWAS.

For now I was just playing with this small dataset which is also not ideal, as many individuals are very closely related. I was curious if can get some hints about which genes are related to which antibiotic resistance. I am curious to try your new implementation.

I can let you know if it is interesting for you.

Thank you!

Kind regards Nicola

On Mon, 17 Jan 2022 at 21:49, Katie Saund @.***> wrote:

Hi Nicola,

This issue comes up because the tree you're working with is quite small (very few samples).

In the manuscript I describe in the methods section that we apply the following quality control metrics on the tree: "Low-confidence edges are defined as those edges with low bootstrap support (default <70 %), those that are more than 10 % of the total tree length, or those with low genotype or phenotype ancestral reconstruction support (maximum likelihood <0.875). Low-confidence edges are ignored during permutation testing."

We do filtering to remove long branches because ancestral reconstruction performed on very long edges is not credible. It's likely that the trait has taken on more than one value over that branch and hogwash is not nuanced enough to incorporate a pool of reconstructed trait values. Unfortunately, the criteria that we exclude any branches that are more than 10% of the total tree length can be difficult to clear if you have very few samples. In fact, because of this hogwash won't work for trees with 10 or fewer branches which works out to <7 samples in a fully bifurcating tree.

Thank you for including the tree image: It shows that the branch to UMN179 is very long. I expect that that some of the other edges are very long as well, just hard to see because to the very long branch to UMN179 on this scale. I assume that hogwash is filtering out to down to <7 samples.

If you're interested in running hogwash despite the small numbers of samples I've implemented a branch of the code called: "few_samples" that you can try out on your tree. If it works and you find a useful result, please let me know and I'll consider pushing this change to accept small trees into the main function so others can also run hogwash if they have very few samples. To access this version of the code please run the following code to access this specific branch:

@.***_samples" )

library(hogwash)

Given the small sample size please keep in mind the power your current study has to detect associations.

Please let me know if this code change does not fix the issue or if you have other issues and questions.

All the best, Katie

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.

-- Nicola Palmieri Bioinformatician University Clinic for Poultry and Fish Medicine Vetmeduni Wien

nicola-palmieri commented 2 years ago

I have tried to run hogwash@few_samples using the first 1000 SNPs of my dataset or all ~60000 SNPs and I get the following errors in each case:

hw = hogwash(pheno = pheno[tree$tip.label,"AMP",drop=F], geno = snps[tree$tip.label,1:1000], tree = tree, perm = 500) [1] "Ancestral reconstruction 25% complete: 2022-01-20 11:18:15" [1] "Ancestral reconstruction 50% complete: 2022-01-20 11:18:23" [1] "Ancestral reconstruction 75% complete: 2022-01-20 11:18:30" [1] "Ancestral reconstruction complete: 2022-01-20 11:18:38" Error in get_dropped_genotypes(geno, geno_to_keep) : There are no genotypes left to test.

hw = hogwash(pheno = pheno[tree$tip.label,"AMP",drop=F], geno = snps[tree$tip.label, ], tree = tree, perm = 500) [1] "Ancestral reconstruction 25% complete: 2022-01-20 12:44:52" [1] "Ancestral reconstruction 50% complete: 2022-01-20 12:52:37" [1] "Ancestral reconstruction 75% complete: 2022-01-20 13:00:34" [1] "Ancestral reconstruction complete: 2022-01-20 13:08:34" [1] "Permutation test complete: 2022-01-20 13:10:05" Error in caper::phylo.d(data = compar_data_obj, names.col = ID, binvar = trait, : Phylogeny contains pairs of tips on zero branch lengths, cannot currently simulate

I guess there is probably too little power in these data, could you help me to understand these errors? If you want I could send you the snp table, phenotype table and tree.

Thanks Nicola

katiesaund commented 2 years ago

The first error message There are no genotypes left to test. occurred because in the 1,000 SNPs you used in there were no genotypes with at least 2 high confidence genotype transition edges. When you scaled up to all 60,000 SNPs you didn't get that error because in the additional 59,000 SNPs there were at least some SNPs whose ancestral reconstruction showed at least 2 high confidence transition edges in the tree.

The second error message occurs when hogwash calls an external a function called caper::phylo. This function is used to determine the phylogenetic signal for the phenotype given the tree. The error is should occur when there are edges in a tree with length == 0. Important to know: the message you saw "Permutation test complete: 2022-01-20 13:10:05" indicates that permutation test with p-value calculation did successfully run. This means there isn't a formatting issue with your data for running the core hogwash algorithm; the problem is just this one function performing the phylogenetic signal calculation.

Below is some code to demonstrate the issue:

Set up a dummy tree

library(ape)
library(caper)
num_tip <- 25
set.seed(1)
pheno <- matrix(rbinom(num_tip, 1, 0.6))
set.seed(1)
tree <- ape::rcoal(n = num_tip)
tree$node.label <- rep(100, ape::Nnode(tree))
row.names(pheno) <- tree$tip.label

The following code runs for this dummy tree without error message (this is the behavior we expect from phylo.d and therefore from hogwash:

tree$node.label <- NULL
temp_trait <- pheno[, 1, drop = FALSE]
temp_trait <- as.data.frame(temp_trait)
temp_trait <- cbind(tree$tip.label, temp_trait)
colnames(temp_trait) <- c("ID", "trait")
row.names(temp_trait) <- NULL
compar_data_obj <- caper::comparative.data(data = temp_trait,
                                           phy = tree,
                                           names.col =  ID)
set.seed(1)
d_stat_results <- caper::phylo.d(data = compar_data_obj,
                                 names.col = ID,
                                 binvar = trait,
                                 phy = tree)
d_stat <- d_stat_results$DEstimate
d_stat

No error messages.

Now, if I modify the dummy tree in such a way that I shrink two adjacent terminal edges to length 0 I can reproduce the error message.

tree2 <- tree
tree2$edge.length[46:47] <- 0
plot(tree2)
ape::edgelabels()

compar_data_obj2 <- caper::comparative.data(data = temp_trait,
                                           phy = tree2,
                                           names.col =  ID)
set.seed(1)
caper::phylo.d(data = compar_data_obj2,
               names.col = ID,
               binvar = trait,
               phy = tree2)
# "Phylogeny contains pairs of tips on zero branch lengths, cannot currently simulate"

The error message does not occur if the edges are tiny, but non-zero

tree3 <- tree
tree3$edge.length[46:47] <- 0.00000000000000000000000000000000000000000000000001
compar_data_obj3 <- caper::comparative.data(data = temp_trait,
                                            phy = tree3,
                                            names.col =  ID)
set.seed(1)
d_stat_results3 <- caper::phylo.d(data = compar_data_obj3,
                                  names.col = ID,
                                  binvar = trait,
                                  phy = tree3)

My suggestion is to check if you have edges with length 0. You can either remove a sample whose terminal edge is zero or you could assign a tiny, but non-zero edge length. There are draw backs to either approach, but since you're just exploring the utility of hogwash on a pilot data set I don't expect there to be a meaningful impact to your analysis at this stage.

If your tree does not have any edges with length 0 please let me know! I would need to dig deeper to get to the source of the error in that case.

I hope this helps! -Katie

nicola-palmieri commented 2 years ago

Dear Katie

sorry to bother you again, but I continuously get a NULL output, I have set the edge lengths to a small value and tried to fit 3 different antibiotic profiles to the SNPs. Below I also show the first 10 SNPs and the single column data frame with the antibiotic binary vector for cefazoline.

What could be the reason?

Thank you again! Nicola

> hw = hogwash(pheno = pheno[tree$tip.label,"CEZ",drop=F], geno = snps[tree$tip.label, ], tree = tree, perm = 500)
[1] "Ancestral reconstruction 25% complete: 2022-02-08 15:30:27"
[1] "Ancestral reconstruction 50% complete: 2022-02-08 15:38:07"
[1] "Ancestral reconstruction 75% complete: 2022-02-08 15:45:46"
[1] "Ancestral reconstruction complete: 2022-02-08 15:53:36"
[1] "Permutation test complete: 2022-02-08 15:55:07"
[1] "Ancestral reconstruction 25% complete: 2022-02-08 17:16:28"
[1] "Ancestral reconstruction 50% complete: 2022-02-08 17:24:17"
[1] "Ancestral reconstruction 75% complete: 2022-02-08 17:31:51"
[1] "Ancestral reconstruction complete: 2022-02-08 17:39:22"
[1] "Permutation test complete: 2022-02-08 17:40:50"
> hw
NULL

> snps[tree$tip.label, 1:10]
           231 235 236 393 817 818 869 2648 2716 2822
51_liver_2   1   1   1   0   0   0   0    0    0    0
51_ovar_2    1   1   1   0   0   0   0    0    0    0
24_heart_2   0   0   0   1   1   1   1    1    1    1
24_ovar_2    0   0   0   1   1   1   1    1    1    1
24_ovar_1    0   0   0   1   1   1   1    1    1    1
24_heart_1   0   0   0   1   1   1   1    1    1    1
4_ovar_1     1   1   0   0   0   0   1    1    1    1
4_ovar_2     1   1   0   0   0   0   1    1    1    1
4_liver_1    1   1   0   0   0   0   1    1    1    1
4_liver_2    1   1   0   0   0   0   1    1    1    1
51_liver_1   1   1   1   0   0   0   0    0    0    0
51_ovar_1    1   1   1   0   0   0   0    0    0    0

           CEZ
51_liver_2   1
51_ovar_2    1
24_heart_2   1
24_ovar_2    1
24_ovar_1    1
24_heart_1   1
4_ovar_1     0
4_ovar_2     0
4_liver_1    1
4_liver_2    1
51_liver_1   1
51_ovar_1    1
nicola-palmieri commented 2 years ago

I just realized the hogwash function is not supposed to return any value, but it writes directly to an output file.

Thanks for all the help. Now the issue is solved!