Open ts-dangelo opened 3 years ago
Hi @ts-dangelo, apologies for the late response. Unfortunately it took THIS long for me to be able to carve out some time to look into this carefully.
I think this is doable, and I wish to implement it. But the problem is this: when it comes to filtering based on prevalence across genomes, we have to work with 'functions', which, in some special cases, can be gene cluster names via the flag --include-gc-identity-as-function
.
This means the feature you are asking for will be eliminating functions unless they are in more than X genomes without the flag --include-gc-identity-as-function
flag and it will eliminate gene clusters unless they are in more than X genomes with the flag --include-gc-identity-as-function
. Which makes sense to me, but I wanted to make sure you don't see a problem with this either.
Hi Meren,
That sounds fine to me. The way I see it, when using this script to look at individual GC's, the user might want to pre-filter the GC's, since the amount of them can be huge, and the task of interpreting all of the results might not be feasibly accomplished anyways. I think adding -num_genomes_gene_cluster_has_hits would be great, and the user could always run the analysis with and without the flag for comparison purposes.
Thanks a lot - Tim D'Angelo
Greetings Anvio developers,
First off, many thanks for including so many great features in Anvio. In my current analysis, I was expecting to have to do much more bottom-up coding, but keep finding that what I need is already wrapped up inside Anvio.
This is a feature request that, as an added bonus, also included the resurfacing of an issue that has already been brought up: https://github.com/merenlab/anvio/issues/1383
I have been running the functional enrichment on gene-clusters as described previously (https://github.com/merenlab/anvio/issues/1196). This dataset included 222 genomes from three closely related phyla and includes a whopping 221821 gene clusters. I do not want to run these statistics on every single gene cluster, because a huge proportion of these gene clusters are singletons and doubles etc.. In this case I only want to run these tests on gene clusters that appear in 5% or more of my genomes (11 or more genomes). If the command anvi-compute-functional-enrichment had the flags -num_genomes_gene_cluster_has_hits and/or -num_genes_in_gene_cluster as filters, it would be easier to fine-tune what this command does. Behold, at least on my dataset, it does not seem like you can even run these tests without doing some work-around that involves filtering the data in this way.
I ran this command, and got the resulting errors seen below:
Taking a look at the temp file noted by the program, the issue seems to have something to do with pvalue -> qvalue approximation that seems similar to what was brought up here: https://github.com/merenlab/anvio/issues/1383
I have speculated that this issue with R has something to do with the pvalue of singletons that have a proprotion of 0 in one group and a very small positive proportion in the other group (0.0066 in this case).
What I did to get around this was, I accessed this temp file:
/var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi
which is the input file to the R script written by Amy Willis. Outside of Anvio I filtered the rows of this file to only include gene clusters that were present in 5% or more of my genomes, essentially performing the feature I am requesting in this post. This reduced the number of gene clusters run through the glm test from 221821 to 5756. I took this filtered input for the R script and ran it through https://github.com/merenlab/anvio/blob/master/sandbox/anvi-script-enrichment-stats outside of Anvio, and it works as expected.
So maybe singletons cause a problem with the proportional GLM? Regardless, adding these filters would allow the user to prune data that might be unwanted or statistically hard to interpret, like singleton gene clusters, from these tests prior to performing them.
Cheers, Tim D'Angelo
Attached files:
NRA_GCs_Enrichment_input.txt = temp file /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi (unfiltered input to R script) NRA_GCs_Enrichment_input10.txt = prevalence filtered input to R script NRA_GCs_Enrichment_output10.txt = prevalence filtered output from R script
NRA_GCs_Enrichment_input.txt NRA_GCs_Enrichment_input10.txt NRA_GCs_Enrichment_output10.txt