merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
442 stars 146 forks source link

[FEATURE] Adding prevalence/abundance filter for gene clusters to functional enrichment script #1731

Open ts-dangelo opened 3 years ago

ts-dangelo commented 3 years ago

Greetings Anvio developers,

First off, many thanks for including so many great features in Anvio. In my current analysis, I was expecting to have to do much more bottom-up coding, but keep finding that what I need is already wrapped up inside Anvio.

This is a feature request that, as an added bonus, also included the resurfacing of an issue that has already been brought up: https://github.com/merenlab/anvio/issues/1383

I have been running the functional enrichment on gene-clusters as described previously (https://github.com/merenlab/anvio/issues/1196). This dataset included 222 genomes from three closely related phyla and includes a whopping 221821 gene clusters. I do not want to run these statistics on every single gene cluster, because a huge proportion of these gene clusters are singletons and doubles etc.. In this case I only want to run these tests on gene clusters that appear in 5% or more of my genomes (11 or more genomes). If the command anvi-compute-functional-enrichment had the flags -num_genomes_gene_cluster_has_hits and/or -num_genes_in_gene_cluster as filters, it would be easier to fine-tune what this command does. Behold, at least on my dataset, it does not seem like you can even run these tests without doing some work-around that involves filtering the data in this way.

I ran this command, and got the resulting errors seen below:

anvi-compute-functional-enrichment -p NRA-PAN.db -g NRA_pan-GENOMES.db -o nra_gcs_broad_enrich.txt --category-variable Sub_Broad --include-gc-identity-as-function  --annotation-source IDENTITY --functional-occurrence-table-output gc_occurence.txt

CITATION
===============================================
This program will compute enrichment scores using an R script developed by Amy
Willis. You can find more information about it in the following paper: Shaiber
et al (https://doi.org/10.1186/s13059-020-02195-w). When you publish your
findings, please do not forget to properly credit this work. :)

Genomes storage .............................................: Initialized (storage hash: hashe0f8978d)                                                                                                                                            
Num genomes in storage ......................................: 222
Num genomes will be used ....................................: 222
Pan DB ......................................................: Initialized: NRA-PAN.db (v. 14)
Gene cluster homogeneity estimates ..........................: Functional: [YES]; Geometric: [YES]; Combined: [YES]

* Gene clusters are initialized for all 221821 gene clusters in the database.

* Gene cluster identities are being added as functions into the functions                                                                                                                                                                          
dictionary. Functional annotation resources will include `IDENTITY` as an
option. See here why (apart from the fact that you asked for it by using the
flag `--include-gc-identity-as-function`):
https://github.com/merenlab/anvio/issues/1196

Category ....................................................: Sub_Broad                                                                                                                                                                           
Functional annotation source ................................: IDENTITY
Include ungrouped ...........................................: False
Occurrence frequency of functions: ..........................: gc_occurence.txt                                                                                                                                                                    
Functional occurrence summary ...............................: /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi                                                                                                                                   

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:           
              /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmp4xsdio2d  

Taking a look at the temp file noted by the program, the issue seems to have something to do with pvalue -> qvalue approximation that seems similar to what was brought up here: https://github.com/merenlab/anvio/issues/1383

**less /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmp4xsdio2d** 

# DATE: 29 Apr 21 14:06:55
# CMD LINE: anvi-script-enrichment-stats --input /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi --output nra_gcs_broad_enrich.txt
Parsed with column specification:
cols(
  IDENTITY = col_character(),
  accession = col_character(),
  gene_clusters_ids = col_character(),
  associated_groups = col_character(),
  p_Surface = col_double(),
  p_Subsurface = col_double(),
  N_Surface = col_double(),
  N_Subsurface = col_double()
)
There were 50 or more warnings (use warnings() to see the first 50)
Error in pi0est(p, ...) : 
  ERROR: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda.
Calls: %>% ... mutate.tbl_df -> mutate_impl -> <Anonymous> -> pi0est
Execution halted

I have speculated that this issue with R has something to do with the pvalue of singletons that have a proprotion of 0 in one group and a very small positive proportion in the other group (0.0066 in this case).

What I did to get around this was, I accessed this temp file:

/var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi

which is the input file to the R script written by Amy Willis. Outside of Anvio I filtered the rows of this file to only include gene clusters that were present in 5% or more of my genomes, essentially performing the feature I am requesting in this post. This reduced the number of gene clusters run through the glm test from 221821 to 5756. I took this filtered input for the R script and ran it through https://github.com/merenlab/anvio/blob/master/sandbox/anvi-script-enrichment-stats outside of Anvio, and it works as expected.

So maybe singletons cause a problem with the proportional GLM? Regardless, adding these filters would allow the user to prune data that might be unwanted or statistically hard to interpret, like singleton gene clusters, from these tests prior to performing them.

Cheers, Tim D'Angelo

Attached files:

NRA_GCs_Enrichment_input.txt = temp file /var/folders/c0/_78z6hl55cv1dsk4h__khy35h18x3w/T/tmpz_7h5kfi (unfiltered input to R script) NRA_GCs_Enrichment_input10.txt = prevalence filtered input to R script NRA_GCs_Enrichment_output10.txt = prevalence filtered output from R script

NRA_GCs_Enrichment_input.txt NRA_GCs_Enrichment_input10.txt NRA_GCs_Enrichment_output10.txt

meren commented 3 years ago

Hi @ts-dangelo, apologies for the late response. Unfortunately it took THIS long for me to be able to carve out some time to look into this carefully.

I think this is doable, and I wish to implement it. But the problem is this: when it comes to filtering based on prevalence across genomes, we have to work with 'functions', which, in some special cases, can be gene cluster names via the flag --include-gc-identity-as-function.

This means the feature you are asking for will be eliminating functions unless they are in more than X genomes without the flag --include-gc-identity-as-function flag and it will eliminate gene clusters unless they are in more than X genomes with the flag --include-gc-identity-as-function. Which makes sense to me, but I wanted to make sure you don't see a problem with this either.

ts-dangelo commented 3 years ago

Hi Meren,

That sounds fine to me. The way I see it, when using this script to look at individual GC's, the user might want to pre-filter the GC's, since the amount of them can be huge, and the task of interpreting all of the results might not be feasibly accomplished anyways. I think adding -num_genomes_gene_cluster_has_hits would be great, and the user could always run the analysis with and without the flag for comparison purposes.

Thanks a lot - Tim D'Angelo