merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

anvi-get-enriched-functions-per-pan-group should raise error if functions don't have accession values #1322

Closed ShaiberAlon closed 4 years ago

ShaiberAlon commented 4 years ago

Re: #1320

jaybake5 commented 4 years ago
anvi-self-test -v
Anvi'o version ...............................: esther (v6.1-master)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Installed using the "Following the active codebase" instructions on a Linux server (I guess that means I'm a wizard!).

Hello, I am getting the same error described in #1320, BUT I did already import the seed_eggNOG_ortholog column. I ran eggnog-mapper 2 successfully in a separate environment and imported the annotation file into anvi'o using anvi-script-run-eggnog-mapper with the --annotation flag.

Unlike @anzhangli84 in #1320, my seed_eggNOG_ortholog column is already populated. I've attached two samples of what my annotation files look like when they are imported into anvi'o using anvi-script-run-eggnog-mapper with the --annotation flag. Note that I did change the 'query name' values to 'g00001', 'g00002', etc. in the .emapper.annotations output files to address the following error:

Config Error: Gene caller ids found in this annotation file does not start with the expected 
              prefix. This is a historical glitch that is not quite easy to address          
              programmatically, so anvi'o asks you to add the expected prefix as the first   
              character of every gene call in your annotations file. This is the prefix what 
              you need to add manually to the very beginning of every line (anvi'o developers
              are very sorry for this step): 'g'.

All of the original files had a unique 8-letter prefix for each gene number, which I changed to 'g' (I mention this detail because I'm not sure whether it's relevant to this problem).
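For anyone reproducing the renaming step described above, here is a minimal Python sketch of one way to do it. It rests on assumptions not confirmed in this thread: that the annotation file is TAB-delimited with the query name in the first column, that comment lines start with '#', and that the original prefix is exactly eight letters, optionally followed by an underscore. The helper name `reprefix_line` is hypothetical and is not part of anvi'o or eggnog-mapper.

```python
import re

# Assumed shape of the original prefix: exactly eight letters, then an
# optional underscore (e.g. "ABCDEFGH_00001"). Adjust to the real files.
PREFIX = re.compile(r"^[A-Za-z]{8}_?")

def reprefix_line(line, new_prefix="g"):
    """Replace the leading prefix of the first TAB-separated field."""
    # Leave comment lines and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    query, sep, rest = line.partition("\t")
    return PREFIX.sub(new_prefix, query, count=1) + sep + rest

# Made-up example lines, not taken from the attached files.
print(reprefix_line("ABCDEFGH_00001\tsome_ortholog\n"), end="")
print(reprefix_line("# a comment line\n"), end="")
```

Lines whose first field does not match the assumed prefix pattern pass through unchanged, so it is worth spot-checking the output before importing it.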

When I run anvi-get-enriched-functions-per-pan-group with --list-annotation-sources on the pangenome I want to analyze, I get the following output and am presented with KEGG, COG, eggNOG, etc. choices, so it seems that the functions have been imported properly:

anvi-get-enriched-functions-per-pan-group -p gracilibacteria-pan/Gracilibacteria_Pan-PAN.db -g gracilibacteria-GENOMES.db --list-annotation-sources -o sources_list2
Genomes storage .............................................: Initialized (storage hash: hashc2b6d240)                                                                                     
Num genomes in storage ......................................: 32
Num genomes will be used ....................................: 26
Pan DB ......................................................: Initialized: gracilibacteria-pan/Gracilibacteria_Pan-PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [YES]; Geometric: [YES]; Combined: [YES]

* Gene clusters are initialized for all 11721 gene clusters in the database.

Available functional annotation sources .....................: KEGG_PATHWAYS, EC_NUMBER, BiGG_Reactions, COG_CATEGORY, eggNOG_free_text, EGGNOG_BACT, GO_TERMS, BRITE, eggNOG_best_tax, KEGG_MODULE, Preferred_Name, KEGG_KO

But I get the following message when I actually run the full anvi-get-enriched-functions-per-pan-group command:

anvi-get-enriched-functions-per-pan-group -p gracilibacteria-pan/Gracilibacteria_Pan-PAN.db     -g gracilibacteria-GENOMES.db     --category source     --annotation-source KEGG_MODULE     -o GRACIL-PAN-enriched-functions-source.txt     --functional-occurrence-table-output GRACIL-functions-occurrence-frequency.txt
Genomes storage .............................................: Initialized (storage hash: hashc2b6d240)                                                                                     
Num genomes in storage ......................................: 32
Num genomes will be used ....................................: 26
Pan DB ......................................................: Initialized: gracilibacteria-pan/Gracilibacteria_Pan-PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [YES]; Geometric: [YES]; Combined: [YES]

* Gene clusters are initialized for all 11721 gene clusters in the database.

Category ....................................................: source                                                                                                                       
Functional annotation source ................................: KEGG_MODULE
Exclude ungrouped ...........................................: False
Occurrence frequency of functions: ..........................: GRACIL-functions-occurrence-frequency.txt                                                                                    
Functional occurrence summary ...............................: /usr/local/scratch/MISC/jobaker/TMP/tmp9s_50_3n                                                                              

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:           
              /usr/local/scratch/MISC/jobaker/TMP/tmp50ez11hp 

The contents of the log file:

cat /usr/local/scratch/MISC/jobaker/TMP/tmp50ez11hp
# DATE: 06 Feb 20 15:33:34
# CMD LINE: anvi-script-run-functional-enrichment-stats --input /usr/local/scratch/MISC/jobaker/TMP/tmp9s_50_3n --output GRACIL-PAN-enriched-functions-source.txt
Parsed with column specification:
cols(
  KEGG_MODULE = col_character(),
  function_accession = col_logical(),
  gene_clusters_ids = col_character(),
  associated_groups = col_character(),
  p_oral = col_double(),
  p_environmental = col_double(),
  p_unknown = col_double(),
  N_oral = col_double(),
  N_environmental = col_double(),
  N_unknown = col_double()
)
Error in smooth.spline(lambda, pi0, df = smooth.df) : 
  missing or infinite values in inputs are not allowed
Calls: %>% ... mutate_impl -> <Anonymous> -> pi0est -> smooth.spline
Execution halted

Looking at my functional occurrence summary tmp file (tmp9s_50_3n.txt, attached), it looks like I do not have values in my 'function_accession' column either. Since I did have the seed_eggNOG_ortholog column in my annotation file, I am wondering why this is not linking up. What do I need to do so that I get 'function_accession' values when I import annotations from eggnog-mapper?
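For readers who hit the same failure, the empty accession column can be confirmed before the enrichment script runs. In the log above the column is parsed as `col_logical()`, which is consistent with a column that holds no values at all. The sketch below is only an illustration and is not anvi'o code: it assumes the summary file is TAB-delimited and that the column is named `function_accession` as in the log, and the helper name `count_missing_accessions` is hypothetical.

```python
import csv
import io

def count_missing_accessions(handle, column="function_accession"):
    """Return (missing, total) row counts for `column` in a TAB-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    missing = total = 0
    for row in reader:
        total += 1
        # Treat empty strings and a literal "NA" as missing values.
        value = (row[column] or "").strip()
        if value == "" or value.upper() == "NA":
            missing += 1
    return missing, total

# Tiny made-up table standing in for the real summary file.
sample = io.StringIO(
    "KEGG_MODULE\tfunction_accession\tgene_clusters_ids\n"
    "M00001\t\tGC_00000001\n"
    "M00002\t\tGC_00000002\n"
)
missing, total = count_missing_accessions(sample)
print(f"{missing} of {total} rows have no accession")
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `open("tmp9s_50_3n.txt", newline="")`.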

Thanks very much for your help!

tmp9s_50_3n.txt

GCA_008015855.1.emapper.annotations.fixed.txt Tm7x.emapper.annotations.fixed.txt

meren commented 4 years ago

I assume this is resolved now :) Please correct me if I'm wrong.

Thanks!