merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

default_matrix parser issue #2033

Closed: astro-noodles closed this issue 1 year ago

astro-noodles commented 1 year ago

Hello! I am having trouble with the default_matrix taxonomy import parser. I need to import my taxonomy from Kaiju manually, because I could not use the Kaiju parser (the newer Kaiju version is incompatible with anvi'o). I formatted my Kaiju output into a matrix file with 8 columns (1 for gene caller IDs and 7 for taxonomy) and ran anvi-import-taxonomy-for-genes -c CONTIGS.db -i input_matrix.txt -p default_matrix, but the import fails. I was hoping you could help, since I have been stuck for several days now. Here is the error:

...

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1 (... 355747 more ...)

Total num hits found .........................: 0

Config Error: Your parser (default_matrix) returned an empty result for your input file.
Something must have gone wrong. Maybe you selected a wrong parser, or the input
file format has changed between when this parser was implemented and now. Either
ways, if you have exhausted your ideas for troubleshooting you should send an
e-mail to anvio developers! Sorry for this!

I can provide the input files if needed for inspection.

I am using Windows 10 Pro (sorry) with an Intel i7 and 32 GB RAM, and every step of anvi'o has worked up to this point using the latest pull of the Docker image. (Thank you for making it available with almost no installation fuss, as I am not a “regular” bioinformatician and only need to do this metagenome analysis once.)

meren commented 1 year ago

Hey @astrobiophile,

Can you please send the first 10 lines of your input matrix?

astro-noodles commented 1 year ago

Here it is: kaiju-lvbr-spades-fixed-matrix.names_first10.txt

Thanks for taking a look, @meren !

meren commented 1 year ago

Hey @astrobiophile,

When I look at the contents of your input file, which is supposed to match the gene-taxonomy-txt artifact, I see this:

c_000000000004 Bacteria Proteobacteria Campylobacterales Epsilonproteobacteria Sulfurovaceae Sulfurovum Sulfurovum sp.
c_000000000017 Bacteria Actinobacteria Corynebacteriales Actinomycetia Corynebacteriaceae Corynebacterium Corynebacterium sp. 4H37-19
c_000000000035 Viruses Uroviricota Caudovirales Caudoviricetes Siphoviridae NA Microbacterium phage PauloDiaboli
c_000000000048 Bacteria Proteobacteria Pseudomonadales Gammaproteobacteria Marinobacteraceae Marinobacter Marinobacter sp. LV10R510-11A
c_000000000061 Bacteria Bacteroidetes Flavobacteriales Flavobacteriia Flavobacteriaceae Gillisia Gillisia mitskevichiae
c_000000000070 Bacteria Bacteroidetes Flavobacteriales Flavobacteriia Flavobacteriaceae Salegentibacter NA
c_000000000073 Bacteria Proteobacteria Burkholderiales Betaproteobacteria Comamonadaceae Variovorax NA
c_000000000082 Viruses Uroviricota Caudovirales Caudoviricetes NA NA uncultured Caudovirales phage
c_000000000097 Viruses Uroviricota Caudovirales Caudoviricetes NA NA uncultured Caudovirales phage

This is not the right format as (1) the first column should be gene caller ids, and not contig names, and (2) there should be a header with the following column names:

gene_callers_id | t_domain | t_phylum | t_class | t_order | t_family | t_genus | t_species

All of this is explained in the blog post linked from the artifact page.
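The two requirements above can be checked before running the import. What follows is only a minimal sketch, not part of anvi'o: the function name check_matrix is hypothetical, and it assumes the matrix is TAB-delimited, has exactly the eight columns listed above in that order, and uses integer gene caller ids.

```python
import csv

# Column names taken from the header listed above.
EXPECTED_HEADER = [
    "gene_callers_id", "t_domain", "t_phylum", "t_class",
    "t_order", "t_family", "t_genus", "t_species",
]


def check_matrix(handle):
    """Return a list of problems found in a taxonomy matrix.

    Hypothetical helper; assumes a TAB-delimited file (an assumption,
    not something anvi'o itself exposes).
    """
    problems = []
    # QUOTE_NONE keeps quote characters in taxon names as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)

    header = next(reader, None)
    if header != EXPECTED_HEADER:
        problems.append(f"unexpected header: {header}")

    # Data rows start on line 2 of the file.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} columns, found {len(row)}"
            )
        elif not row[0].isdigit():
            problems.append(
                f"line {line_no}: '{row[0]}' is not an integer gene caller id"
            )
    return problems
```

Opening the matrix with open("input_matrix.txt", newline="") and passing the handle to check_matrix would list each offending line. A file that starts directly with a row such as c_000000000004 is reported as having an unexpected header, because its first line is read as the header.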

astro-noodles commented 1 year ago

You are absolutely right, that was a mistake on my part. I used the wrong FASTA file version for taxonomic classification. With the correct file for the classifier, I am able to import the taxonomy effortlessly with the default_matrix parser. Thanks for the explanation!