sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
470 stars 80 forks source link

Error: Identities do not match taxonomy database when using sourmash tax metagenome #2538

Open bryshalm opened 1 year ago

bryshalm commented 1 year ago

The identities of the query genome and the metagenome database are not matching. I am getting an error message indicating the genome is not present in the database when it is. Error message shown below.

== This is sourmash version 4.6.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 21 gather results from 'SRR596245_V1.csv'.
of 21 gather results, missed 21 lineage assignments.
The following are missing from the taxonomy information: MF034645,HE998670,MI348317,KY379875,CAQA01000070,GU568014,KY860925,DQ465520,JAOC01000010,AY230188,KF069515,KU341118,EU477537,AB302407,BDFN01001938,LS483396,KM221486,MF664522,EU290705,JQ924057,FJ673326
loaded 21 results total from 1 gather CSVs
ERROR: ident BDFN01001938.3624.5436 is not in the taxonomy database.

Steps: Within the /brymoore/sourmash_16S/sample_fasta_files created a sqlite3 database to reduce time.

sourmash tax prepare -t ../silva/SILVA_138_taxonomy.csv -o silva.taxdb -F sql

Test sourmash tax metagenome with the sqldb.

sourmash tax metagenome -g SRR596245_V1.csv -t silva.taxdb -o SRR596245_V1

Output:


== This is sourmash version 4.6.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 21 gather results from 'SRR596245_V1.csv'.
loaded 21 results total from 1 gather CSVs
Traceback (most recent call last):
  File "/home/brymoore/miniconda3/envs/sourmash_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/sourmash/__main__.py", line 13, in main    return mainmethod(args)
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/sourmash/cli/tax/metagenome.py", line 96, in main
    return sourmash.tax.__main__.metagenome(args)
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/sourmash/tax/__main__.py", line 106, in metagenome
    summarized_gather[rank], seen_perfect, _ = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/sourmash/tax/tax_utils.py", line 257, in summarize_gather_at
    assert lineage[-1].rank == rank, lineage[-1]
AssertionError: LineagePair(rank='family', name='Symphytocarpus impexus')

Checked for the presence of Symphytocarpus impexus within SILVA_138_taxonomy.txt.

grep "Symphytocarpus impexus" ../silva/SILVA_138_taxonomy.txt

Output:

>AY230188.1.2014 Eukaryota;Amorphea;Amoebozoa;Myxogastria;Symphytocarpus impexus

Adapted SILVA_138_taxonomy.csv to include "NA" in empyt taxonomic rank positions. Believe the error is ocurring due to the difference in tax rank of some of the microorganisms. The standard seems to be 7 or 8 while the Eukrayota seem to have only 5. File location: brymoore/sourmash_16s/silva/tax_python_scriptaddna.sh.

python3 tax_python_scriptaddna.sh

tax_python_scriptaddna.sh:

import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"

tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1,  expand=True)
tax.set_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")

Within the /brymoore/sourmash_16S/sample_fasta_files.Confirmed "NA" would be added to empty tax rank positions.

grep "Symphytocarpus impexus" ../silva/SILVA_138_taxonomy.csv

Output:

>AY230188.1.2014,Eukaryota,Amorphea,Amoebozoa,Myxogastria,Symphytocarpus impexus,NA,NA

Test sourmash tax metagenome again.

sourmash tax metagenome -g SRR596245_V1.csv -t ../silva/SILVA_138_taxonomy.csv -o SRR596245_V1

Output:

== This is sourmash version 4.6.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 21 gather results from 'SRR596245_V1.csv'.
of 21 gather results, missed 21 lineage assignments.
The following are missing from the taxonomy information: KU341118,AB302407,EU290705,JQ924057,EU477537,MF034645,MF664522,CAQA01000070,KY860925,JAOC01000010,LS483396,KY379875,BDFN01001938,DQ465520,KM221486,MI348317,HE998670,KF069515,AY230188,FJ673326,GU568014
loaded 21 results total from 1 gather CSVs
saving 'csv_summary' output to 'SRR596245_V1.summarized.csv'.

Checked for the presence of KY379875 within the SILVA_138_taxonomy.csv.

grep "KY379875" ../silva/SILVA_138_taxonomy.csv

Output:

>KY379875.1.3005,Bacteria,Cyanobacteria,Cyanobacteriia,Synechococcales,Cyanobiaceae,Cyanobium PCC-6307,Cyanobium sp. CZS_40A

Checked for the presence of KY379875 within the query file.

grep "KY379875" SRR596245_V1.csv

Output:

487,0.07102231296485344,0.09023941068139964,0.007145982207962666,0.007145982207962666,,,,/home/brymoore/sourmash_16S/silva/zipped_directory_sigs/SILVA_138_k7.fasta.sig.zip,KY379875.1.3005 Bacteria;Cyanobacteria;Cyanobacteriia;Synechococcales;Cyanobiaceae;Cyanobium PCC-6307;Cyanobium sp. CZS_40A,6f4833c8a50a0c981bcee5ce90b40ab4,0.8968692449355433,49,7,226,SRR596245_V1.fastq,SRR596245.1 GLKYXIS02GHTVR/2,8f434ab3,6857,7,DNA,1,6857,False,0.6853517791641982,0.9845709525841526,0.8349613658741755,0.9845709525841526,False,,1656,6857

Used sourmash tax metagenome with --keep-full-identifiers.

sourmash tax metagenome -g SRR596245_V1.csv -t ../silva/SILVA_138_taxonomy.csv -o SRR596245_V1 --keep-identifier-versions

Output:

== This is sourmash version 4.6.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 21 gather results from 'SRR596245_V1.csv'.
of 21 gather results, missed 21 lineage assignments.
The following are missing from the taxonomy information: MF034645,HE998670,MI348317,KY379875,CAQA01000070,GU568014,KY860925,DQ465520,JAOC01000010,AY230188,KF069515,KU341118,EU477537,AB302407,BDFN01001938,LS483396,KM221486,MF664522,EU290705,JQ924057,FJ673326
loaded 21 results total from 1 gather CSVs
ERROR: ident BDFN01001938.3624.5436 is not in the taxonomy database.
ctb commented 1 year ago

thanks @bryshalm! digging in now.

created a fake/small tax CSV file:

head -1 ~brymoore/sourmash_16S/silva/SILVA_138_taxonomy.csv > smalltax.csv
grep "Symphytocarpus impexus" ~brymoore/sourmash_16S/silva/SILVA_138_taxonomy.csv  >> smalltax.csv

copied in file:

cp ~brymoore/sourmash_16S/sample_fasta_files/SRR596245_V1.csv .

ran:

sourmash tax metagenome -t smalltax.csv -g SRR596245_V1.csv

and got an exciting!! new!!! error:

== This is sourmash version 4.7.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results from 'SRR596245_V1.csv'.
loaded results for 1 queries from 1 gather CSVs
of 21 gather results, lineage assignments for 21 results were missed.
The following are missing from the taxonomy information: KY860925, HE998670, FJ673326, AY230188, KU341118, KM221486, EU290705, AB302407, GU568014, CAQA01000070, DQ465520, MF034645, EU477537, BDFN01001938, LS483396, MF664522, KY379875, KF069515, JQ924057, JAOC01000010, MI348317
Starting summarization up rank(s): strain, species, genus, family, order, class, phylum, superkingdom 
Traceback (most recent call last):
  File "/home/ctbrown/miniconda3/envs/sourmash-dev/bin/sourmash", line 8, in <module>
    sys.exit(main())
  File "/home/ctbrown/sourmash/src/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
  File "/home/ctbrown/sourmash/src/sourmash/cli/tax/metagenome.py", line 95, in main
    return sourmash.tax.__main__.metagenome(args)
  File "/home/ctbrown/sourmash/src/sourmash/tax/__main__.py", line 155, in metagenome
    tax_utils.write_summary(query_gather_results, out_fp, limit_float_decimals=limit_float)
  File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 637, in write_summary
    header, summary = q_res.make_full_summary(limit_float=limit_float_decimals, classification=classification)
  File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 1752, in make_full_summary
    self.check_summarization()
  File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 1724, in check_summarization
    raise ValueError("lineages not summarized yet.")
ValueError: lineages not summarized yet.

key point is this reproduces the problem that The following are missing from the taxonomy information:... which to me looks like at least one critical issue!

...oh. duh. smalltax.csv looks like this:

ident,superkingdom,phylum,class,order,family,genus,species
AY230188.1.2014,Eukaryota,Amorphea,Amoebozoa,Myxogastria,Symphytocarpus impexus,NA,NA

and it looks like the problem is that we're not removing the > at the beginning. Sigh.

@bryshalm in your tax_python_scriptaddna.sh, suggest adding a tax = tax[0].str.slice(start=1) right after the first str.split. Give that a try and let me know how it goes ;)

import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"

tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1,  expand=True)
tax.set_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")
bryshalm commented 1 year ago

After including the tax = tax[0].str.slice(start=1) in tax_python_scriptaddna.py in the ~brymoore/sourmash_16S/silva directory.

import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"

tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1,  expand=True)
tax = tax[0].str.slice(start=1)
tax.set_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")

And run

python3 tax_python_scriptaddna.py

I get this error message:

Traceback (most recent call last):                                                                                                         
 File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 7, in <module>
 tax.set_index([0], inplace=True)                                                                                                        
File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__           
return object.__getattribute__(self, name)                                                                                            
AttributeError: 'Series' object has no attribute 'set_index'. Did you mean: 'reset_index'? 

When I try reset_index

import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"

tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1,  expand=True)
tax = tax[0].str.slice(start=1)
tax.reset_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")

And run

python3 tax_python_scriptaddna.py

I get:

Traceback (most recent call last):
  File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 7, in <module>
    tax.reset_index([0], inplace=True)
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/core/series.py", line 1568, in reset_index
    raise TypeError(
TypeError: Cannot reset_index inplace on a Series to create a DataFrame
(sourmash_env) brymoore@c8-92:~/sourmash_16S/silva$ nano tax_python_scriptaddna.py
(sourmash_env) brymoore@c8-92:~/sourmash_16S/silva$ python3 tax_python_scriptaddna.py
Traceback (most recent call last):
  File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 8, in <module>
    tax = tax[1].str.split(';',n=6, expand=True)
AttributeError: 'str' object has no attribute 'str'
bryshalm commented 1 year ago

I didn't edit the python script, but I did remove the > using the command below.

cat SILVA_138_taxonomy.csv | sed “s/>//g” > SILVA_taxonomy.csv

I was then able to successfully use sourmash tax metagenome with the SILVA_taxonomy.csv file.