Open bryshalm opened 1 year ago
thanks @bryshalm! digging in now.
created a fake/small tax CSV file:
head -1 ~brymoore/sourmash_16S/silva/SILVA_138_taxonomy.csv > smalltax.csv
grep "Symphytocarpus impexus" ~brymoore/sourmash_16S/silva/SILVA_138_taxonomy.csv >> smalltax.csv
copied in file:
cp ~brymoore/sourmash_16S/sample_fasta_files/SRR596245_V1.csv .
ran:
sourmash tax metagenome -t smalltax.csv -g SRR596245_V1.csv
and got an exciting!! new!!! error:
== This is sourmash version 4.7.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded 1 gather results from 'SRR596245_V1.csv'.
loaded results for 1 queries from 1 gather CSVs
of 21 gather results, lineage assignments for 21 results were missed.
The following are missing from the taxonomy information: KY860925, HE998670, FJ673326, AY230188, KU341118, KM221486, EU290705, AB302407, GU568014, CAQA01000070, DQ465520, MF034645, EU477537, BDFN01001938, LS483396, MF664522, KY379875, KF069515, JQ924057, JAOC01000010, MI348317
Starting summarization up rank(s): strain, species, genus, family, order, class, phylum, superkingdom
Traceback (most recent call last):
File "/home/ctbrown/miniconda3/envs/sourmash-dev/bin/sourmash", line 8, in <module>
sys.exit(main())
File "/home/ctbrown/sourmash/src/sourmash/__main__.py", line 19, in main
retval = mainmethod(args)
File "/home/ctbrown/sourmash/src/sourmash/cli/tax/metagenome.py", line 95, in main
return sourmash.tax.__main__.metagenome(args)
File "/home/ctbrown/sourmash/src/sourmash/tax/__main__.py", line 155, in metagenome
tax_utils.write_summary(query_gather_results, out_fp, limit_float_decimals=limit_float)
File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 637, in write_summary
header, summary = q_res.make_full_summary(limit_float=limit_float_decimals, classification=classification)
File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 1752, in make_full_summary
self.check_summarization()
File "/home/ctbrown/sourmash/src/sourmash/tax/tax_utils.py", line 1724, in check_summarization
raise ValueError("lineages not summarized yet.")
ValueError: lineages not summarized yet.
key point is this reproduces the problem that The following are missing from the taxonomy information:
... which to me looks like at least one critical issue!
...oh. duh. smalltax.csv
looks like this:
ident,superkingdom,phylum,class,order,family,genus,species
AY230188.1.2014,Eukaryota,Amorphea,Amoebozoa,Myxogastria,Symphytocarpus impexus,NA,NA
and it looks like the problem is that we're not removing the >
at the beginning. Sigh.
@bryshalm in your tax_python_scriptaddna.sh
, suggest adding a tax = tax[0].str.slice(start=1)
right after the first str.split. Give that a try and let me know how it goes ;)
import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"
tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1, expand=True)
tax.set_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")
After including the tax = tax[0].str.slice(start=1)
in tax_python_scriptaddna.py
in the ~brymoore/sourmash_16S/silva
directory.
import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"
tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1, expand=True)
tax = tax[0].str.slice(start=1)
tax.set_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")
And run
python3 tax_python_scriptaddna.py
I get this error message:
Traceback (most recent call last):
File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 7, in <module>
tax.set_index([0], inplace=True)
File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'set_index'. Did you mean: 'reset_index'?
When I try reset_index
import pandas as pd
tax_file= "SILVA_138_taxonomy.txt"
tax = pd.read_csv(tax_file, header=None, sep="\t")
tax = tax[0].str.split(' ',n=1, expand=True)
tax = tax[0].str.slice(start=1)
tax.reset_index([0], inplace=True)
tax = tax[1].str.split(';',n=6, expand=True)
tax.rename(columns={0:'superkingdom', 1: 'phylum', 2: 'class', 3:'order', 4:'family', 5: 'genus', 6:'species'}, inplace=True)
tax.index.name = "ident"
tax_out = "SILVA_138_taxonomy.csv"
tax.to_csv(tax_out, na_rep="NA")
And run
python3 tax_python_scriptaddna.py
I get:
Traceback (most recent call last):
File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 7, in <module>
tax.reset_index([0], inplace=True)
File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/brymoore/miniconda3/envs/sourmash_env/lib/python3.10/site-packages/pandas/core/series.py", line 1568, in reset_index
raise TypeError(
TypeError: Cannot reset_index inplace on a Series to create a DataFrame
(sourmash_env) brymoore@c8-92:~/sourmash_16S/silva$ nano tax_python_scriptaddna.py
(sourmash_env) brymoore@c8-92:~/sourmash_16S/silva$ python3 tax_python_scriptaddna.py
Traceback (most recent call last):
File "/home/brymoore/sourmash_16S/silva/tax_python_scriptaddna.py", line 8, in <module>
tax = tax[1].str.split(';',n=6, expand=True)
AttributeError: 'str' object has no attribute 'str'
I didn't edit the python script, but I did remove the >
using the command below.
cat SILVA_138_taxonomy.csv | sed “s/>//g” > SILVA_taxonomy.csv
I was then able to successfully use sourmash tax metagenome
with the SILVA_taxonomy.csv
file.
The identities of the query genome and the metagenome database are not matching. I am getting an error message indicating the genome is not present in the database when it is. Error message shown below.
Steps: Within the
/brymoore/sourmash_16S/sample_fasta_files
created a sqlite3 database to reduce time.Test
sourmash tax metagenome
with the sqldb.Output:
Checked for the presence of Symphytocarpus impexus within SILVA_138_taxonomy.txt.
Output:
Adapted
SILVA_138_taxonomy.csv
to include "NA" in empyt taxonomic rank positions. Believe the error is ocurring due to the difference in tax rank of some of the microorganisms. The standard seems to be 7 or 8 while the Eukrayota seem to have only 5. File location:brymoore/sourmash_16s/silva/tax_python_scriptaddna.sh
.tax_python_scriptaddna.sh:
Within the
/brymoore/sourmash_16S/sample_fasta_files
.Confirmed "NA" would be added to empty tax rank positions.Output:
Test
sourmash tax metagenome
again.Output:
Checked for the presence of KY379875 within the
SILVA_138_taxonomy.csv
.Output:
Checked for the presence of KY379875 within the query file.
Output:
Used sourmash tax metagenome with
--keep-full-identifiers
.Output: