sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

doco: GTDB R207 database inconsistency #3118

Open wwood opened 7 months ago

wwood commented 7 months ago

Hi there,

I've been having some trouble getting R207 databases to work with sourmash tax metagenome. I'm using 4.8.8 from conda.

After running sketch, the instructions at https://sourmash.readthedocs.io/en/latest/tutorial-lemonade.html#id7 say

# use tax metagenome to classify the metagenome
sourmash tax metagenome -g SRR8859675.x.gtdb.csv \
    -t gtdb-rs207.taxonomy.sqldb -F human -r order

$ sourmash tax metagenome --gather-csv GCA_020052375.1_genomic.gather_gtdbrs207_reps.csv --taxonomy ../../output_sourmash/sourmash/data/gtdb-rs207.taxonomy.sqldb

There doesn't appear to be any *.sqldb available, now we should just use the taxonomy CSV?

OK, so

The lineage spreadsheet (for sourmash tax commands) is available at the species level and at the genome level.

"I only need the species reps" I think, so I'll just download the first one. But that fails:

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results from 'GCA_020052375.1_genomic.gather_gtdbrs207_reps.csv'.
loaded results for 1 queries from 1 gather CSVs
of 1 gather results, lineage assignments for 1 results were missed.
The following are missing from the taxonomy information: GCF_000299365
Starting summarization up rank(s): strain, species, genus, family, order, class, phylum, superkingdom
Traceback (most recent call last):
  File "/home/woodcrob/e/sourmash-v4.8.8/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/cli/tax/metagenome.py", line 150, in main
    return sourmash.tax.__main__.metagenome(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/__main__.py", line 203, in metagenome
    tax_utils.write_summary(
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 1124, in write_summary
    header, summary = q_res.make_full_summary(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 2581, in make_full_summary
    self.check_summarization()
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 2542, in check_summarization
    raise ValueError("lineages not summarized yet.")
ValueError: lineages not summarized yet.

The genome one worked, so I got there in the end.

I'm a bit confused why the species one has ident entries along the lines of s__Escherichia_coli when sketch doesn't generate IDs of this type. Maybe I'm missing something.

Anyway, HTH, ben

ctb commented 7 months ago

> There doesn't appear to be any *.sqldb available, now we should just use the taxonomy CSV?

There are some instructions above that need to be run - see "Let’s index the taxonomy database using SQLite, for faster access later on:".

sourmash tax prepare -t gtdb-rs207.taxonomy.csv \
    -o gtdb-rs207.taxonomy.sqldb -F sql

That having been said, you can use the taxonomy CSV too! It'll just take longer to load each time.

> "I only need the species reps" I think, so I'll just download the first one.

Right! It needs to match the content of the database you're searching, which (in this case) is all of the GTDB genomes, not just the species-level representatives. We'll fix the tutorial to make this clear!

The download link is in the tutorial, under "We also want to download the accompanying taxonomy spreadsheet:"
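As a rough way to check this before running `sourmash tax`, the sketch below compares the identifiers in a gather CSV against those in a taxonomy CSV. It is not part of sourmash. It assumes the gather CSV has a `name` column whose first whitespace-separated token is the accession, that the taxonomy CSV has an `ident` column, and that version suffixes such as `.1` should be ignored (the error above reports `GCF_000299365` without one). The function names are made up for illustration.

```python
import csv


def strip_version(ident: str) -> str:
    """Drop a trailing '.N' version, e.g. 'GCF_000299365.1' -> 'GCF_000299365'."""
    base, dot, version = ident.rpartition(".")
    return base if dot and version.isdigit() else ident


def gather_idents(gather_csv: str) -> set:
    """Collect accessions from the (assumed) 'name' column of a gather CSV."""
    with open(gather_csv, newline="") as fp:
        return {
            strip_version(row["name"].split(maxsplit=1)[0])
            for row in csv.DictReader(fp)
            if row["name"].strip()
        }


def taxonomy_idents(taxonomy_csv: str) -> set:
    """Collect identifiers from the (assumed) 'ident' column of a taxonomy CSV."""
    with open(taxonomy_csv, newline="") as fp:
        return {strip_version(row["ident"].strip()) for row in csv.DictReader(fp)}


def missing_idents(gather_csv: str, taxonomy_csv: str) -> list:
    """Return gather accessions that have no row in the taxonomy CSV."""
    return sorted(gather_idents(gather_csv) - taxonomy_idents(taxonomy_csv))
```

Calling `missing_idents()` with the gather CSV and the downloaded taxonomy CSV returns the accessions with no lineage. A non-empty list suggests the taxonomy file does not match the database that was searched.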

> But that fails:

Well, and our error message certainly needs some help... we'll fix, thanks!

> I'm a bit confused why the species one has ident entries along the lines of s__Escherichia_coli when sketch doesn't generate IDs of this type. Maybe I'm missing something.

Oh dear, that does look incorrect to me - I wonder why we did that... I'll see if I can fix. Thank you very much for reporting all of this!
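For anyone wanting to spot this kind of mismatch in a taxonomy CSV, here is a minimal sketch (not sourmash code). It assumes the file has an `ident` column and that genome identifiers follow the usual `GCA_`/`GCF_` accession pattern of nine digits plus an optional version. The function name and the `limit` parameter are hypothetical.

```python
import csv
import re

# Assumed accession shape: GCA_/GCF_, nine digits, optional ".N" version.
ACCESSION = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def non_accession_idents(taxonomy_csv: str, limit: int = 5) -> list:
    """Return up to `limit` ident values that do not look like accessions."""
    odd = []
    with open(taxonomy_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            ident = row["ident"].strip()
            if not ACCESSION.match(ident):
                odd.append(ident)
                if len(odd) >= limit:
                    break
    return odd
```

If this returns values such as `s__Escherichia_coli`, the `ident` column holds species names rather than accessions, so gather results keyed by accession will not find a lineage in that file.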

ctb commented 7 months ago

Fixing link to species database here: https://github.com/sourmash-bio/sourmash/pull/3119

wwood commented 7 months ago

Thanks for the quick response @ctb - makes sense - fine by me to close this issue.