sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 79 forks source link

fixing and debugging `LCA_SqliteDatabase` problems #2407

Open ctb opened 1 year ago

ctb commented 1 year ago

(Should this maybe be moved over to https://github.com/sourmash-bio/sourmash-examples/?)

So @bluegenes built an LCA_SqliteDatabase using:

sourmash lca index gtdb-rs207.taxonomy.with-strain.csv \
    gtdb-rs207.protein.k10.scaled1000.lca.sql \
    gtdb-rs207.protein.k10.scaled1000.zip \
    -F sql --protein -k 10 --no-dna --scaled 1000

and got the following warnings:

WARNING: 12139 duplicate signatures.
WARNING: no lineage provided for 305403 signatures.
WARNING: no signatures for 317542 spreadsheet rows.
WARNING: 317542 unused lineages.
WARNING: 317542 unused identifiers.

Yay warnings! Boo warnings!

What's going on?

First: fixing it.

No taxonomies were loaded, and sourmash SQLite LCA databases without taxonomies are just sourmash SQLite databases, so it should be possible to add a taxonomy ;).

So I tried:

sourmash tax prepare -t \
    /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    -o gtdb-rs207.protein.k10.scaled1000.lca.sql \
    -F sql

and got back:

ERROR while saving!
taxonomy table already exists in 'gtdb-rs207.protein.k10.scaled1000.lca.sql'

because even though the taxonomy table was empty, they still existed.

So I ran the sqlite3 command line interface -

sqlite3 gtdb-rs207.protein.k10.scaled1000.lca.sql

and then inside sqlite,

sqlite> select * from sourmash_internal;

which returned:

SqliteIndex|1.0
SqliteManifest|1.0
SqliteLineage|1.0

The key/value pair SqliteLineage plus the existence of the table sourmash_taxonomy were preventing sourmash tax prepare from running.

The following commands removed these:

sqlite> delete from sourmash_internal where key='SqliteLineage';
sqlite> drop table sourmash_taxonomy;

and then I could re-run:

sourmash tax prepare -t \
    /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    -o gtdb-rs207.protein.k10.scaled1000.lca.sql \
    -F sql

Now sourmash tax summarize produces good results:

loading taxonomies...
...loaded 317542 entries.
number of distinct taxonomic lineages: 317542
rank superkingdom:        2 distinct taxonomic lineages
rank phylum:              189 distinct taxonomic lineages
rank class:               481 distinct taxonomic lineages
rank order:               1593 distinct taxonomic lineages   
rank family:              4107 distinct taxonomic lineages   
rank genus:               16686 distinct taxonomic lineages  
rank species:             65703 distinct taxonomic lineages  
rank strain:              317542 distinct taxonomic lineages 

as does sourmash sig summarize:

path filetype: LCA_SqliteDatabase
location: gtdb-rs207.protein.k10.scaled1000.lca.sql
is database? yes
has manifest? yes
num signatures: 305403
** examining manifest...
total hashes: 318937259
summary of sketches:
   305403 sketches with protein, k=10, scaled=1000    318937259 total hashes

yay w00t!

why did the sourmash lca index fail!?

Let's diagnose that - first, produce a smaller taxonomy:

head -20 /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    > subtax.csv

and subset the database:

sourmash sig cat gtdb-rs207.protein.k10.scaled1000.lca.sql --picklist subtax.csv:ident:ident -o subsig.zip

then this reproduces the problem, but, like, faster ;) -

sourmash lca index subtax.csv \
    subsig.lca.sql subsig.zip -F sql \
    --protein -k 10 --no-dna --scaled 1000

adding a --report out.txt, I see:

Unused identifiers: 19
GCF_014333955.1
GCF_002513785.1
GCF_900482335.1
GCF_008388435.1
...
No lineage provided for these identifiers: 19
GCF_000566285.1 s__Escherichia coli
GCA_900448105.1 s__Escherichia coli
GCF_003460375.1 s__Escherichia coli
...

so it looks like identifiers are not being split, oops.

Fixed with:

sourmash lca index subtax.csv subsig.lca.sql subsig.zip \
    -F sql --protein -k 10 --no-dna --scaled 1000 \
    --keep-identifier-versions --split-identifiers

which then yields

19 distinct identities in spreadsheet out of 19 rows.
19 distinct lineages in spreadsheet out of 19 rows.
... loaded 1 signatures.
loaded 3887 hashes at ksize=10 scaled=1000
19 assigned lineages out of 19 distinct lineages in spreadsheet.
19 identifiers used out of 19 distinct identifiers in spreadsheet

tada!

An alternate (and safer?) construction method -

First, build a SQLite database:

sourmash sig cat subsig.zip -o try.lca.sqldb

then add taxonomy:

sourmash tax prepare --keep-identifier-versions \
    -t subtax.csv -o try.lca.sqldb -F sql

Double check - do all these work?

Extract a query:

sourmash sig cat --include GCF_000459755.1 subsig.zip -o query.sig

Check subsig.lca.sql:

% sourmash lca classify --query query.sig --db subsig.lca.sql

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

Check try.lca.sqldb:

sourmash lca classify --query query.sig --db try.lca.sqldb 

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

Check full GTDB:

 sourmash lca classify --query query.sig --db gtdb-rs207.protein.k10.scaled1000.lca.sql

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
notify doneying GCF_000459755.1 s__Escherichia coli (file 1 of 1)
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,
classified 1 signatures total

yay! w00t! it all works!

Exploring questions -

Why did we need --no-dna? Or did we?

sourmash lca index subtax.csv try2.lca.sql subsig.zip \
    -F sql --protein -k 10 --scaled 1000 \
    --keep-identifier-versions --split-identifiers

followed by

sourmash lca classify --query query.sig --db try2.lca.sql 

worked:

== This is sourmash version 4.6.2.dev7+g1c8b1659. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

so Tessa was just being very cautious ;).

ctb commented 1 year ago

ref https://github.com/sourmash-bio/sourmash/issues/2198 - the tax code and commands are much more straightforward than the lca code these days!

ctb commented 1 year ago

note: on farm, all of this was done in ~ctbrown/scratch/2022-tessa-lca-db-fix