sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

No results from sourmash lca summarize with GTDB database #1447

Closed andrewjmc closed 3 years ago

andrewjmc commented 3 years ago

Hello,

I'm testing sourmash 4.0.0 for identifying components of MAGs.

I ran:

sourmash lca summarize --query xxx.sig --db gtdb-release89-k21.lca.json.gz -o xxx._sourmash_GTDB.csv

This is using the first release of the GTDB LCA databases you produced (the later release seems to include only the SBT databases).

The commands ran fine but across 74 MAGs, no results were found and the output CSVs are empty.

Could this be because of breaking changes with version 3.x, and if so is it possible to recreate the databases?

Thanks for your advice,

Andrew

ctb commented 3 years ago

hi @andrewjmc, what happens if you do a positive control, e.g. run with something that is definitely in the database?

(It shouldn't be because of the 4.0 release, and the databases are backwards compatible with 3.5.x, but of course there may be a bug!)

andrewjmc commented 3 years ago

Yeah, really silly of me not to try that first! An E. coli genome works fine with the k=31 database. I guess I just felt optimistic that something should come up when I searched all the MAGs from a human throat sample with k=21!

ctb commented 3 years ago

:) well, good to know that worked! thanks for reporting back!

I too am mildly surprised that there are no matches! A few thoughts -

we're planning on releasing a full set of GTDB genomes in a database format soonish, but it might be a few months; we need to merge a few things that are still work-in-progress first.

andrewjmc commented 3 years ago

Threshold 0 made no difference for this sample. Is lca sample based on search? I'll play around with gather next.

Plenty to learn!
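
For anyone repeating this check over many MAGs, the loop below is a minimal sketch of one way to do it. It is not something posted in the thread. The helper names (`summarize_sigs`, `list_empty_results`), the `*.sig` directory layout, the `.lca.csv` output suffix, and the `DRY_RUN` switch are all hypothetical. It also assumes that `sourmash lca summarize` accepts `--threshold` and that a result file with at most one line means "no matches"; both are worth confirming against your installed version.

```shell
# Hypothetical helpers, not part of sourmash.
# Assumption: signatures are *.sig files in one directory.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
summarize_sigs() {
    sig_dir=$1
    db=$2
    threshold=${3:-0}   # defaults to 0 here, the value tried in the thread
    for sig in "$sig_dir"/*.sig; do
        [ -e "$sig" ] || continue          # skip the literal glob if nothing matched
        out="${sig%.sig}.lca.csv"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo sourmash lca summarize --query "$sig" --db "$db" --threshold "$threshold" -o "$out"
        else
            sourmash lca summarize --query "$sig" --db "$db" --threshold "$threshold" -o "$out"
        fi
    done
}

# Print every result CSV with at most one line
# (assumed to mean: empty file, or header only).
list_empty_results() {
    for csv in "$1"/*.lca.csv; do
        [ -e "$csv" ] || continue
        lines=$(awk 'END { print NR }' "$csv")
        if [ "$lines" -le 1 ]; then
            echo "$csv"
        fi
    done
    return 0
}
```

Usage might look like `DRY_RUN=0 summarize_sigs sigs/ gtdb-release89-k21.lca.json.gz 0` followed by `list_empty_results sigs/`, where `sigs/` is a placeholder directory.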

ctb commented 3 years ago

lca sample - do you mean summarize, here? or something else?

We (well, @bluegenes) have also been working on protein searches that (spoiler alert) do a very nice job of supporting more distant taxonomy searches. If you are up for it, I'd love to try running some of your zero-match signatures against these databases. (We have been hesitant to publicize this / make this functionality available because we are still benchmarking it and understanding the details, but our confidence is pretty high at this point, so maybe it's time to decloak!)

If you're interested in trying it out, please compute signatures like so:

sourmash sketch translate -p scaled=100,k=7,k=9,k=10,k=11

and send them to me at ctbrown@ucdavis.edu, and we'll send you back results. We're also happy to try to figure out how to make the databases available to you, but this is enough of a moving target that it might make sense for us to just send you the results and then you can engage more closely if the results are interesting!
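
With many MAGs, the sketch command above could be wrapped in a loop. The following is a minimal sketch under stated assumptions, not the maintainers' recommended workflow. The helper name `sketch_mags`, the `*.fa` extension, the one-signature-per-MAG output layout, and the `DRY_RUN` switch are hypothetical. Only the `-p` parameter string is taken from the comment above.

```shell
# Hypothetical helper, not part of sourmash.
# Assumption: each MAG is a nucleotide FASTA file named *.fa in one directory.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
sketch_mags() {
    mag_dir=$1
    sig_dir=$2
    mkdir -p "$sig_dir"                    # created even in dry-run mode
    for fasta in "$mag_dir"/*.fa; do
        [ -e "$fasta" ] || continue        # skip the literal glob if nothing matched
        name=$(basename "$fasta" .fa)
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo sourmash sketch translate -p scaled=100,k=7,k=9,k=10,k=11 -o "$sig_dir/$name.sig" "$fasta"
        else
            sourmash sketch translate -p scaled=100,k=7,k=9,k=10,k=11 -o "$sig_dir/$name.sig" "$fasta"
        fi
    done
}
```

Running `sketch_mags mags/ sigs/` first (dry run) shows the commands that would be issued; `DRY_RUN=0 sketch_mags mags/ sigs/` executes them. The directory names are placeholders.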

andrewjmc commented 3 years ago

Yes, you're right, I meant lca summarize!

I'd love to send some signatures your way and will do it ASAP!

Any preliminary way of exploring many MAGs would be great -- the existing tools are computationally intensive, and fall down frequently on HPC (in my hands at least). I've got ~30 samples with MAGs run from a while ago, but over 200 with assemblies now, just awaiting MAG generation.

Thanks for the offer.

Best wishes,

Andrew

ctb commented 3 years ago

yes, this is a key sourmash use case!

andrewjmc commented 3 years ago

MAGs in process (single assembly and MAG generation, rather than co-assembly)