Closed andrewjmc closed 3 years ago
hi @andrewjmc, what happens if you do a positive control, e.g. run with something that is definitely in the database?
(It shouldn't be because of the 4.0 release; the databases are backwards compatible with 3.5.x. But of course there may be a bug!)
Yeah, really silly of me not to try that first! An E. coli genome works fine with the k=31 database. I guess I just felt optimistic that something should come up when I searched all the MAGs from a human throat sample with k=21!
:) well, good to know that worked! thanks for reporting back!
I too am mildly surprised that there are no matches! A few thoughts -
--threshold=0?) would find something. Alternatively, searching all the GTDB genomes (which we don't currently provide in prepared database format, sorry!) could work better, too.
gather and not search, as search takes into account the size of the query MAG, and if it has a lot of extra "stuff" in it this will lower the score. (Still, --threshold=0 for search should overcome this if there are ANY shared hashes.)
We're planning on releasing a full set of GTDB genomes in a database format soonish, but it might be a few months; we need to merge a few things that are still work-in-progress first.
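To illustrate the point about query size, here is a minimal sketch using plain Python sets as stand-ins for hash sketches. It assumes a Jaccard-style score for search and a containment-style score for gather; the hash values, set sizes, and function names are invented for illustration and are not sourmash's API.

```python
# Toy illustration: how extra content in a query MAG lowers a Jaccard-style
# score but leaves a containment-style score unchanged.
# All hash values and sizes below are made up; this is not sourmash code.

def jaccard(a: set, b: set) -> float:
    """Shared hashes divided by all distinct hashes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(query: set, reference: set) -> float:
    """Fraction of the reference's hashes that also occur in the query."""
    return len(query & reference) / len(reference) if reference else 0.0

# Hypothetical reference genome sketch: 100 hashes.
reference = set(range(0, 100))

# Hypothetical MAG sketch: 50 hashes shared with the reference,
# plus 450 hashes of unrelated extra "stuff".
query_mag = set(range(50, 100)) | set(range(1000, 1450))

j = jaccard(query_mag, reference)
c = containment(query_mag, reference)

print(f"jaccard:     {j:.3f}")
print(f"containment: {c:.3f}")
```

In this toy case the extra 450 hashes enlarge the union and pull the Jaccard-style score well below the containment-style score, even though the number of shared hashes is the same in both.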
Threshold 0 made no difference for this sample. Is lca sample based on search? I'll play around with gather next.
Plenty to learn!
lca sample - do you mean summarize here? Or something else?
We (well, @bluegenes) have also been working on protein searches that (spoiler alert) do a very nice job of supporting more distant taxonomy searches. If you are up for it, I'd love to try running some of your zero match signatures against these databases. (We have been hesitant to publicize this / make this functionality available because we are still benchmarking it and understanding the details, but our confidence is pretty high at this point, so maybe it's time to decloak!)
If you're interested in trying it out, please compute signatures like so:
sourmash sketch translate -p scaled=100,k=7,k=9,k=10,k=11
and send them to me at ctbrown@ucdavis.edu, and we'll send you back results. We're also happy to try to figure out how to make the databases available to you, but this is enough of a moving target that it might make sense for us to just send you the results and then you can engage more closely if the results are interesting!
Yes, you're right, I meant lca summarize!
I'd love to send some signatures your way and will do it ASAP!
Any preliminary way of exploring many MAGs would be great -- the existing tools are computationally intensive, and fall down frequently on HPC (in my hands at least). I've got ~30 samples with MAGs run from a while ago, but over 200 with assemblies now, just awaiting MAG generation.
Thanks for the offer.
Best wishes,
Andrew
yes, this is a key sourmash use case!
MAGs in process (single assembly and MAG generation, rather than co-assembly)
Hello,
I'm testing sourmash 4.0.0 for identifying components of MAGs.
I ran
sourmash lca summarize --query xxx.sig --db gtdb-release89-k21.lca.json.gz -o xxx._sourmash_GTDB.csv
This is using the first release of the GTDB LCA databases you produced (the later release seems to include only the SBT databases).
The commands ran fine but across 74 MAGs, no results were found and the output CSVs are empty.
Could this be because of breaking changes with version 3.x, and if so is it possible to recreate the databases?
Thanks for your advice,
Andrew