jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

GTDB with SqueezeMeta #635

Closed mscarbor closed 9 months ago

mscarbor commented 1 year ago

Long time user, first time poster... Thanks for all that you and your team do.

Are there any suggestions or plans for incorporating GTDB taxonomic classifications with SqueezeMeta? As of now, it looks like a GTDB database is not available. I would love to be able to assign taxonomy to long reads with SqueezeMeta. It's a big ask, I know, but I'm throwing it out there.

fpusan commented 1 year ago

You can assign taxonomy to long reads with SqueezeMeta without assembly (see the script in utils/sqm_longreads.pl). However, I do have plans to incorporate GTDB for MAG/pangenome analysis, hopefully later this year.

jtamames commented 1 year ago

Hello Matt

Thanks for being a loyal user! We would love to integrate GTDB into SqueezeMeta. For now, the enormous size of the databases they use is preventing us from doing so. But I promise to rethink this issue and see if we can do it somehow. As Fernando likes to say, stay tuned.

Best,

J


mscarbor commented 1 year ago

Javier/Fernando, I appreciate the quick responses! It makes sense that this would be a significant effort... Reach out if you ever need a letter from a user/PI to support the effort.

fpusan commented 11 months ago

Work towards this has started in a0ae6a7 (https://github.com/jtamames/SqueezeMeta/commit/a0ae6a7b3f0fd12af9b1b07b581aef431d747494), though for now we are using it to classify bins rather than long reads. I see that you'd like to use it to classify individual long reads. Have you seen it used like that elsewhere?

mscarbor commented 11 months ago

No, actually! Since the GTDB is based on just the concatenated sequences of its marker genes, this wouldn't actually work, would it? Having it for bins would be amazing though :-)


fpusan commented 9 months ago

Just found this again and thought a bit more about it.

It may be cool if the reads are long enough to contain many marker genes, but in practice your mileage may vary, since for now PacBio reads are not that long. Also, I am not sure whether the different genes have the same phylogenetic resolution (I wouldn't be surprised if they don't).

So in practice I suspect this method would work very differently for different reads (from no classification at all to super accurate classification, and everything in between, depending on the number and "power" of the marker genes contained in each read). It may still be a cool idea, but I don't think it is something we are going to prioritize in the near future.

GTDB-Tk for bins is, however, already working in our dev version! https://anaconda.org/fpusan/squeezemeta-dev For now, the GTDB-Tk results are added to the bin table as an extra column (if the --gtdbtk flag was provided in the initial SqueezeMeta.pl call). I expect this to make it into an "official" release sometime in Q1 2024, once we have time to do more testing and bring all the new features to our usual levels of stability (which is to say, somewhat stable...)

fpusan commented 8 months ago

However, someone else seems to be doing what you suggested... https://www.biorxiv.org/content/10.1101/2023.12.17.572079v1

mscarbor commented 7 months ago

Ok, amazing. Thanks for sharing this.
