taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0

[Feature] Add support for Sourmash #114

Open Midnighter opened 1 year ago

Midnighter commented 1 year ago

I think sourmash is an interesting tool, as it is so fast in scanning vast libraries of genomes. We should add support for its output.

chrisgulvik commented 3 months ago

@Midnighter Any progress on this? I'm also interested in having it for evaluation.

According to the docs, sourmash tax is the recommended approach (sourmash lca no longer is). I don't have example commands to use, but there's a fairly recent nf wf that might be helpful if the cmds themselves are what's slowing you down here. It looks like the main steps are:

  1. sketch the input to form a sig (sourmash sketch)
  2. search the sig against a db (sourmash gather)
  3. summarize results by lineage (sourmash tax metagenome)
  4. annotate results (sourmash tax annotate)

where steps 3 and 4 could occur in parallel.
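As a rough illustration of those four steps, here is a minimal Python sketch that only builds the command lines and does not run them. The file names, the helper name `sourmash_commands`, and the sketch parameters (`k=31,scaled=1000,abund`) are assumptions made for illustration, and the flags should be checked against `sourmash <subcommand> --help` for the installed version.

```python
import shlex


def sourmash_commands(reads, database, lineages, sample="sample"):
    """Build the four sourmash invocations as argument lists.

    Nothing is executed here. All paths and the sketch parameters are
    illustrative assumptions, not values taken from this thread.
    """
    sig = f"{sample}.sig.zip"
    gather_csv = f"{sample}.gather.csv"
    return [
        # 1. sketch the input reads into a signature
        ["sourmash", "sketch", "dna", "-p", "k=31,scaled=1000,abund",
         str(reads), "-o", sig],
        # 2. search the signature against a database
        ["sourmash", "gather", sig, str(database), "-o", gather_csv],
        # 3. summarize the gather results by lineage
        ["sourmash", "tax", "metagenome", "-g", gather_csv, "-t", str(lineages)],
        # 4. annotate the gather results with lineage information
        ["sourmash", "tax", "annotate", "-g", gather_csv, "-t", str(lineages)],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running anything.
    for cmd in sourmash_commands("reads.fq.gz", "database.zip", "lineages.csv"):
        print(shlex.join(cmd))
```

Steps 3 and 4 both read only the gather CSV and the lineage file, which is why they do not depend on each other.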

The bioconda package is up to date, the databases are well described, and the software itself has been very well maintained by @ctb et al. for almost a decade now. I'm also including him to give him an opportunity to suggest alternative cmds for generalized classification, in case the above steps are less than ideal.

ctb commented 3 months ago

I STAND READY

😆

ctb commented 3 months ago

can anyone give an example of one or two use cases so I can read the docs a bit with that in mind? would the standardize command be a good place to start?

might be fun to add sylph support as well, since people are liking that a lot (I'm not a maintainer - that would be @bluenote-1577)

Midnighter commented 3 months ago

Thank you for your interest @chrisgulvik 🙂. As this is taxpasta and not the taxprofiler pipeline, the exact commands actually don't matter in this context. The only things required from a technical perspective are a few example profiles created with sourmash and maybe a clear understanding of what variation in column output is possible/desirable/supportable.
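One way to get a first picture of that column variation is to compare the headers of a few example profiles. The following is a minimal sketch using only the Python standard library; the helper name `column_variation` and the column names in the usage example are hypothetical and not taken from real sourmash output.

```python
import csv
import io


def column_variation(profiles):
    """Compare the CSV headers of several profiles.

    `profiles` maps a label to the CSV text of one profile. Returns a
    tuple (shared, optional): columns present in every profile, and
    columns present in only some of them.
    """
    if not profiles:
        raise ValueError("at least one profile is required")
    header_sets = []
    for text in profiles.values():
        reader = csv.reader(io.StringIO(text))
        # The first row of each profile is treated as its header.
        header_sets.append(set(next(reader)))
    shared = set.intersection(*header_sets)
    optional = set.union(*header_sets) - shared
    return sorted(shared), sorted(optional)


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    examples = {
        "profile_a": "lineage,fraction\nd__Bacteria,0.5\n",
        "profile_b": "lineage,fraction,rank\nd__Bacteria,0.5,superkingdom\n",
    }
    shared, optional = column_variation(examples)
    print("shared:", shared)
    print("optional:", optional)
```

Columns that show up in every example are candidates for a required schema, while the rest would need a decision on whether they are supported as optional.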

The major impediment is my time really, as I have moved into a different job, and taxpasta is now essentially a hobby project among (several) others. We have a fairly decent guide for how to add support for new types of profiles (https://taxpasta.readthedocs.io/en/latest/contributing/supporting_new_profiler/), so if you want to give it a shot, I'm happy to provide guidance and review code.

jfy133 commented 3 months ago

Agreed! A sourmash subworkflow was actually already started on the taxprofiler repo (it's in a draft state at the moment), but the person taking that on seems not to have been able to finish it. On 'our side' we normally add tools to taxpasta once they're in the pipeline, as then we know exactly what is available etc.

That said, I'm also happy to guide on the taxprofiler/nextflow side of things (I'm still on half-time parental leave until August), if someone wants to take over the half-done subworkflow! We have a profiler contribution guide for that too.

And agreed sylph also looks very interesting 👍