nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Enter at postbinning stage #623

Open prototaxites opened 1 month ago

prototaxites commented 1 month ago

Description of feature

Decided I wanted to try and bin my data using Vamb, which isn't in the pipeline yet. Would be a useful feature to be able to supply a csv (or directory?) of bins and jump directly into the bin QC/taxonomy/annotation steps!

Might try my hand at this one if I find a bit of time, but I suspect that it will be finicky depending on what exact metadata we will want to tag onto the input bins, and that might need some discussion.

jfy133 commented 1 month ago

You can try, but tbh it likely would be faster to just add vamb to the pipeline 🤣🤣🤣

prototaxites commented 1 month ago

Having looked at Vamb: as it (ideally) requires concatenating all assemblies and renaming contigs along a complicated scheme, I think it's going to play havoc with any system that's comparing bins using contig names (DAS_Tool and Tiara)... 😅

jfy133 commented 1 month ago

Uuugghhhhh

jfy133 commented 1 month ago

I guess we will need to make a metadata file to track them or something and convert headers back?

jfy133 commented 1 month ago

"Concatenate the FASTA files together while making sure all contig headers stay unique"

If that's all it's doing, might be a reasonable thing to do upstream immediately after assembly anyway thinking about it...
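A minimal sketch of that idea, assuming each assembly is available as an iterable of FASTA lines keyed by sample name. The function names, the `C` separator, and the `{sample}{separator}{n}` numbering are illustrative choices, not something mag or Vamb provides. It gives every contig a unique header and keeps a lookup table so the original headers can be restored later:

```python
from typing import Dict, Iterable, List, Tuple

# new header -> (sample, original header); hypothetical tracking structure
HeaderMap = Dict[str, Tuple[str, str]]


def concatenate_assemblies(
    assemblies: Dict[str, Iterable[str]], separator: str = "C"
) -> Tuple[List[str], HeaderMap]:
    """Merge per-sample FASTA lines, renaming contigs to unique headers."""
    merged: List[str] = []
    mapping: HeaderMap = {}
    for sample, lines in assemblies.items():
        if separator in sample:
            raise ValueError(f"sample name {sample!r} contains the separator")
        counter = 0
        for raw in lines:
            line = raw.rstrip("\n")
            if line.startswith(">"):
                counter += 1
                parts = line[1:].split()
                original = parts[0] if parts else ""
                new_name = f"{sample}{separator}{counter}"
                mapping[new_name] = (sample, original)
                merged.append(f">{new_name}")
            elif line:
                merged.append(line)
    return merged, mapping


def restore_header(new_name: str, mapping: HeaderMap) -> Tuple[str, str]:
    """Look up the sample and original contig name for a renamed contig."""
    return mapping[new_name]


# Two samples whose assemblers produced the same contig name.
assemblies = {
    "S1": [">k141_1 flag=1", "ACGT", ">k141_2", "GGCC"],
    "S2": [">k141_1", "TTAA"],
}
merged, mapping = concatenate_assemblies(assemblies)
```

Keeping the sample alongside the original name in the table matters because the original names alone (here `k141_1`) are not unique across samples.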

prototaxites commented 1 month ago

Furthermore, if you want to use binsplitting (and you should!), your contig headers must be of the format {Samplename}{Separator}{X}, such that the part of the string before the first occurrence of {Separator} gives a name of the sample it originated from. For example, you could call contig number 115 from sample number 9 "S9C115", where "S9" would be {Samplename}, "C" is {Separator} and "115" is {X}.

So it's a little more complicated! I'm not sure if renaming all the contigs initially is the best solution disk-space wise - as we just create a copy of all assemblies with different headers for a tool that we (potentially) might not choose to run...

Not to mention mapping the reads to the concatenated assembly, and then parsing that separately through the depths workflow 🫢
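To make the naming rule quoted above concrete, here is a small sketch of how a sample name could be recovered from a contig header by splitting at the first occurrence of the separator, and how the contigs of one bin could then be grouped per sample. This is only an illustration of the rule, not Vamb's own code; `sample_of` and `split_bin_by_sample` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_of(contig: str, separator: str = "C") -> str:
    """Return the text before the first occurrence of the separator."""
    sample, found, _ = contig.partition(separator)
    if not found or not sample:
        raise ValueError(f"{contig!r} does not match {{Samplename}}{{Separator}}{{X}}")
    return sample


def split_bin_by_sample(
    contigs: Iterable[str], separator: str = "C"
) -> Dict[str, List[str]]:
    """Group the contigs of one bin by the sample they came from."""
    per_sample: Dict[str, List[str]] = defaultdict(list)
    for contig in contigs:
        per_sample[sample_of(contig, separator)].append(contig)
    return dict(per_sample)


# A bin that mixes contigs from two samples.
bin_contigs = ["S9C115", "S9C7", "S12C3"]
by_sample = split_bin_by_sample(bin_contigs)
```

Because only the first occurrence of the separator counts, sample names must not contain the separator themselves, which is one reason the renaming scheme has to be chosen with care.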

jfy133 commented 1 month ago

Ugh ok.

It's weird though as earlier the documentation implies you don't have to do all of that?

I don't have a good suggestion then 😅, sounds like it'll all be painful one way or another...

maxibor commented 1 week ago

As a stopgap measure, I've written mgenottate to do just that: genome QC, dereplication, and taxonomic annotation https://github.com/maxibor/mgenottate

prototaxites commented 1 week ago

Ah, that's cool! In the end I just ended up forking mag, deleting the first part of the main workflow and dropping bins in via directory input: https://github.com/prototaxites/mag/tree/bin_entry
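For readers wondering what a directory input could look like in its simplest form, the sketch below collects bin FASTA files from a folder and derives an ID for each from its file name. It is not taken from the fork linked above; the accepted file suffixes, the `collect_bins` name, and the record layout are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed set of uncompressed FASTA suffixes; adjust to the data at hand.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def collect_bins(bin_dir: str) -> List[Dict[str, str]]:
    """Return one record per FASTA file found directly inside bin_dir."""
    records: List[Dict[str, str]] = []
    for path in sorted(Path(bin_dir).iterdir()):
        if path.is_file() and path.suffix in FASTA_SUFFIXES:
            records.append({"id": path.stem, "fasta": str(path)})
    return records


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bin1.fa", "bin2.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">c1\nACGT\n")
    found = collect_bins(tmp)
    found_ids = [record["id"] for record in found]
```

Any extra metadata the downstream QC and taxonomy steps need (sample, assembler, binner) would have to be supplied separately, which is the open question raised at the top of this thread.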

Also, I have a separate pipeline for metagenome gene annotation that is just a couple of characters different in name from yours: https://github.com/prototaxites/mgannotate 😅