nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Add separate Nanopore input option #275

Open ljmesi opened 2 years ago

ljmesi commented 2 years ago

Is your feature request related to a problem? Please describe

At the moment there is an input option to use only short reads.

Describe the solution you'd like

An additional option to use Nanopore long reads fastq files as input would be good to have.

d4straub commented 2 years ago

Totally agree. That would be a major enhancement, though also a significant amount of work. It requires a Nanopore-only assembler, e.g. Flye, and mapping processes for contig/MAG quantification. If you have favorite programs, let us know. This is definitely on my wish-list as well, but it might take a while. You are welcome to add it if you feel like it!

ljmesi commented 2 years ago

Thank you for your feedback @d4straub! I'm wondering whether these parts could be included initially:

  1. Add Nanopore reads as input to the pipeline
  2. Use Porechop for adapter/quality trimming
  3. Remove host reads with minimap2
  4. Classify the reads with centrifuge/kraken

Do these sound like good steps to take in order to chop this broader task into smaller subtasks?

d4straub commented 2 years ago

Oh, there might be a slight misunderstanding:

Currently, it is already possible to use Nanopore data, but only in addition to Illumina data, not on its own (i.e. Nanopore-only). Adapter trimming (Porechop), quality visualisation (NanoPlot), quality filtering (Filtlong) and Lambda read removal (NanoLyse) are also implemented. Direct host read removal is not yet implemented; it happens only indirectly via Filtlong, which depends on Illumina reads (i.e. Nanopore reads that are not covered by Illumina reads are discarded, so when the Illumina reads contain no host data, the filtered Nanopore reads will not either). Hybrid (Illumina & Nanopore) assembly is already implemented with hybridSPAdes.

Having said that, the pipeline does not (yet) support Nanopore-only assembly, and this is what I was referring to. In case there are no Illumina reads, Filtlong does not work (because the settings require Illumina reads) and no Nanopore-only assembler (such as Flye) is implemented in the pipeline yet. Additionally, Nanopore reads are currently not used in centrifuge/kraken (that might be relatively easy to add, actually).
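
For readers skimming the thread, the three input situations described in this comment can be summarised as a small lookup. This is only an illustrative sketch of the state of things as described here, not code from the pipeline; the function name `describe_input` and the dictionary keys are hypothetical.

```python
def describe_input(has_short_reads: bool, has_long_reads: bool) -> dict:
    """Summarise what the comment above says about each input combination.

    Hypothetical helper for illustration only; it mirrors the prose of the
    comment and does not reflect the pipeline's actual implementation.
    """
    if not has_short_reads and not has_long_reads:
        raise ValueError("at least one read type is required")

    if has_short_reads and has_long_reads:
        return {
            "mode": "hybrid",
            "supported": True,
            # Long-read preprocessing steps named in the comment.
            "long_read_steps": ["Porechop", "NanoPlot", "Filtlong", "NanoLyse"],
            "assembler": "hybridSPAdes",
        }

    if has_short_reads:
        # The comment does not name the short-read assemblers, so none is listed.
        return {"mode": "short-read only", "supported": True,
                "long_read_steps": [], "assembler": None}

    # Nanopore-only: Filtlong settings need Illumina reads, and no
    # Nanopore-only assembler (such as Flye) is implemented yet.
    return {"mode": "Nanopore-only", "supported": False,
            "long_read_steps": [], "assembler": None}
```

Calling `describe_input(False, True)` returns an entry with `supported` set to `False`, which is the gap this issue is about.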

ljmesi commented 2 years ago

Thank you @d4straub for the clarification! So if I understand correctly there should be a standalone way of having Nanopore reads fastq files as input. I'm working for Genomic Medicine Sweden and we've been hoping to have a Nanopore-only reads classification directly without assembly (if that seems suitable for this pipeline, possibly using centrifuge/kraken2). Would these steps seem okay additions to the pipeline? At least with kraken2, we have experience of using it with Nanopore reads for classification with seemingly good results.

d4straub commented 2 years ago

Yes, there could be a way of having Nanopore reads fastq files without Illumina data as input. And that would be desirable in this pipeline. But it would be important that those Nanopore reads are not taken only for Kraken2 but also for assembly, because this is an assembly focused pipeline. And as far as I understand, assembly is not your primary objective (please correct me if I am wrong).

There is a new pipeline in the making, see https://nf-co.re/taxprofiler, that focuses only on taxonomic profiling. However, it might not allow Nanopore input yet, and it is under construction. So if you are not interested in assembly, and you consider implementing it yourself, I'd recommend participating in nf-core/taxprofiler.

ljmesi commented 2 years ago

Thank you for your response @d4straub and thank you especially for the recommendation about taxprofiler! It looks like taxprofiler more closely matches what we need in Genomic Medicine Sweden, so I will contribute by adding the feature to taxprofiler instead. I will remove myself as an assignee but will not close the issue, in case someone else would like to contribute by adding Nanopore assembly based classification.

abu85 commented 1 year ago

I thought my questions fit here. I have a question about adding a pooled Nanopore sample to the pipeline and a question about a subsequent analysis based on it. I want a comprehensive hybrid assembly from both short reads (Illumina) and long reads (Nanopore). Unfortunately, I had to pool samples before Nanopore sequencing, so I have fifty individual samples as short reads but one (combined) sample as Nanopore reads. So my questions are:

  1. How should I add the Nanopore sample to the samplesheet (my plan is to add it beside one of the short read samples)?
  2. I would like to do binning group-wise on this hybrid assembly based on the short read samples. Will this setup in the samplesheet cause a problem later on?
  3. How can I classify Nanopore reads in this pipeline (there is no Kraken2 classification option for long reads)? Any suggestions?
  4. I have many fastq files in the Nanopore sample. Should I combine them all into one before running?

Thanks for your attention.

d4straub commented 1 year ago

That question would be better asked via nf-core slack (see https://nf-co.re/join) channel "mag". But because I am already here, short answers:

  1. once per row, i.e. once per Illumina sample, is the only way, but that would generate huge overhead in the pipeline. I am not sure I got it right, but of course you cannot make a co-assembly that way (using --coassemble_group).
  2. binning group wise is no problem, because it only depends on the short reads, it does not use the long reads.
  3. use nf-core/taxprofiler, now released
  4. yes, but your data is non-optimal (Nanopore not separated into samples). Are you sure that your "many fastq files" are not separated by sample? After all, Nanopore also allows [de]multiplexing.
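
Answers 1 and 4 can be sketched roughly as follows: combine the many Nanopore fastq files into one, then list that single file on every row of the samplesheet. This is a minimal sketch under stated assumptions, not an official recipe. The function names are hypothetical, and the column names (`sample`, `group`, `short_reads_1`, `short_reads_2`, `long_reads`) are an assumption that should be checked against the usage documentation of the pipeline version in use.

```python
import csv
import shutil


def concatenate_fastq(parts, combined):
    """Append FASTQ files byte-wise into a single file.

    This works for gzip-compressed FASTQ because concatenated gzip members
    form a valid gzip stream. It also works for plain-text FASTQ as long as
    every part ends with a newline. Do not mix compressed and plain files.
    """
    with open(combined, "wb") as out:
        for part in sorted(parts):  # sorted only to make the order reproducible
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return combined


def write_samplesheet(short_read_samples, pooled_long_reads, samplesheet):
    """Write one row per Illumina sample, each pointing at the same long reads.

    short_read_samples: iterable of (sample, group, reads_1, reads_2) tuples.
    The header below is an ASSUMED column layout, not taken from this thread.
    """
    header = ["sample", "group", "short_reads_1", "short_reads_2", "long_reads"]
    with open(samplesheet, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for sample, group, reads_1, reads_2 in short_read_samples:
            # The pooled Nanopore file is repeated on every row, which is
            # the source of the overhead mentioned in answer 1.
            writer.writerow([sample, group, reads_1, reads_2, pooled_long_reads])
    return samplesheet
```

With fifty Illumina samples this produces fifty rows that all reference the same pooled file, so the long-read preprocessing would be repeated for each row.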

abu85 commented 1 year ago

Thanks,

  1. I want to use this pooled long-read sample (where a bit of every sample was merged into one). I thought I could include it to improve the analysis, but from your answer it seems that I cannot do so, or did I misunderstand? Do you have any suggestions here?
  4. No, they are not separated by samples.

dawnmy commented 1 year ago

Agree. It is important to support long-read-only input data, as long-read sequencing is becoming more and more popular.

willros commented 1 year ago

Hi,

Any updates or fresh thoughts on adding a pure long-read track to the pipeline? I was checking out a few other nf-core pipelines and noticed that some, like viralrecon, have already embraced this idea. I would like to help set up a dedicated nanopore/long read track for this pipeline.

Should this discussion be moved to the Slack channel instead?

Thanks! William

d4straub commented 1 year ago

As far as I know there are no new thoughts, except that the pipeline is getting huge, so additions should be kept to a minimum. I still think that Nanopore-only assembly should be possible within the nf-core/mag pipeline. General planning/updates should be here I think; more interactive discussions are more convenient in Slack imho.

willros commented 1 year ago

Hi again,

We're a group of people involved in Clinical Genomics in Sweden, and we're eager to introduce a dedicated long read track for metagenomic genome assembly. After chatting with @jfy133 , we've decided to first get together to figure out what features and functionality we want to include, then we'll dive into the how and where of adding this new track.

We're well aware that there's an ongoing discussion about the existing code base, and it might be a bit tricky to shoehorn something new into the current metagenomic assembly process, especially with the potential need for significant changes and rebuilds. So, one idea would be to start fresh with a completely new pipeline for long read implementation.

Perhaps we can keep the discussion going here, so others can participate with their thoughts on architecture and functionality.

Thanks! William

jfy133 commented 1 year ago

Small comment for now: I don't think we need an entire rewrite of the pipeline per se, but the purely long-read functionality could be a separate fresh workflow (like viralrecon with Illumina vs. Nanopore data).