nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Add support for single-end reads (in addition to paired-end reads) #359

Open alexhbnr opened 1 year ago

alexhbnr commented 1 year ago

Description of feature

Due to the high fragmentation of ancient DNA data, there is not much gain in using long-read DNA sequencing methods for assembling ancient DNA samples. However, many ancient DNA samples have single-end short-read sequencing data, and the two de novo assemblers, MEGAHIT and metaSPAdes, can take this type of sequence as input.

Therefore, it would be nice if one could provide this single-end short-read sequencing data in the long_reads column of the sample sheet and, at the same time, disable the long-read sequencing data steps.

d4straub commented 1 year ago

https://nf-co.re/mag/2.2.1/parameters#single_end isn't doing the job?

jfy133 commented 1 year ago

I think the point is more when you want to mix single-end sequenced libraries in the same run as libraries sequenced paired-end (and/or also possibly accounting for singletons)

alexhbnr commented 1 year ago

Yes, I am sorry for not making that clearer, @d4straub. We have encountered a number of samples for which we have both single-end and paired-end data that belong to the same sample. Therefore, we would like to be able to assemble them together without treating the different sequencing data types as separate samples and having to perform a co-assembly.

d4straub commented 1 year ago

Ah I see, yes, that's a different approach. However, I am opposed to feeding short reads into a dedicated long-read channel: that might cause too many problems further down, confuse other developers, and make the whole code less clear. Perhaps rather add a dedicated optional column for single-end short reads to the samplesheet? Or combine it with https://github.com/nf-core/mag/issues/358 to have multiple sequencing runs, including single- and paired-end libraries, available?

jfy133 commented 1 year ago

Given that for the run merging suggested in #358 we would have to change the samplesheet anyway, I think that while it would be 'more work', it would be more beneficial to have a separate singletons column.
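As a rough illustration of the separate-column idea, a samplesheet could carry an optional single-end column next to the paired-end ones. This is only a sketch in Python, not the pipeline's Nextflow code, and every column name here (in particular `short_reads_se`) is an assumption rather than the actual nf-core/mag schema.

```python
import csv
import io

# Hypothetical samplesheet layout; `short_reads_se` is an assumed column name,
# not one defined by nf-core/mag.
SAMPLESHEET = """sample,short_reads_1,short_reads_2,short_reads_se
s1,s1_R1.fastq.gz,s1_R2.fastq.gz,s1_SE.fastq.gz
s2,s2_R1.fastq.gz,s2_R2.fastq.gz,
s3,,,s3_SE.fastq.gz
"""


def parse_samplesheet(text):
    """Return one dict per row with paired-end and single-end reads split out."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        r1 = row["short_reads_1"].strip()
        r2 = row["short_reads_2"].strip()
        se = row["short_reads_se"].strip()
        # A paired-end library needs both mates or neither.
        if bool(r1) != bool(r2):
            raise ValueError(f"{row['sample']}: paired-end reads need both R1 and R2")
        # Every sample must bring at least one kind of short reads.
        if not r1 and not se:
            raise ValueError(f"{row['sample']}: no short reads provided")
        rows.append({
            "sample": row["sample"],
            "paired": (r1, r2) if r1 else None,
            "single": se or None,
        })
    return rows


rows = parse_samplesheet(SAMPLESHEET)
```

With this layout, a sample can have paired-end reads only, single-end reads only, or both, without touching the long-read column.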

jfy133 commented 1 year ago

Some observations from the current code:

to have all three if a reads3 is present

alexhbnr commented 1 year ago

Some additional comments to yours, @jfy133:

jfy133 commented 1 year ago

Some additional comments to yours, @jfy133:

  • you are right, metaSPAdes does not allow only single-end data as input; at most it allows adding this type of data alongside paired-end sequencing data. The logic that you implemented above makes sense to me. However, we might need to catch the case where someone wants to use metaSPAdes and doesn't provide paired-end data, in case this is not implemented yet

:+1:

  • regarding the fastp/adapterremoval, do you need to associate the single-end library with a paired-end one after all? You could process each paired-end and single-end library separately and only merge them after the alignment step on the sample level. At this point, the sample ID is relevant but not the library ID. Or am I missing something here?

Uhhh good point. No, I think you might be right... I think I had something in my head about the groups, but then the groups can be associated anyway.
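The two points discussed above could be sketched roughly as follows. Libraries are handled independently and only grouped by sample ID afterwards, and a check flags samples that lack paired-end data when metaSPAdes is requested. This is a Python illustration under stated assumptions, not the pipeline's Nextflow implementation: the record layout, the function names, and the assembler labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical library records: (sample_id, library_id, layout).
LIBRARIES = [
    ("s1", "lib1", "paired"),
    ("s1", "lib2", "single"),
    ("s2", "lib3", "single"),
]


def group_by_sample(libraries):
    """Group independently processed libraries by sample ID only."""
    grouped = defaultdict(lambda: {"paired": [], "single": []})
    for sample_id, library_id, layout in libraries:
        grouped[sample_id][layout].append(library_id)
    return dict(grouped)


def samples_missing_paired(grouped, assembler):
    """Return samples that lack the paired-end data metaSPAdes needs.

    Other assemblers (e.g. MEGAHIT) accept single-end-only input,
    so nothing is flagged for them.
    """
    if assembler != "metaspades":
        return []
    return sorted(s for s, libs in grouped.items() if not libs["paired"])


grouped = group_by_sample(LIBRARIES)
invalid_for_spades = samples_missing_paired(grouped, "metaspades")
invalid_for_megahit = samples_missing_paired(grouped, "megahit")
```

Here the library ID only matters during per-library processing; once grouped, the sample ID alone decides which reads are assembled together.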