Add support for single-end reads (in addition to paired-end reads)

alexhbnr commented 1 year ago

Description of feature

Due to the high fragmentation of ancient DNA data, there is not much gain in using long-read DNA sequencing methods for assembling ancient DNA samples. However, many ancient DNA samples have single-end short-read sequencing data, and the two de novo assemblers, MEGAHIT and metaSPAdes, have the option to use this type of data as input.

Therefore, it would be nice if one could provide this single-end short-read sequencing data in the long_reads column of the sample sheet and at the same time disable the long-read sequencing data steps.

d4straub commented 1 year ago

https://nf-co.re/mag/2.2.1/parameters#single_end isn't doing the job?

jfy133 commented 1 year ago

I think the point is more about wanting to mix single-end libraries in the same run as paired-end libraries (and/or possibly also accounting for singletons).

alexhbnr commented 1 year ago

Yes, I am sorry for not clarifying it more, @d4straub. We have encountered a number of samples for which we have both single-end and paired-end data that belong to the same sample. Therefore, we would like to be able to assemble them together without treating the different sequencing data types as separate samples and having to perform a co-assembly.

d4straub commented 1 year ago

Ah I see, yes, that's a different approach. However, I am opposed to feeding short reads into a dedicated long-read channel; that might cause too many problems further down, confuse other developers, and make the whole code less clear. Probably rather add a dedicated optional column for single-end short reads in the samplesheet? Or combine it with https://github.com/nf-core/mag/issues/358 to have multiple sequencing runs, including single-end and paired-end libraries, available?

jfy133 commented 1 year ago

Given that for the run merging suggested in #358 we would have to change the samplesheet anyway, I think that while it would be 'more work', it would be more beneficial to have a separate singletons column.

jfy133 commented 1 year ago

Some observations from the current code:

metaSPAdes does not actually support single-end-only assembly, but we can include single-end reads as 'orphaned' reads via -s alongside a paired library of some form (I'm not really sure if this is best practice but :shrug:).
- Implementation: singletons = reads.size() > 2 ? "-s ${reads[2]}" : ""
Will need to ensure fastp/adapterremoval supports exporting the "true" singletons, and then merge these with the unique single-end libraries.
- Question: how to associate which single-end library with which paired-end?
MEGAHIT does support single-end assembly, as it currently does in the pipeline. As far as I can tell, it also allows singletons alongside the pairs with -r, so we would need to update the condition
```
def input = params.single_end ? "-r \"" + reads1.join(",") + "\"" : "-1 \"" + reads1.join(",") + "\" -2 \"" + reads2.join(",") + "\""
```

to have all three if a reads3 is present
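As an illustration of the updated condition described above, here is a minimal plain-Groovy sketch. The helper name `buildMegahitInput` and the `reads3` list are hypothetical and not taken from the pipeline; the sketch only assumes that MEGAHIT accepts `-1`/`-2` for pairs and `-r` for single-end reads, each as a comma-separated list.

```groovy
// Sketch only: `buildMegahitInput` and `reads3` are hypothetical names,
// not part of the current nf-core/mag code.
String buildMegahitInput(List reads1, List reads2, List reads3 = []) {
    // Wrap a comma-separated file list in double quotes, as in the existing snippet
    def quote = { List files -> ('"' + files.join(',') + '"') }
    def args = []
    if (reads2) {
        // Paired-end libraries: forward and reverse mates
        args << ('-1 ' + quote(reads1))
        args << ('-2 ' + quote(reads2))
        // Optional single-end or singleton reads alongside the pairs
        if (reads3) {
            args << ('-r ' + quote(reads3))
        }
    } else {
        // Single-end only: everything is passed via -r
        args << ('-r ' + quote(reads1 + reads3))
    }
    return args.join(' ')
}

println buildMegahitInput(['a_1.fq.gz'], ['a_2.fq.gz'], ['a_se.fq.gz'])
```

Deciding on the presence of `reads2` rather than on `params.single_end` is one possible design choice; it would let mixed samples and purely single-end samples share the same code path.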

The biggest blocker at the moment is that I'm not sure how we would do the mapping... I guess it would require mapping each set of reads independently and then merging the BAMs, which are then passed to the depth calculations.

alexhbnr commented 1 year ago

Some additional comments to yours, @jfy133:

You are right, metaSPAdes does not allow single-end data on its own; at most, it allows adding this type of data alongside paired-end sequencing data. The logic that you implemented above makes sense to me. However, we might need to catch the case where someone wants to use metaSPAdes and doesn't provide paired-end data, in case this is not implemented yet.
Regarding fastp/adapterremoval, do you need to associate the single-end library with a paired-end one after all? You could process each paired-end and single-end library separately and only merge them after the alignment step on the sample level. At this point, the sample ID is relevant but not the library ID. Or am I missing something here?
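
The check mentioned above (metaSPAdes requested without any paired-end data) could look roughly like the following plain-Groovy sketch. The function name `checkSpadesInput` and the per-library `single_end` flag are assumptions made for illustration; the thread does not say how or where the pipeline would implement this.

```groovy
// Sketch only: `checkSpadesInput` and the library map layout are hypothetical.
void checkSpadesInput(String sampleId, List<Map> libraries) {
    // metaSPAdes needs at least one paired-end library per assembly
    def hasPaired = libraries.any { !it.single_end }
    if (!hasPaired) {
        throw new IllegalArgumentException(
            ("Sample '" + sampleId + "': metaSPAdes requires at least one paired-end library; " +
             "single-end reads can only be added alongside pairs.").toString()
        )
    }
}

// Passes: one paired-end and one single-end library for the same sample
checkSpadesInput('sampleA', [[single_end: false], [single_end: true]])
```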

jfy133 commented 1 year ago

> Some additional comments to yours, @jfy133:
>
> you are right, metaSPAdes does not allow to have only single-end data but at most allows for adding this type of data besides paired-end sequencing data. The logic that you implemented above makes sense to me. However, we might need to catch the exception that someone wants to use metaSPAdes and doesn't provide paired-end data, in case this is not implemented yet

:+1:

> regarding the fastp/adapterremoval, do you need to associate the single-end library with a paired-end one after all? You could process each paired-end and single-end library separately and only merge them after the alignment step on the sample level. At this point, the sample ID is relevant but not the library ID. Or do I miss something here.

Uhhh, good point. No, I think you might be right... I think I had something in my head about the groups, but then the groups can be associated anyway.
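
To make the sample-level merge discussed here concrete, the following plain-Groovy sketch groups per-library alignments by sample ID only. The `[meta, bam]` layout, the function name `groupBamsBySample`, and the file names are invented for illustration; inside an actual Nextflow workflow, an operator such as `groupTuple` would presumably play this role.

```groovy
// Sketch only: the [meta, bam] layout and all names here are hypothetical.
Map groupBamsBySample(List alignments) {
    // Group on the sample ID; the library ID and single_end flag are ignored
    return alignments
        .groupBy { it[0].id }
        .collectEntries { id, entries -> [(id): entries.collect { it[1] }] }
}

def alignments = [
    [[id: 'sampleA', library: 'lib1', single_end: false], 'sampleA_lib1.bam'],
    [[id: 'sampleA', library: 'lib2', single_end: true],  'sampleA_lib2.bam'],
    [[id: 'sampleB', library: 'lib1', single_end: false], 'sampleB_lib1.bam'],
]

// Each entry lists the BAMs that would be merged before depth calculation
println groupBamsBySample(alignments)
```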

nf-core / mag

Add support for single-end reads (in addition to paired-end reads) #359