nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License
134 stars 79 forks source link

Possiblity to extract non-host paired-end reads in separate FastQ files #945

Open alexhbnr opened 1 year ago

alexhbnr commented 1 year ago

Currently, nf-core/eager does only allow to extract reads that didn't align to the provided reference genome into a single FastQ file. While this is no issue when having merged overlapping read pairs during adapter removal, non-merged sequencing data might still consist of read pairs and single reads at the same time. This simple extraction with samtools view -f 4 /path/to/bam | samtools fastq will then lead to a single file from which it is almost impossible to retrace which are the read pairs and which are the singletons.

Instead nf-core/eager should automatically switch to a different extraction command when confronted with non-merged read pairs:

samtools view -f 4 /path/to/bam | samtools fastq -1 /path/to/fwd_fastq -2 /path/to/rev_fastq -0 /path/to/singleton_fastq

The more complex (but more accurate) solution would be the following:

samtools sort -n -o <sample>.name_sorted.bam
samtools fixmate -mcu <sample>.name_sorted.bam <sample>.fixmate.bam
samtools view -uh -e '(flag.paired && (flag.unmap || flag.munmap)) || (!flag.paired && flag.unmap)' <sample>.fixmate.bam | \
samtools fastq -1 /path/to/fwd_fastq -2 /path/to/rev_fastq -0 /path/to/singleton_fastq

The first two step ensure that paired reads have complementary flags, which is particularly important with respect to the "unmapped" and "mate unmapped" flag. The samtools view command then uses a filtering expression to extract unmapped reads. It distinguishes between read pairs and singletons. The former are extracted if at least one of the mates of the pair couldn't be aligned. The latter are simply extracted when they are unmapped.

Using the more complex approach will ensure that the FastQ file with the forward reads and the FastQ file with the reverse reads of a paired-end sequencing run will have the same number of reads.