mergpe: Pair up uneven paired FASTQ files [feature request]

lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats

MIT License

1.37k stars 308 forks source link

mergpe: Pair up uneven paired FASTQ files [feature request] #82

Open sjackman opened 8 years ago

sjackman commented 8 years ago

I have a pair of FASTQ files so that the paired-end reads no longer line up. I pair them up and drop the orphaned reads like so:

cat reads_1.fq reads_2.fq | paste - - - - | sort | tr '\t' '\n' | seqtk dropse >reads.fq

It would be nice to have a command like seqtk mergpe that pairs up uneven paired FASTQ files.

nathanhaigh commented 8 years ago

Since your input reads are likely in the same order, with the exception of reads which have been orphaned, you can avoid the costly (time and memory) sort by using process substitution on each input file to make the reads occupy a single line each and then paste the lines alternating from each input file:

paste -d'\n' <(paste - - - - < reads_1.fq) <(paste - - - - < reads_2.fq) \
  | tr '\t' '\n' \
  | seqtk dropse \
> reads.fq

You would only need to use sort in the event the ordering of the input files have been tampered, with other than just dropping of reads to create orphans.

sjackman commented 8 years ago

Your command doesn't work as is, but it gave me an idea. It would work if the two FASTQ files were already sorted by read ID, and the first paste command was replaced with a merge sort.

Here's an example where the paste of pastes doesn't work:

❯❯❯ cat read1.fa
>1/1
ACGT
>2/1
ACGT
>3/1
ACGT
>4/1
ACGT
>5/1
ACGT
❯❯❯ cat read2.fa
>1/2
ACGT
>3/2
ACGT
>5/2
ACGT
❯❯❯ paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/2
ACGT
>3/1
ACGT

Note that read 5 is missing from the output, and read 3 is in an atypical order (read 2 then read 1). This works:

❯❯❯ sort -m <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/1
ACGT
>3/2
ACGT
>5/1
ACGT
>5/2
ACGT

nathanhaigh commented 8 years ago

I did imply the input reads need to be in the same order :)

The out-of-order and missing read might be because I didn't account for unequal lengths of the two input files which results in empty lines being piped to seqtk dropse. These empty lines might be confusing seqtk dropse and giving the odd behaviour:

paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n'
>1/1
ACGT
>1/2
ACGT
>2/1
ACGT
>3/2
ACGT
>3/1
ACGT
>5/2
ACGT
>4/1
ACGT

>5/1
ACGT

Not sure how much CPU time the merge sort takes (minimal I expect) but it shouldn't be needed if the blank lines are removed. (Maybe you can test this as I haven't got access to seqtk at the moment:

paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | sed '/^$/d' | tr '\t' '\n' | seqtk dropse

sjackman commented 8 years ago

I did say in the original post I have a pair of FASTQ files so that the paired-end reads no longer line up. :wink: If the two files have the same number of reads in the same order, the solution is to use seqtk mergepe rather than seqtk dropse.

Blank lines have no effect on seqtk dropse, and removing them likewise has no effect. The merge sort is required for seqtk dropse. The challenge is that Illumina reads are not (I believe) sorted lexicographically, and so sort -m does not work without first sorting the reads lexicographically, which is slow.

nathanhaigh commented 8 years ago

I should have paid more attention to your example input read names. You're right. To avoid an explicit sort, a bit more programming logic is required to resolve this.

sjackman commented 8 years ago

seqtk mergpe could be made smarter so that it does a merge sort of the two files. It would need an option to specify either a lexicographical sort (like sort -m) or an Illumina read name sort. Then the pipe seqtk mergpe | seqtk dropse would work as desired.

alienzj commented 5 years ago

https://github.com/linsalrob/fastq-pair

alienzj commented 5 years ago

Hello, how to get PE reads and SE reads from uneven paired FASTQ files just using seqtk?

alienzj commented 5 years ago

https://github.com/jvivian/fastq_pair