jstjohn / SeqPrep

Tool for stripping adaptors and/or merging paired reads with overlap into single reads.
MIT License
140 stars 51 forks source link

No alignment when adapters are found #16

Open mairae opened 10 years ago

mairae commented 10 years ago

I'm trying to merge reads with SeqPrep.

I have read 1:

@M01271:10:000000000-A3WGH:1:1101:15333:1723 1:N:0:1
GTTAATGTTTGGCTGCGAGGTCGGGGCGGACGGACGCTTCCTCCGCGGGTACCAGCAGGACGCCTATGATGGCGCTGACTACATCGCCCTGAATGAGGACCTGAGCTCCTGGACCGCGGCGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGC
+
BBBBAFFFFFF@GGGEGGGGGGGGGGGGGEGGGGGECFGEHCGHFFGGGEBHHHGAFGHEHGGGGGHHEHHHFDDGGGGHHGHHHGGGGGGGGHHHGHFGHFHHFHHHGHGGHCGDACGGGFF;BFFFFFFFFDFFFFFFFFFFFFDFF-@FFFE?ADFFADCAF:FFCF-DC;BFBFFBBFFBF/AFEFAFFFFFCF9

and read 2:

@M01271:10:000000000-A3WGH:1:1101:15333:1723 2:N:0:1
TGGTTGGCCTTCCCCATCTCCAGGTATCTGCGCAGCCACTCCACGCACGTGCCCCCCAGGTAGGCTCTGTCTTGTTCAGCCGCACCTGCCGCCTCCCACTTGCGCCGGGTGATCTGAGCCGCGGTGTCCG
+
BABBBCCFFFFFGGGFGGGGGGHHHFHHHHHGGGGGHHHHHHHHGGGGGHHHHHGGGGGGHFHGHHHHHHHHHHHHHHGGHGGGEGHHGHCGGGGGGHHHHHGGGGGCGFG/?GGFFFHHFGGGD@FHHD

Looking at them, they do have a perfect overlap.

Read 1
GTTAATGTTTGGCTGCGAGGTCGGGGCGGACGGACGCTTCCTCCGCGGGTACCAGCAGGACGCCTATGATGGCGCTGACTACATCGCCCTGAATGAGGACCTGAGCTCCTGGACCGCGGCGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGC
-----------------------------------------------------------------------------------------------------------------------CGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGCACGTGCGTGGAGTGGCTGCGCAGATACCTGGAGATGGGGAAGGCCAACCA
Read 2 reverse complement

Then I took the first few base pairs of each read as adapter sequences:

A GTTAATGTTTGGCTGCGA
B TGGTTGGCCTTCCCCATC

And used SeqPrep to merge them:

SeqPrep -f rabbit_R1_01.fastq -r rabbit_R2_01.fastq -1 R1.fastq.gz -2 R2.fastq.gz -3 R1.disc.fastq.gz -4 R2.disc.fastq.gz -s merged.fastq.gz -E align.fastq.gz -A 'GTTAATGTTTGGCTGCGA' -B 'TGGTTGGCCTTCCCCATC'

But it seems not to be able to do the alignment. Am I missing something?

Processing reads... |
Pairs Processed:        1
Pairs Merged:   0
Pairs With Adapters:    1
Pairs Discarded:        1

(When I put something like CCCCCCCCCCCCC as adapters, the reads get merged.)

Thanks a lot, Marie

jstjohn commented 10 years ago

Hello,

What do you mean that you take the first few base-pairs of the read as the adapter sequence?

For the alignment you drew, I think that I would not expect there to be adapter sequence present. Adapter sequence shows up when you have the opposite case, something like this (excuse the bogus alignment):

read1: -------AAAAAAAAAAAA read2: AAAAAAAAAAA---------

In this case the overhang is read2 coming before read1. This is what happens when the underlying fragment is shorter than the read length.

If you use the first few BP of read1 as the “adapter” then SeqPrep will go ahead and assume that the pair of reads starts with the adapter sequence (is an adapter dimer), in which case you want to throw it out completely.

Make sense?

Thanks, John

On Apr 2, 2014, at 6:52 AM, mairae notifications@github.com wrote:

I'm trying to merge reads with SeqPrep.

I have read 1:

@M01271:10:000000000-A3WGH:1:1101:15333:1723 1:N:0:1 GTTAATGTTTGGCTGCGAGGTCGGGGCGGACGGACGCTTCCTCCGCGGGTACCAGCAGGACGCCTATGATGGCGCTGACTACATCGCCCTGAATGAGGACCTGAGCTCCTGGACCGCGGCGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGC + BBBBAFFFFFF@GGGEGGGGGGGGGGGGGEGGGGGECFGEHCGHFFGGGEBHHHGAFGHEHGGGGGHHEHHHFDDGGGGHHGHHHGGGGGGGGHHHGHFGHFHHFHHHGHGGHCGDACGGGFF;BFFFFFFFFDFFFFFFFFFFFFDFF-@FFFE?ADFFADCAF:FFCF-DC;BFBFFBBFFBF/AFEFAFFFFFCF9 and read 2:

@M01271:10:000000000-A3WGH:1:1101:15333:1723 2:N:0:1 TGGTTGGCCTTCCCCATCTCCAGGTATCTGCGCAGCCACTCCACGCACGTGCCCCCCAGGTAGGCTCTGTCTTGTTCAGCCGCACCTGCCGCCTCCCACTTGCGCCGGGTGATCTGAGCCGCGGTGTCCG + BABBBCCFFFFFGGGFGGGGGGHHHFHHHHHGGGGGHHHHHHHHGGGGGHHHHHGGGGGGHFHGHHHHHHHHHHHHHHGGHGGGEGHHGHCGGGGGGHHHHHGGGGGCGFG/?GGFFFHHFGGGD@FHHD Looking at them, they do have a perfect overlap.

Read 1 GTTAATGTTTGGCTGCGAGGTCGGGGCGGACGGACGCTTCCTCCGCGGGTACCAGCAGGACGCCTATGATGGCGCTGACTACATCGCCCTGAATGAGGACCTGAGCTCCTGGACCGCGGCGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGC -----------------------------------------------------------------------------------------------------------------------CGGACACCGCGGCTCAGATCACCCGGCGCAAGTGGGAGGCGGCAGGTGCGGCTGAACAAGACAGAGCCTACCTGGGGGGCACGTGCGTGGAGTGGCTGCGCAGATACCTGGAGATGGGGAAGGCCAACCA Read 2 reverse complement Then I took the first few base pairs of each read as adapter sequences:

A GTTAATGTTTGGCTGCGA B TGGTTGGCCTTCCCCATC And used SeqPrep to merge them:

SeqPrep -f rabbit_R1_01.fastq -r rabbit_R2_01.fastq -1 R1.fastq.gz -2 R2.fastq.gz -3 R1.disc.fastq.gz -4 R2.disc.fastq.gz -s merged.fastq.gz -E align.fastq.gz -A 'GTTAATGTTTGGCTGCGA' -B 'TGGTTGGCCTTCCCCATC' But it seems not to be able to do the alignment. Am I missing something?

Processing reads... | Pairs Processed: 1 Pairs Merged: 0 Pairs With Adapters: 1 Pairs Discarded: 1 (When I put something like CCCCCCCCCCCCC as adapters, the reads get merged.)

Thanks a lot, Marie

— Reply to this email directly or view it on GitHub.

mairae commented 10 years ago

Hello John, thanks a lot for your quick response!

It is true, that I don't necessarily want/expect to see adapter sequences at the beginning of a read in the SeqPrep alignment. The alignment I drew did not come from SeqPrep though and was just to demonstrate that I would expect that the two reads in question should be able to be merged.

My data mostly looks like this:

Read 1 AAAAACCCCCCCCCCGG (adapter 1; sequence; part of rc of adapter 2)
Read 2 CCCCCGGGGGGGGGGTTT (adapter 2; rc of sequence; part of rc of adapter 1)

(So the fragment should not be shorter than the read lengths, but it's only a few bases longer.) Then I would assume the alignment to be like

AAAAACCCCCCCCCCGG---
---AAACCCCCCCCCCGGGGG

or

------CCCCCCCCCCGG---
---AAACCCCCCCCCC-----

with trimmed adapter sequences.

Is this not what I should be expecting? I thought, the reads might get discarded because the score of the alignment is not good enough. That's why I selected a read pair that should have a perfect overlap. Additionally, I used the first bases of read 1 as forward primer (-A) and the first bases of read 2 as reverse primer (-B). (Those sequences are very similar to the real primer sequences I'm using, but for testing I did not want to have any mismatches.) So, in the toy example, the forward primer would be AAAAA and the reverse primer CCCCC.

And on a related note: I think, in an earlier version, it was possible, to put empty strings as adapter sequences (-A '' -B ''), but it does not seem to be working anymore. (Similar problem: SeqPrep reports the reads as "Pairs With Adapters" and as "Pairs Discarded".) This would be useful for me, since I'm using different different adapters and don't necessarily need SeqPrep to strip the adapters for me. Is there any way to do this?

Best regards, Marie

jstjohn commented 10 years ago

There wouldn’t be an or situation like you drew below. The way illumina adapters work it would always be the second case if indeed adapter sequences are present. They always happen after the genomic DNA in the first read. When there is no genomic DNA present, then it happens at the beginning of the first read. If the adapter happens at the beginning of the first read then there should not be any genomic DNA present, unless you are following a non-standard paired-end illumina protocol, like long insert mate-pair or something.

On Apr 3, 2014, at 4:53 AM, mairae notifications@github.com wrote:

Hello John, thanks a lot for your quick response!

It is true, that I don't necessarily want/expect to see adapter sequences at the beginning of a read in the SeqPrep alignment. The alignment I drew did not come from SeqPrep though and was just to demonstrate that I would expect that the two reads in question should be able to be merged.

My data mostly looks like this:

Read 1 AAAAACCCCCCCCCCGG (adapter 1; sequence; part of rc of adapter 2) Read 2 CCCCCGGGGGGGGGGTTT (adapter 2; rc of sequence; part of rc of adapter 1) (So the fragment should not be shorter than the read lengths, but it's only a few bases longer.) Then I would assume the alignment to be like

AAAAACCCCCCCCCCGG--- ---AAACCCCCCCCCCGGGGG or

------CCCCCCCCCCGG--- ---AAACCCCCCCCCC----- with trimmed adapter sequences.

Is this not what I should be expecting? I thought, the reads might get discarded because the score of the alignment is not good enough. That's why I selected a read pair that should have a perfect overlap. Additionally, I used the first bases of read 1 as forward primer (-A) and the first bases of read 2 as reverse primer (-B). (Those sequences are very similar to the real primer sequences I'm using, but for testing I did not want to have any mismatches.) So, in the toy example, the forward primer would be AAAAA and the reverse primer CCCCC.

And on a related note: I think, in an earlier version, it was possible, to put empty strings as adapter sequences (-A '' -B ''), but it does not seem to be working anymore. (Similar problem: SeqPrep reports the reads as "Pairs With Adapters" and as "Pairs Discarded".) This would be useful for me, since I'm using different different adapters and don't necessarily need SeqPrep to strip the adapters for me. Is there any way to do this?

Best regards, Marie

— Reply to this email directly or view it on GitHub.