lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.55k stars 556 forks source link

Error handling of empty sequence #205

Open caleblareau opened 6 years ago

caleblareau commented 6 years ago

I ran into this today (using 0.7.16a)

(venv3) [cl322@rgs13 Exp73]$ cat bwa_logs/*
...
[mem_sam_pe] paired reads have different names: "GGTAACAGGTGTTTTAACGCCTTCGCA_NB501440:390:HMWWJBGX5:2:23308:24787:17425", "TGGACCAAGACCATGATCACCTGTGTA_NB501440:390:HMWWJBGX5:2:23308:21500:17425"

Hmm, okay, reads are not pairing properly across fastq files, so investigating further:

(venv3) [cl322@rgs13 processed]$ zcat <  N704-Exp73-Sample4_S1_L002-parse006_1.fastq.gz | sed -n '1~4p'  |  grep NB501440:390:HMWWJBGX5:2:23308:24787:17425 -n
2640:@GGTAACAGGTGTTTTAACGCCTTCGCA_NB501440:390:HMWWJBGX5:2:23308:24787:17425 1:N:0:TCCTGAGC
(venv3) [cl322@rgs13 processed]$ zcat <  N704-Exp73-Sample4_S1_L002-parse006_2.fastq.gz | sed -n '1~4p'  |  grep NB501440:390:HMWWJBGX5:2:23308:24787:17425 -n
2640:@GGTAACAGGTGTTTTAACGCCTTCGCA_NB501440:390:HMWWJBGX5:2:23308:24787:17425 2:N:0:TCCTGAGC
(venv3) [cl322@rgs13 processed]$ zcat <  N704-Exp73-Sample4_S1_L002-parse006_2.fastq.gz | sed -n '1~4p'  |  grep HMWWJBGX5:2:23308:21500:17425 -n
2639:@TGGACCAAGACCATGATCACCTGTGTA_NB501440:390:HMWWJBGX5:2:23308:21500:17425 2:N:0:TCCTGAGC
(venv3) [cl322@rgs13 processed]$ zcat <  N704-Exp73-Sample4_S1_L002-parse006_1.fastq.gz | sed -n '1~4p'  |  grep HMWWJBGX5:2:23308:21500:17425 -n
2639:@TGGACCAAGACCATGATCACCTGTGTA_NB501440:390:HMWWJBGX5:2:23308:21500:17425 1:N:0:TCCTGAGC

The implicated read names appear to be in the right place in each file. Here, I was frustrated since the error message didn't match what I was seeing. Finally, I just went to look at the sequences:

(venv3) [cl322@rgs13 processed]$ zcat < N704-Exp73-Sample4_S1_L002-parse006_1.fastq.gz | head -10560 | tail -n 10
@ACAGTACTTTCCTATTGGACTCTATTA_NB501440:390:HMWWJBGX5:2:23308:3742:17424 1:N:0:TCCTGAGC
GCGTTCGGTTTTCTACCTTTGCTTGTTCCTGAGGAA
+
EEEEEEEEEEEEEEAEAE<EE<</AAAAAEAAEAEE
@TGGACCAAGACCATGATCACCTGTGTA_NB501440:390:HMWWJBGX5:2:23308:21500:17425 1:N:0:TCCTGAGC

+
A6A/AEEEA///EEEEAAAEEA</E/</<///A//AEEAAEE//<66AE<<A<E/EEE/E/E//<<6/6///66<A//6/<EA/AE<<E/<6//EEAEEAEEAE<//6<//<AAA/E/
@GGTAACAGGTGTTTTAACGCCTTCGCA_NB501440:390:HMWWJBGX5:2:23308:24787:17425 1:N:0:TCCTGAGC
CTTCTTCGGCGAAGGCTACACTTACGGCGCCCAGCT
+
EEEEEEEEEA<EAEEEEEA<EAAEAAE/A<<<AEAA

ah ha; an empty sequencing sequencing line (this was the result of a homebrew trimming system that clearly failed here).

Nothing fundamentally wrong with bwa, but it took me embarrassingly long to find this particular error. I don't have an obvious recommendation for what to do better. If anything, but I wanted to document this behavior.