jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)
Other
120 stars 15 forks source link

Errors with NULL FASTQ sequences #43

Closed timp0 closed 6 years ago

timp0 commented 6 years ago

Found a strange bug from some recent HiSeq data:

When running: atropos --aligner insert -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \ -o /shared/data/trim1.fq.gz -p /shared/data/trim2.fq.gz \ -pe1 /shared/data/test1.fq.gz -pe2 /shared/data/test2.fq.gz \ -u 10 -U 7 -q 25 \ --op-order ACQ

On attached paired fq (minimal file to demonstrate) I get this error

Traceback (most recent call last): File "/shared/conda/lib/python3.6/site-packages/atropos/commands/base.py", line 74, in handle_records self.handle_record(context, record) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/base.py", line 127, in handle_record return self.handle_reads(context, read1, read2) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/trim/init.py", line 52, in handle_reads return self.record_handler.handle_record(context, read1, read2) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/trim/init.py", line 70, in handle_record reads = self.modifiers.modify(read1, read2) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/trim/modifiers.py", line 1054, in modify read1, read2 = mods(read1, read2) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/trim/modifiers.py", line 363, in call match = self.aligner.match_insert(read1.sequence, read2.sequence) File "/shared/conda/lib/python3.6/site-packages/atropos/align/init.py", line 280, in match_insert seq2_rc = reverse_complement(seq2) File "/shared/conda/lib/python3.6/site-packages/atropos/util/init.py", line 428, in reverse_complement return "".join(BASE_COMPLEMENTS[base] for base in reversed(seq)) File "/shared/conda/lib/python3.6/site-packages/atropos/util/init.py", line 428, in return "".join(BASE_COMPLEMENTS[base] for base in reversed(seq)) KeyError: '\x00'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/shared/conda/lib/python3.6/site-packages/atropos/util/init.py", line 674, in run_interruptible func(*args, **kwargs) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/base.py", line 23, in call self.process_batch(batch) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/base.py", line 58, in process_batch self.handle_records(context, records) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/trim/init.py", line 48, in handle_records super().handle_records(context, records) File "/shared/conda/lib/python3.6/site-packages/atropos/commands/base.py", line 78, in handle_records idx, context['index'])) from err atropos.AtroposError: An error occurred at record 998 of batch 1

It seems to coordinate with some empty fq records: @170830_GRCF3_0496_AHTCV7BCXY:2:2216:21251:101305/1 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@\ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ + #################################################################################################### @170830_GRCF3_0496_AHTCV7BCXY:2:2216:21251:101345/1 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@\ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ + ####################################################################################################

In contrast, trim_galore and other trimming software doesn't seem to have a problem with it.

test2.fq.gz test1.fq.gz

timp0 commented 6 years ago

I'll add that I think this could be because of my op-order - ACQ is probably trying to force the adaptor alignment first before trimming for quality or other useful things which would eliminate this problem.

jdidion commented 6 years ago

Thanks - it's useful to have this tested with other op orders. One of the many things I need to add unit tests for :)

In any case, this shouldn't throw an exception. Looking into it today.

jdidion commented 6 years ago

Ah, this is an interesting case - those records aren't actually empty, but are filled with null bytes. This is the problem with FASTQ not being a spec - it's unclear whether that should be allowed. It seems to me that an empty record should have both sequence and quality strings of length zero.

I added the Alphabet class, which encapsulates information about a particular sequence alphabet. Currently the only concrete instance is 'dna'. There is now an '--alphabet' option. If you set this to 'dna', sequences will be checked against the alphabet as they are read. Any bases not in the alphabet will be changed to the default character ('N'). This should fix your issue, as all of the null bytes will be converted to 'N's.