jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)
Other
120 stars 15 forks source link

legacy_report stops as soon as an adapter sequence is not trimmed #57

Closed cokelaer closed 6 years ago

cokelaer commented 6 years ago

In principle, when we use a list of adapters, we get a report that looks like (if txt format is required):

--------
Trimming
--------

Pairs                                    records     fraction
----------------------------------- ------------ ------------
Total read pairs processed:              864,879
  Read 1 with adapter:                   106,932        12.4%
  Read 2 with adapter:                       694         0.1%
Pairs that were too short:               130,547        15.1%
Pairs written (passing filters):         734,332        84.9%

Base pairs                                    bp     fraction
----------------------------------- ------------ ------------
Total bp processed:                  174,705,558
  Read 1:                             87,352,779
  Read 2:                             87,352,779
End Ns trimmed                               251         0.0%
  Read 1:                                    219         0.0%
  Read 2:                                     32         0.0%
Quality-trimmed                       17,022,625         9.7%
  Read 1:                              6,813,448         7.8%
  Read 2:                             10,209,177        11.7%
Total bp written (filtered):         143,928,089        82.4%
  Read 1:                             72,399,685        82.9%
  Read 2:                             71,528,404        81.9%

followed by a section that looks like this for each adapters:

----------------------------------------------------
First read: Adapter Universal_Adapter|name:universal
----------------------------------------------------

Sequence                                                   Type           Length Trimmed (x)
---------------------------------------------------------- -------------- ------ -----------
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT variable 5'/3'     58         550

147 times, it overlapped the 5' end of a read
403 times, it overlapped the 3' end or was within the read

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-58 bp: 5

Overview of removed sequences (5'):
length count  expect max.err error counts
                             0  1  2     
------ ----- ------- ------- ------------
     6    68 1,266.9       0 68          
     7    31   316.7       0 31          
    10    14    39.6       1 1  13       

Now, in this case, the number of trimmed adapters is 550. If it was 0, the the whole report process is stopped. I believe this is a bug that occurs on line 582:

if adapter["total"] == 0:
    return

that should be

adapter["total"] == 0:
   continue
pkMyt1 commented 6 years ago

I have also found a potential bug in that file that is coming from line 721

def _print_base_histogram(title, hist, extra_width=4, index_name='Pos'):

When hist is None it throws an error and stops the report. Adding

if hist is None: _print("No Data") return

Allows the report to complete. I am not sure if this is related to the bug above or not.

jdidion commented 6 years ago

Thanks to both of you! These fixes will go out in v1.1.16 today.