jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)

Adapter removal with wildcard N bases behaves unexpectedly with insert alignment #64

Closed gmagoon closed 6 years ago

gmagoon commented 6 years ago

I tried adapter removal for 2x151 bp Illumina data using the --aligner insert option, and I'm getting unexpected behavior when using wildcard N bases in the specified adapter sequence.

I show here results of two runs on 1 million read pairs. One run uses wildcard N for the 6-bp variable barcode sequence in adapter 1:

-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNACATCTCGTATGCCGTCTTCTGCTTG

...and the other uses the actual 6-bp barcode sequence for the data under consideration:

-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGTCTTCTGCTTG

Adapter 2 is specified as:

-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

At the end of this post, I include results for each run (with the only difference being the -a sequence). The results from using the wildcard N are unexpected, in the sense that:

  1. The trimmed reads count is lower (albeit marginally) when using the sequence with wildcard N. I would expect the trim count with wildcard matching would be higher, if anything.
  2. In the wildcard case, looking at the "Overview of removed sequences" as a function of removed sequence length, the error counts show a sharp jump starting with 65 bp removed sequences, where the mode jumps from 0 errors to 4 errors. (Note that the specified adapter 1 is 66 bp in length.)

When using the default aligner (i.e. --aligner adapter), the results (not shown) are consistent with my expectations. (I'm aware that the two alignment approaches have some fundamental differences.)

Unless I'm overlooking something obvious, this seems like it might be a bug of some sort, but it isn't immediately obvious to me how it is happening. Perhaps I'm misunderstanding something about the role of the specified adapter sequences in atropos when insert alignment is used?

Wildcard N

---------------------
First read: Adapter 1
---------------------

Sequence                                                           Type       Length Trimmed (x)
------------------------------------------------------------------ ---------- ------ -----------
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNACATCTCGTATGCCGTCTTCTGCTTG regular 3'     66      37,723

No. of allowed errors:
0-33 bp: 0; 34-66 bp: 1

Bases preceding removed adapters:
  A          25.9%
  C          20.5%
  G          23.6%
  T          29.9%
  none/other  0.0%

Overview of removed sequences:
length count     expect max.err error counts            
                                0    1  2  3 4  5  6 7 8
------ ----- ---------- ------- ------------------------
     1 1,227  250,000.0       0 1227                    
     2 1,196   62,500.0       0 1196                    
     3 1,066   15,625.0       0 1066                    
     4 1,078    3,906.2       0 1078                    
     5 1,007      976.6       0 1007                    
     6 1,078      244.1       0 1078                    
     7 1,031       61.0       0 1031                    
     8 1,042       15.3       0 1042                    
     9   976        3.8       0 971  5                  
    10 1,023        1.0       0 1015 8                  
    11   933        0.2       0 931  2                  
    12   874        0.1       0 868  6                  
    13   865        0.0       0 852  13                 
    14   719        0.0       0 713  6                  
    15   806        0.0       0 795  11                 
    16   780        0.0       0 774  6                  
    17   792        0.0       0 787  5                  
    18   756        0.0       0 746  10                 
    19   725        0.0       0 716  9                  
    20   668        0.0       0 656  12                 
    21   685        0.0       0 673  12                 
    22   600        0.0       0 589  11                 
    23   612        0.0       0 602  10                 
    24   534        0.0       0 528  6                  
    25   556        0.0       0 547  9                  
    26   478        0.0       0 463  13 2               
    27   551        0.0       0 540  10 1               
    28   566        0.0       0 539  22 5               
    29   521        0.0       0 510  8  3               
    30   561        0.0       0 550  11                 
    31   497        0.0       0 485  11 1               
    32   443        0.0       0 436  7                  
    33   431        0.0       0 423  8                  
    34   403        0.0       0 389  11 3               
    35   408        0.0       1 308  97 3               
    36   381        0.0       1 308  55 18              
    37   394        0.0       1 337  48 9               
    38   388        0.0       1 304  65 19              
    39   395        0.0       1 330  52 13              
    40   377        0.0       1 313  52 12              
    41   368        0.0       1 308  51 9               
    42   302        0.0       1 252  42 8               
    43   279        0.0       1 228  42 9               
    44   332        0.0       1 267  46 15 4            
    45   271        0.0       1 223  30 16 2            
    46   284        0.0       1 227  43 13 1            
    47   272        0.0       1 223  34 8  7            
    48   285        0.0       1 238  39 7  1            
    49   305        0.0       1 254  40 8  3            
    50   297        0.0       1 246  37 8  6            
    51   252        0.0       1 203  35 7  7            
    52   245        0.0       1 197  39 7  2            
    53   187        0.0       1 145  32 8  2            
    54   231        0.0       1 173  41 13 4            
    55   218        0.0       1 170  37 7  4            
    56   219        0.0       1 177  29 11 2            
    57   212        0.0       1 163  36 10 3            
    58   196        0.0       1 141  43 6  6            
    59   197        0.0       1 159  28 9  1            
    60   166        0.0       1 126  24 12 4            
    61   197        0.0       1 150  33 9  5            
    62   168        0.0       1 127  29 8  4            
    63   172        0.0       1 123  35 4  3 7          
    64   180        0.0       1 128  25 3  2 16 6       
    65   131        0.0       1 44   9  2  0 63 11 2    
    66   146        0.0       1 20   7  3  2 74 36 3 1  
    67   163        0.0       1 22   4  1  0 87 44 4 1  
    68   128        0.0       1 20   8  1  0 56 39 4    
    69   163        0.0       1 29   4  1  0 90 33 3 3  
    70   126        0.0       1 17   9  1  2 68 28 1    
    71   128        0.0       1 22   3  0  0 71 30 2    
    72   121        0.0       1 18   5  1  1 71 24 1    
    73   106        0.0       1 21   2  0  0 57 22 4    
    74   104        0.0       1 18   4  2  0 48 28 2 2  
    75   102        0.0       1 13   4  0  0 52 31 1 1  
    76   108        0.0       1 18   7  3  2 55 19 3 1  
    77    87        0.0       1 11   3  1  1 39 32      
    78   102        0.0       1 20   3  1  1 47 24 5 1  
    79    84        0.0       1 11   2  0  1 43 25 1 0 1
    80    92        0.0       1 12   1  1  0 49 26 3    
    81    94        0.0       1 10   4  2  0 56 22      
    82    89        0.0       1 11   0  1  0 49 25 2 0 1
    83    81        0.0       1 15   5  0  1 40 17 1 2  
    84    72        0.0       1 12   4  1  0 30 23 2    
    85    66        0.0       1 7    3  0  0 41 15      
    86    56        0.0       1 11   3  0  1 28 10 3    
    87    82        0.0       1 8    6  1  0 42 21 4    
    88    57        0.0       1 10   1  0  0 22 23 1    
    89    52        0.0       1 11   5  2  0 18 14 2    
    90    54        0.0       1 8    2  1  1 29 11 1 1  
    91    65        0.0       1 16   3  2  2 26 13 3    
    92    45        0.0       1 5    3  1  0 21 14 1    
    93    53        0.0       1 6    3  0  0 30 14      
    94    51        0.0       1 9    4  0  0 26 11 1    
    95    51        0.0       1 6    2  0  0 19 21 2 1  
    96    40        0.0       1 3    2  0  1 27 4  1 2  
    97    31        0.0       1 3    1  0  2 16 8  1    
    98    33        0.0       1 1    0  0  1 21 7  3    
    99    27        0.0       1 6    1  0  1 11 8       
   100    30        0.0       1 5    1  1  0 8  14 1    
   101    29        0.0       1 7    2  1  0 12 7       
   102    30        0.0       1 7    2  0  0 10 11      
   103    26        0.0       1 7    1  0  1 10 7       
   104    21        0.0       1 4    1  1  1 10 4       
   105    36        0.0       1 6    2  0  1 12 14 1    
   106    18        0.0       1 1    1  0  0 11 4  1    
   107    18        0.0       1 1    1  1  0 6  7  2    
   108    20        0.0       1 5    1  0  0 6  8       
   109    16        0.0       1 3    1  2  0 6  4       
   110    25        0.0       1 3    2  1  0 10 9       
   111    15        0.0       1 2    1  0  1 6  4  1    
   112    16        0.0       1 0    0  2  0 7  7       
   113    16        0.0       1 1    0  2  0 6  6  0 0 1
   114    17        0.0       1 5    3  1  0 2  6       
   115    10        0.0       1 2    2  1  1 3  1       
   116    16        0.0       1 3    0  0  0 5  7  1    
   117     6        0.0       1 2    1  1  0 1  1       
   118    12        0.0       1 5    3  1  0 2  1       
   119     6        0.0       1 1    0  0  0 5          
   120    12        0.0       1 1    2  1  0 4  3  0 0 1
   121     4        0.0       1 0    1  0  0 1  2       
   122     6        0.0       1 2    0  0  1 1  2       
   123     7        0.0       1 1    0  1  0 1  4       
   124     3        0.0       1 1    1  0  0 1          
   125     3        0.0       1 0    2  1               
   126     3        0.0       1 1    1  0  0 0  1       
   127     5        0.0       1 1    0  0  0 3  1       
   128     4        0.0       1 3    0  0  0 1          
   129     4        0.0       1 4                       
   130     5        0.0       1 1    3  0  0 1          
   131     2        0.0       1 0    1  0  0 0  1       
   132     1        0.0       1 0    1                  
   133     2        0.0       1 1    0  0  0 1          
   134     3        0.0       1 0    0  0  0 1  2       
   135     2        0.0       1 0    1  0  0 0  0  1    
   136     3        0.0       1 2    1                  
   137     3        0.0       1 1    1  0  0 0  0  1    
   138     4        0.0       1 2    1  0  0 1          
   140     1        0.0       1 0    0  1               
   141     1        0.0       1 0    0  0  0 0  1       
   142     2        0.0       1 1    1                  
   143     2        0.0       1 2                       
   147     2        0.0       1 0    2                  
   149     1        0.0       1 0    1                  
   151     7        0.0       1 3    4   
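
The "expect" column in the report above looks consistent with the number of reads multiplied by 0.25 raised to the removed length, that is, the number of matches expected by chance if all four bases were equally likely. The following is a minimal sketch under that assumption; the formula is inferred from the reported numbers and the 1 million read pairs mentioned in the post, not taken from the atropos source, and `expected_random_matches` is a hypothetical helper name.

```python
def expected_random_matches(n_reads: int, length: int) -> float:
    """Expected number of chance matches of the given length.

    Assumes uniform base frequencies (p = 0.25 per position). This
    formula is an assumption inferred from the report, not copied
    from the atropos code.
    """
    return n_reads * 0.25 ** length


# Compare against the first few rows of the "expect" column.
for length in range(1, 7):
    print(length, round(expected_random_matches(1_000_000, length), 1))
```

Under this assumption, removed sequences longer than about 10 bp are very unlikely to be chance matches, so the differences between the two reports at longer lengths are unlikely to be explained by random hits.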

No wildcard

---------------------
First read: Adapter 1
---------------------

Sequence                                                           Type       Length Trimmed (x)
------------------------------------------------------------------ ---------- ------ -----------
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGTCTTCTGCTTG regular 3'     66      37,788

No. of allowed errors:
0-33 bp: 0; 34-66 bp: 1

Bases preceding removed adapters:
  A          25.9%
  C          20.5%
  G          23.6%
  T          29.9%
  none/other  0.0%

Overview of removed sequences:
length count     expect max.err error counts               
                                0    1  2  3 4 5 6 7 8 9 10
------ ----- ---------- ------- ---------------------------
     1 1,227  250,000.0       0 1227                       
     2 1,196   62,500.0       0 1196                       
     3 1,066   15,625.0       0 1066                       
     4 1,078    3,906.2       0 1078                       
     5 1,007      976.6       0 1007                       
     6 1,078      244.1       0 1078                       
     7 1,031       61.0       0 1031                       
     8 1,042       15.3       0 1042                       
     9   976        3.8       0 971  5                     
    10 1,023        1.0       0 1015 8                     
    11   933        0.2       0 931  2                     
    12   874        0.1       0 868  6                     
    13   865        0.0       0 852  13                    
    14   719        0.0       0 713  6                     
    15   806        0.0       0 795  11                    
    16   780        0.0       0 774  6                     
    17   792        0.0       0 787  5                     
    18   756        0.0       0 746  10                    
    19   725        0.0       0 716  9                     
    20   668        0.0       0 656  12                    
    21   685        0.0       0 673  12                    
    22   600        0.0       0 589  11                    
    23   612        0.0       0 602  10                    
    24   534        0.0       0 528  6                     
    25   556        0.0       0 547  9                     
    26   478        0.0       0 463  13 2                  
    27   551        0.0       0 540  10 1                  
    28   566        0.0       0 539  22 5                  
    29   521        0.0       0 510  8  3                  
    30   561        0.0       0 550  11                    
    31   497        0.0       0 485  11 1                  
    32   443        0.0       0 436  7                     
    33   431        0.0       0 423  8                     
    34   403        0.0       0 389  11 3                  
    35   407        0.0       1 393  12 2                  
    36   380        0.0       1 367  8  5                  
    37   394        0.0       1 380  12 2                  
    38   388        0.0       1 363  21 4                  
    39   396        0.0       1 378  16 2                  
    40   377        0.0       1 367  9  1                  
    41   368        0.0       1 357  8  3                  
    42   302        0.0       1 291  10 1                  
    43   280        0.0       1 267  11 2                  
    44   332        0.0       1 311  18 3                  
    45   272        0.0       1 261  9  1  1               
    46   285        0.0       1 244  36 5                  
    47   273        0.0       1 248  17 3  5               
    48   288        0.0       1 260  23 4  1               
    49   306        0.0       1 277  23 6                  
    50   298        0.0       1 270  23 4  1               
    51   252        0.0       1 229  20 1  2               
    52   246        0.0       1 222  22 2                  
    53   189        0.0       1 162  22 3  2               
    54   232        0.0       1 200  29 2  1               
    55   218        0.0       1 191  25 2                  
    56   219        0.0       1 196  18 4  1               
    57   213        0.0       1 187  23 3                  
    58   197        0.0       1 167  28 2                  
    59   202        0.0       1 178  14 9  1               
    60   169        0.0       1 116  39 9  4 1             
    61   199        0.0       1 127  48 17 7               
    62   169        0.0       1 112  33 19 5               
    63   175        0.0       1 104  47 16 5 3             
    64   183        0.0       1 109  53 12 4 4 0 0 0 1     
    65   131        0.0       1 70   43 12 4 2             
    66   147        0.0       1 77   41 17 5 5 1 1         
    67   166        0.0       1 97   47 12 6 2 0 0 1 1     
    68   128        0.0       1 62   43 17 3 1 1 0 1       
    69   166        0.0       1 106  36 10 9 4 0 1         
    70   126        0.0       1 75   36 7  4 4             
    71   129        0.0       1 76   31 17 4 0 0 0 0 1     
    72   121        0.0       1 80   26 9  3 2 1           
    73   108        0.0       1 67   24 10 4 2 0 1         
    74   104        0.0       1 58   30 8  2 3 1 1 1       
    75   102        0.0       1 58   35 3  3 2 0 1         
    76   108        0.0       1 61   22 13 8 0 2 1 1       
    77    88        0.0       1 46   32 2  6 0 0 2         
    78   104        0.0       1 62   27 10 2 1 0 0 0 2     
    79    84        0.0       1 47   24 4  2 4 1 1 1       
    80    94        0.0       1 53   26 11 2 2             
    81    95        0.0       1 59   23 5  4 1 2 1         
    82    91        0.0       1 53   25 9  1 2 0 0 0 1     
    83    82        0.0       1 47   20 6  4 2 2 1         
    84    73        0.0       1 37   25 8  1 1 0 0 1       
    85    66        0.0       1 43   16 3  3 0 1           
    86    57        0.0       1 30   10 8  3 1 4 1         
    87    82        0.0       1 44   26 9  1 1 0 0 0 1     
    88    58        0.0       1 27   24 4  1 0 2           
    89    52        0.0       1 21   15 9  4 1 2           
    90    55        0.0       1 31   12 3  4 3 0 1 0 0 1   
    91    66        0.0       1 34   14 7  7 2 1 0 1       
    92    45        0.0       1 23   15 4  0 2 0 1         
    93    54        0.0       1 32   16 2  4               
    94    51        0.0       1 31   14 3  1 0 1 1         
    95    52        0.0       1 20   23 5  3 1             
    96    40        0.0       1 28   4  1  5 0 0 2         
    97    31        0.0       1 16   9  4  1 0 1           
    98    33        0.0       1 22   7  3  1               
    99    28        0.0       1 13   9  3  1 0 1 0 0 0 1   
   100    30        0.0       1 11   13 3  1 1 0 1         
   101    30        0.0       1 17   8  3  1 0 0 1         
   102    30        0.0       1 11   11 4  1 2 0 0 1       
   103    27        0.0       1 13   8  2  3 1             
   104    21        0.0       1 11   4  3  0 1 1 1         
   105    37        0.0       1 15   15 1  3 2 0 0 1       
   106    18        0.0       1 11   5  1  0 1             
   107    18        0.0       1 7    8  3                  
   108    20        0.0       1 8    9  2  0 1             
   109    17        0.0       1 7    4  4  1 0 0 1         
   110    25        0.0       1 12   9  2  0 0 1 1         
   111    15        0.0       1 6    5  1  2 0 1           
   112    16        0.0       1 7    7  2                  
   113    16        0.0       1 6    6  1  0 2 1           
   114    18        0.0       1 3    8  2  3 1 0 0 1       
   115    10        0.0       1 4    2  2  1 0 0 0 0 0 0 1 
   116    16        0.0       1 7    7  1  0 1             
   117     7        0.0       1 2    2  0  1 0 0 0 1 1     
   118    12        0.0       1 6    3  1  1 0 0 0 1       
   119     6        0.0       1 5    0  0  1               
   120    12        0.0       1 5    3  0  0 2 0 0 1 1     
   121     4        0.0       1 1    2  0  1               
   122     6        0.0       1 2    2  1  0 0 0 0 1       
   123     7        0.0       1 1    4  1  0 0 1           
   124     3        0.0       1 1    0  0  0 1 1           
   125     3        0.0       1 0    1  1  0 1             
   126     3        0.0       1 0    1  0  0 1 1           
   127     5        0.0       1 3    1  0  0 0 0 0 1       
   128     4        0.0       1 1    0  0  0 1 1 1         
   129     4        0.0       1 3    0  0  0 0 0 0 1       
   130     5        0.0       1 2    2  0  0 0 0 1         
   131     3        0.0       1 0    2  1                  
   132     1        0.0       1 0    0  0  0 1             
   133     2        0.0       1 1    0  0  0 1             
   134     3        0.0       1 1    2                     
   135     2        0.0       1 0    1  1                  
   136     3        0.0       1 1    1  0  0 1             
   137     3        0.0       1 1    0  1  1               
   138     4        0.0       1 3    1                     
   140     1        0.0       1 0    0  0  0 1             
   141     1        0.0       1 0    1                     
   142     2        0.0       1 1    1                     
   143     2        0.0       1 2                          
   147     2        0.0       1 0    2                     
   149     1        0.0       1 0    1                     
   151     7        0.0       1 3    4 
jdidion commented 6 years ago

Thanks @gmagoon. Could you please also provide a minimal example dataset to replicate this issue? For example, a read that is trimmed correctly when you specify the barcode sequence, and incorrectly when you don't.

jdidion commented 6 years ago

I think I see the issue - InsertAligner is not respecting the 'match_adapter_wildcards' setting. Will be fixed in 1.1.18.
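
To illustrate what honoring such a setting means in general, here is a minimal sketch of mismatch counting with and without adapter wildcards. It is not the InsertAligner code: `count_mismatches` is a hypothetical function, and its flag is only named after the setting mentioned above. The adapter pieces are copied from the two -a sequences in the original report.

```python
def count_mismatches(adapter: str, segment: str,
                     match_adapter_wildcards: bool = True) -> int:
    """Count mismatching positions between an adapter and an
    equal-length read segment.

    If match_adapter_wildcards is True, an 'N' in the adapter matches
    any base; otherwise 'N' is compared literally like any other
    character.
    """
    if len(adapter) != len(segment):
        raise ValueError("adapter and segment must have equal length")
    mismatches = 0
    for a, s in zip(adapter, segment):
        if match_adapter_wildcards and a == "N":
            continue  # wildcard position: never counts as an error
        if a != s:
            mismatches += 1
    return mismatches


# Adapter 1 from the report, split around the 6-bp barcode position.
PREFIX = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
SUFFIX = "ACATCTCGTATGCCGTCTTCTGCTTG"
WILDCARD_ADAPTER = PREFIX + "NNNNNN" + SUFFIX
BARCODED_ADAPTER = PREFIX + "GTCCGC" + SUFFIX

# A read segment carrying the real barcode, compared to the wildcard adapter.
print(count_mismatches(WILDCARD_ADAPTER, BARCODED_ADAPTER, True))   # 0
print(count_mismatches(WILDCARD_ADAPTER, BARCODED_ADAPTER, False))  # 6
```

In this toy comparison, ignoring the wildcard setting turns a perfect adapter match into one with six mismatches. Whether this fully accounts for the error-count pattern in the report is not established here.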

gmagoon commented 6 years ago

excellent, thanks John!

gmagoon commented 6 years ago

Hi John,

It seems like the output is the same with v.1.1.18, except for the version number. I'll work on getting a representative example when I get a chance...

Greg

gmagoon commented 6 years ago

Hi @jdidion, I've attached some fastq-format data containing ten read pairs. In these ten read pairs, using a 6-bp N wildcard results in no trimming in v.1.1.18, whereas using the barcode sequence results in trimming. [Note that it turns out that this case, and possibly the previous one as well, actually used an 8-bp barcode, but that is neither here nor there (the last two bp have been specified in both the wildcard and no-wildcard tests).]

Here's the command I'm running:

$ atropos trim -pe1 LNGU9.atropos.R1.fastq -pe2 LNGU9.atropos.R2.fastq --aligner insert -e 0.029 --insert-match-error-rate 0.058 -o /dev/null -p /dev/null -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNCGATCTCGTATGCCGTCTTCTGCTTG -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

...and without the wildcard, I'm using:

-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACAAACATCGATCTCGTATGCCGTCTTCTGCTTG

Without the wildcard, the removed sequence ranges from 37 to 94 bp.

LNGU9.atropos.R1.fastq.txt
LNGU9.atropos.R2.fastq.txt
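
For anyone reproducing this, the two runs can be assembled so that they provably differ only in the -a value. The sketch below only builds and prints the command lines and does not execute atropos. The flags and sequences are copied from the command above; `build_command` is a hypothetical helper, and the input file names assume the attachments have been saved without their .txt suffix.

```python
import shlex
from typing import List

# Sequences copied from the commands in this comment.
ADAPTER2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
PREFIX = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
SUFFIX = "CGATCTCGTATGCCGTCTTCTGCTTG"


def build_command(adapter1: str) -> List[str]:
    """Return the atropos argument list for one run (hypothetical helper)."""
    return [
        "atropos", "trim",
        "-pe1", "LNGU9.atropos.R1.fastq",  # assumed local file name
        "-pe2", "LNGU9.atropos.R2.fastq",  # assumed local file name
        "--aligner", "insert",
        "-e", "0.029",
        "--insert-match-error-rate", "0.058",
        "-o", "/dev/null", "-p", "/dev/null",
        "-a", adapter1,
        "-A", ADAPTER2,
    ]


wildcard_cmd = build_command(PREFIX + "NNNNNN" + SUFFIX)
barcode_cmd = build_command(PREFIX + "AAACAT" + SUFFIX)

# Print shell-ready command lines for the two runs.
print(shlex.join(wildcard_cmd))
print(shlex.join(barcode_cmd))
```

Each printed line can be pasted into a shell, and the two reports can then be compared side by side.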