sequencing / NxTrim

Adapter trimming and virtual library creation for Illumina Nextera Mate Pair libraries.
BSD 2-Clause "Simplified" License

nxtrim does not remove Illumina adapters when --rf is used #32

Closed mmokrejs closed 7 years ago

mmokrejs commented 7 years ago

Hi, the runtime help is a bit unclear about what the --rf option does. However, it appears that nxtrim v0.4.1-7db257e converts RF reads into FR by default. I found that it does not remove Illumina adapters when I use --rf to override the default behavior, probably because it searches for the adapters only in their "forward" representation and does not try the reverse-complemented form as well. Am I guessing correctly?

Further, I propose changing --rf to --disable-RF-to-FR-conversion.

Finally, would you mind documenting in the README file which assemblers/mappers cannot work with the RF reads and require FR as input?

jaredo commented 7 years ago

Hello!

The --rf tag should return reads in reverse-forward orientation without any other difference. If adapter removal is being affected by --rf then it is a bug. If you can send me an example read-pair with the problem that would be much appreciated.

I'm not keen to change the name of the argument since people might already be relying on the current name.

I have documented how to use SPAdes/Velvet with the nxtrim, but I don't have experience with other assemblers. Perhaps a wiki entry collecting successful assemblies by users could be added, similar to this one: https://github.com/sequencing/NxTrim/wiki/Bacterial-assemblies-using-Nextera-Mate-pairs

mmokrejs commented 7 years ago

> The --rf tag should return reads in reverse-forward orientation without any other difference.

I had the impression they are "naturally" in RF orientation. Therefore, I thought this option was meant to flip them so that they appear as ordinary FR reads. Would you mind clarifying the description of the option? Currently it states:

--rf leave reads in RF orientation (or use this if your reads are already in FR orientation)

> I have documented how to use SPAdes/Velvet with the nxtrim, but I don't have experience with other assemblers.

Well, it seemed you knew that it is generally safer to flip RF reads into FR reads. Also, the Illumina docs on Data Processing of Nextera Mate Pair Reads ... seemed to take the same view.

I will prepare some test cases.

jaredo commented 7 years ago

> I had the impression they are "naturally" in RF orientation.

This is correct.

> Therefore, I thought this option is to flip them so that they appear as ordinary FR reads.

It is the opposite. Basically, if you want FR reads do not use --rf. If you want RF reads, use --rf.

The default behaviour of nxtrim (without --rf) takes the "natural" RF reads and reverse-complements them when appropriate, such that the output is FR (the desired orientation for Velvet/SPAdes). Note there is a complication here: the virtual PE reads are already FR, so they do not need to be reverse-complemented.

The --rf flag does the opposite of this default behaviour, returning reads as RF.
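As a rough illustration of the default behaviour described above, here is a minimal Python sketch. It assumes the RF-to-FR conversion amounts to reverse-complementing each mate and reversing its quality string; the helper names (`reverse_complement`, `flip_read`, `rf_to_fr`) are hypothetical, and this is not NxTrim's actual code.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_read(seq: str, qual: str) -> tuple[str, str]:
    """Reverse-complement the bases and reverse the qualities to match."""
    return reverse_complement(seq), qual[::-1]


def rf_to_fr(read1: tuple[str, str], read2: tuple[str, str]):
    """Flip both mates of an RF pair (assumption: each mate is flipped in place).

    Each read is a (sequence, quality) tuple.
    """
    return flip_read(*read1), flip_read(*read2)


if __name__ == "__main__":
    mate1 = ("AACG", "ABCD")
    mate2 = ("TTGC", "EFGH")
    print(rf_to_fr(mate1, mate2))
```

Under this assumption, passing --rf would correspond to skipping the `rf_to_fr` step for mate-pair reads, while virtual PE reads would never be flipped.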

mmokrejs commented 7 years ago

> --rf leave reads in RF orientation (or use this if your reads are already in FR orientation)

Still, the documentation string is confusing. Does the "(or use this if your reads are already in FR orientation)" part really mean that it will flip input FR reads into the desired RF orientation? I don't believe so. ;-)

At least I would propose:

--rf leave mate pair reads in RF orientation as they are [by default are flipped into FR]

jaredo commented 7 years ago

Agreed. I have changed the help accordingly.

Please note the README describes the nxtrim process in detail:

The trimmer will reverse-complement the reads such that the resulting libraries will be in Forward-Reverse (FR) orientation, if you wish to keep your reads as Reverse-Forward then use --rf flag.

mmokrejs commented 7 years ago

Here are reads containing TruSeq_Adapter_Index_6 with the GCCAAT barcode, as they were output by nxtrim (after RF-to-FR conversion). Please note that these reads came from an Illumina NextSeq device, so nxtrim turned the trailing poly-G into a leading poly-C. IMHO this will cause issues with downstream trimming of reads. At the very least it should be emphasized in the README file:

> NB501598:62:HFYJ5AFXX:1:11305:22286:4051 1:N:0:GCCAAT
Length=155

 Score =   127 bits (64),  Expect = 2e-28
 Identities = 64/64 (100%), Gaps = 0/64 (0%)
 Strand=Plus/Minus

Query  1   AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  79  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  20

Query  61  CTTG  64
           ||||
Sbjct  19  CTTG  16

> NB501598:62:HFYJ5AFXX:1:11201:5173:2350 1:N:0:GCCAAT
Length=155

 Score =   127 bits (64),  Expect = 2e-28
 Identities = 64/64 (100%), Gaps = 0/64 (0%)
 Strand=Plus/Minus

Query  1   AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  68  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  9

Query  61  CTTG  64
           ||||
Sbjct  8   CTTG  5

> NB501598:62:HFYJ5AFXX:1:21203:4547:6784 1:N:0:GCCAAT
Length=155

 Score =   125 bits (63),  Expect = 7e-28
 Identities = 63/63 (100%), Gaps = 0/63 (0%)
 Strand=Plus/Minus

Query  2    GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGC  61
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  155  GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGC  96

Query  62   TTG  64
            |||
Sbjct  95   TTG  93

> NB501598:62:HFYJ5AFXX:1:11108:9901:15649 1:N:0:GCCAAT
Length=155

 Score =   119 bits (60),  Expect = 4e-26
 Identities = 63/64 (98%), Gaps = 0/64 (0%)
 Strand=Plus/Minus

Query  1    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  60
            |||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||
Sbjct  155  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTGCTG  96

Query  61   CTTG  64
            ||||
Sbjct  95   CTTG  92

> NB501598:62:HFYJ5AFXX:1:11103:16230:15181 1:N:0:GCCAAT
Length=155

 Score =   119 bits (60),  Expect = 4e-26
 Identities = 63/64 (98%), Gaps = 0/64 (0%)
 Strand=Plus/Minus

Query  1    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTG  60
            |||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||
Sbjct  155  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATATCGTATGCCGTCTTCTG  96

Query  61   CTTG  64
            ||||
Sbjct  95   CTTG  92
@NB501598:62:HFYJ5AFXX:1:11103:16230:15181 1:N:0:GCCAAT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCTATTTTTTTTTCAAGCAGAAGACGGCATACGATATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
+
A/AEEEA<A<<E</EE/A/<6AAA/A//<E///A/E/EEE/EE/E////6/</<///<<//A/<///E</////////////AEE6EEEEE/EAEE/EAEEEAEE6/EE//EEEEEEE//</EAE/EEEEEEEEEAE//E/EEEE///EEAAA/A
@NB501598:62:HFYJ5AFXX:1:11108:9901:15649 1:N:0:GCCAAT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCATTTTTTTTTCAAGCAGCAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
+
/AEAEAEEEEEE<<EE<<AE<<<AEEEEEEEEEEAEEE</AE<A<EAAAEAEEEE/EEEEEEE/<//EAE/EAEEE/A/////EEE66E//A/////E/EEEE//EEEAA///EAE//E/E///EEEEEEE//EAEEAEEA6EE6E6/EEAAAA/
@NB501598:62:HFYJ5AFXX:1:11201:5173:2350 1:N:0:GCCAAT
TTTTCAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTATAGAAGTCCAAAAAGCTTGGCCATCACGTGATTCATGGAGACTTCAGTTAGCTCCTGGAAGCTCATAGTGAGCCATTGAAATACAT
+
EAAAEEA<<AEA<EEEEEE/EEEEAE</EEEEEEE/EEEAEEEE//EEA<EEEAEEEEEAEEAEEEEEE/EEEEEAEEEEEEAEEEE<EEEE6EEEEEEEE/EEEEAEEEAEEEEEEEEEE6EEEAEEEAEEEEEEE6EEEEEEEEEEEEAAAAA
@NB501598:62:HFYJ5AFXX:1:11305:22286:4051 1:N:0:GCCAAT
CCCGCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGGTCCATCCTTCTAACGTTTATCTGTTTATATTATTCGAGGGTTTATGTGGGTTGGTTTATTTGTTTATGATTAA
+
<///////EEEAEAEA//<</6E/A/6/EAEE/A////A/EE/E<</EA/<6//<AAAA///AEAEA/E/EA//6/E/EE6<E//AE//EEAA/////E////E///////6E/AE//////EE//EEEAEEAE//6//EA//A/EEE6EAA/AA
@NB501598:62:HFYJ5AFXX:1:21203:4547:6784 1:N:0:GCCAAT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCATTTTTTTTTTCAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
+
/<A<<A/AA/<A<//<EAA<</A6A/EA/EAEAA<EAEEEEEEEEAEAA</EAEEEEEEEEEEEAEEEEEEEEAEEA</E////EEEEEE/E/EEEEEEEEEEEEAEEEEE<E/EEEAEEEEEEEEEE<AEEEAEEE/EEEEEEEEEEEEAAA/A

Since you have already closed this bug, let's move the problem of unremoved Illumina adapters to a new issue. I included the above to emphasize what the blind RF-to-FR conversion causes.
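To make the two effects above concrete, here is a small Python sketch. It uses exact string matching only, which is an assumption made for brevity (a real trimmer would need to tolerate mismatches, as two of the BLAST hits above show). The adapter is copied from the BLAST query, `READ_PREFIX` is the first 78 bases of read 11201:5173:2350, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Adapter sequence as it appears in the BLAST query above.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG"

# First 78 bases of read NB501598:62:HFYJ5AFXX:1:11201:5173:2350 from above.
READ_PREFIX = (
    "TTTTCAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"
    "ATAGAAGTCC"
)


def find_adapter(read: str, adapter: str = ADAPTER):
    """Look for an exact adapter match in either orientation.

    Returns (orientation, 0-based offset), or None if nothing matches.
    """
    pos = read.find(adapter)
    if pos != -1:
        return ("forward", pos)
    pos = read.find(reverse_complement(adapter))
    if pos != -1:
        return ("reverse-complement", pos)
    return None


if __name__ == "__main__":
    # The adapter is only found once the reverse complement is tried.
    print(find_adapter(READ_PREFIX))

    # A toy NextSeq-like read: real bases followed by a poly-G tail.
    toy = "ACGGT" + "G" * 12
    # After reverse-complementing, the tail becomes a leading poly-C.
    print(reverse_complement(toy))
```

On the read prefix, the adapter is reported in reverse-complement orientation at 0-based offset 4, which agrees with the BLAST subject coordinates 68 to 5. A forward-only search would miss it, and a trimmer that looks only for trailing poly-G would miss the leading poly-C in the flipped toy read.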