sequencing / NxTrim

Adapter trimming and virtual library creation for Illumina Nextera Mate Pair libraries.
BSD 2-Clause "Simplified" License

Make nxtrim resistant to junk polyG/C stretches from NextSeq device reads #42

Closed mmokrejs closed 7 years ago

mmokrejs commented 7 years ago

Hi Jared, while inspecting *.pe.fastq.gz from nxtrim -s .7 -w --separate --separate --preserve-mp --rf (v0.4.1-4965b00) I realized these contain suspicious polyG stretches (unlike the *.mp.fastq.gz files). I suspect they are produced by the read-joining code, and I would like nxtrim to avoid read joins on anything even barely resembling polyG, polyC, or polyN.

I fished out reads from *.pe.fastq.gz with grep GGGGGGGGGGGGGGG.

$  wc -l  HFYJ5AFXX.polyG.lst
272 HFYJ5AFXX.polyG.lst
$

I extracted the read names and went back to the original files, from which I extracted both read mates. Note that very few of the original reads match the GGGGGGGGGGGGGGG query.

$ grep -c GGGGGGGGGGGGGGG HFYJ5AFXX.*.polyG.fastq 
HFYJ5AFXX.1_5kb_R1.polyG.fastq:2
HFYJ5AFXX.1_5kb_R2.polyG.fastq:3
HFYJ5AFXX.1_8kb_R1.polyG.fastq:1
HFYJ5AFXX.1_8kb_R2.polyG.fastq:4
$

HFYJ5AFXX.1_8kb_R2.polyG.fastq.txt HFYJ5AFXX.1_8kb_R1.polyG.fastq.txt HFYJ5AFXX.1_5kb_R2.polyG.fastq.txt HFYJ5AFXX.1_5kb_R1.polyG.fastq.txt

I attach the original reads, not the ones cleaned of Illumina sequencing adapters that were actually fed into nxtrim. However, I could attach those as well, along with the reads from the *.pe.fastq.gz files.

jaredo commented 7 years ago

Note that NxTrim does not join any reads by default; joining is enabled by the --joinreads option. I have not found it to be helpful and do not use it, but your mileage may vary.

Most likely you see these in the pe library because those reads come from the ends of reads, which are typically of lower quality. It might be sensible to remove these before assembly, but in general I find assemblers are pretty darn robust to such issues.

In any case, this is out-of-scope for NxTrim and you could apply such clipping with general purpose read manipulation tools.
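As a rough illustration of the kind of clipping meant here, the following Python sketch removes a trailing run of G (or N) bases from the 3' end of a read. It is not part of NxTrim; the function name `clip_poly_g_tail` and the `min_run` threshold of 10 are assumptions chosen for the example, and an established read-trimming tool would normally be the better choice.

```python
import re

# Hypothetical helper, not part of NxTrim: clip a trailing run of G/N bases
# from the 3' end of a read. `min_run` is an assumed threshold.
def clip_poly_g_tail(seq, qual, min_run=10):
    """Return (seq, qual) with a 3' run of at least `min_run` G/N bases removed."""
    # Anchor at the end of the read so only a terminal run is clipped.
    match = re.search(r"[GN]{%d,}$" % min_run, seq)
    if match is None:
        return seq, qual
    # Trim the quality string to the same length as the sequence.
    return seq[:match.start()], qual[:match.start()]
```

Called as `clip_poly_g_tail(seq, qual)` on each FASTQ record, it leaves reads without a long terminal G/N run unchanged. Reads that become very short after clipping would still need to be filtered separately.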

mmokrejs commented 7 years ago

Given that I did not include --joinreads on the command line, something else must have created them. As I said, you can hardly find GGGGGGGGGGGGGGG in the original reads, but you do find it in all output from nxtrim.

jaredo commented 7 years ago

Yes, but you can find the reverse-complement of GGGGGGGGGGGGGGG:

$  grep -c CCCCCCCCCCCCCCC HFYJ5AFXX.*.polyG.*
HFYJ5AFXX.1_5kb_R1.polyG.fastq.txt:144
HFYJ5AFXX.1_5kb_R2.polyG.fastq.txt:0
HFYJ5AFXX.1_8kb_R1.polyG.fastq.txt:128
HFYJ5AFXX.1_8kb_R2.polyG.fastq.txt:0
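The point of the grep above is that a polyG run in an output read can correspond to a polyC run in the input if the read was reverse-complemented along the way (presumably what happens to R1 here, given the --rf flag, though that is an inference from the counts). A minimal Python sketch of checking both strands at once follows; the helper names are made up for illustration and the reads are passed in as plain sequence strings.

```python
# Hypothetical helpers for illustration; not part of NxTrim.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_reads_with_run(reads, query):
    """Count reads containing `query` and reads containing its reverse complement."""
    rc_query = reverse_complement(query)
    forward = sum(1 for read in reads if query in read)
    reverse = sum(1 for read in reads if rc_query in read)
    return forward, reverse
```

Running `count_reads_with_run(reads, "G" * 15)` on the sequences of each input file would report the polyG and polyC counts side by side, which is the same comparison as the two grep commands in this thread.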