Closed jfy133 closed 6 years ago
Good idea - do you know of a tool that can do that?
PRINSEQ, I believe.
Possibly a better (more recent) tool designed specifically for the case described above: https://github.com/OpenGene/fastp#polyg-tail-trimming
fastp is a really nice tool.
I'll add this before AdapterRemoval, leaving adapters and qualities untouched and only performing the poly-G trimming on demand (off by default, but people can turn it on if they want to!)
Some notes for myself:
SE:
fastp --in1 read1 --out1 "${read.baseName}.pG.fq.gz" -A -g --poly_g_min_len 10 -Q -L \
  -w ${task.cpus} --json "${read.baseName}"_fastp.json
PE:
fastp --in1 read1 --in2 read2 --out1 "${read1.baseName}.pG.fq.gz" --out2 "${read2.baseName}.pG.fq.gz" -A -g --poly_g_min_len 10 -Q -L \
  -w ${task.cpus} --json "${read1.baseName}"_fastp.json
Parameters to add:
params.complexity_filter = false
params.complexity_filter_poly_g_min = 10
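A rough sketch of how those two parameters could gate the single-end call. This is only an illustration in plain bash, not the pipeline code: the function name `build_fastp_cmd` and the shell variables are hypothetical stand-ins for the Nextflow `params`, and the function only prints the command rather than running fastp.

```shell
#!/usr/bin/env bash
# Sketch only: variable and function names here are hypothetical.
complexity_filter=false            # stands in for params.complexity_filter
complexity_filter_poly_g_min=10    # stands in for params.complexity_filter_poly_g_min

# Print the single-end fastp command, or nothing when the filter is off.
build_fastp_cmd() {
    local read="$1" threads="$2" base
    base="$(basename "$read")"
    base="${base%.*}"   # drop the last extension, like Nextflow's baseName

    if [ "$complexity_filter" != "true" ]; then
        return 0
    fi

    # -A, -Q and -L switch off adapter, quality and length filtering,
    # so only the poly-G trimming (-g) is applied.
    printf 'fastp --in1 %s --out1 %s.pG.fq.gz -A -g --poly_g_min_len %s -Q -L -w %s --json %s_fastp.json\n' \
        "$read" "$base" "$complexity_filter_poly_g_min" "$threads" "$base"
}
```

With the defaults above, `build_fastp_cmd sample.fq.gz 4` prints nothing. Setting `complexity_filter=true` makes it print the command with the configured minimum poly-G length.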
As of commit 24c33290d9a1fe50ab45e71777dd3d7e6f512b24, this is implemented and covered by test cases for both single-end and paired-end data.
In our group we've noticed that we regularly get lots of poly-G reads from NextSeq data, which don't get discarded by the sequencer or the demultiplexer. These can skew some downstream statistics if they aren't thrown out.
Maybe we could consider adding some form of complexity filter as a module to remove low-complexity reads?