nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License

Add complexity filter? #13

Closed jfy133 closed 6 years ago

jfy133 commented 6 years ago

In our group we've noticed that we regularly get lots of poly-G reads from NextSeq data which don't get discarded by the sequencer or demultiplexer. These can skew some downstream statistics if they are not removed.

Maybe we could consider adding some form of complexity filter as a module to remove low-complexity reads?
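
To make the idea concrete, here is a minimal Python sketch of one possible complexity filter: it scores each read by the Shannon entropy of its base composition and drops reads below a cutoff. The function names and the 1.0-bit threshold are assumptions made for this illustration and are not taken from any existing tool.

```python
import math
from collections import Counter


def base_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the base composition of a read."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_low_complexity(seq: str, min_entropy: float = 1.0) -> bool:
    """Flag reads whose composition entropy is below the (assumed) cutoff."""
    return base_entropy(seq) < min_entropy


# Hypothetical reads: a poly-G artefact and a read with mixed composition.
reads = ["GGGGGGGGGGGGGGGGGGGG", "ACGTACGTACGTACGTACGT"]
kept = [r for r in reads if not is_low_complexity(r)]
print(kept)
```

A composition-only score like this catches homopolymer reads such as poly-G, but it would not flag short tandem repeats, so a real filter would likely need a k-mer based measure instead.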

apeltzer commented 6 years ago

Good idea - do you know of a tool that can do that?

jfy133 commented 6 years ago

PRINSEQ I believe

http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

jfy133 commented 6 years ago

Possibly a better (more recent) tool designed specifically for the case described above: https://github.com/OpenGene/fastp#polyg-tail-trimming
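
For readers unfamiliar with the idea, poly-G tail trimming can be sketched in a few lines of Python. This is a toy version that only removes an exact, uninterrupted run of trailing G bases; it is not fastp's implementation, which is more involved. The function name and the example read are made up for illustration.

```python
def trim_poly_g_tail(seq: str, qual: str, min_len: int = 10):
    """Remove a trailing run of G bases if it is at least min_len long.

    Returns the (possibly shortened) sequence and quality strings.
    """
    # Length of the uninterrupted run of G at the 3' end of the read.
    run = len(seq) - len(seq.rstrip("G"))
    if run >= min_len:
        cut = len(seq) - run
        return seq[:cut], qual[:cut]
    return seq, qual


# Hypothetical read: 8 informative bases followed by a 12-base poly-G tail.
seq = "ACGTTGCA" + "G" * 12
qual = "I" * len(seq)
print(trim_poly_g_tail(seq, qual))
```

Tails shorter than `min_len` are left alone, so genuine short G runs at the end of a read are not clipped.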

apeltzer commented 6 years ago

fastp is a really nice tool.

I'll add this before AdapterRemoval, keeping adapters and qualities untouched and only performing the poly_g_trimming on demand (default off, but people can turn it on if they want to!)

apeltzer commented 6 years ago

Some notes for myself:

SE: 

fastp --in1 read1 --out1 "${read.baseName}.pG.fq.gz" -A -g --poly_g_min_len 10 -Q -L -w ${task.cpus} --json "${read.baseName}"_fastp.json

PE:

fastp --in1 read1 --in2 read2 --out1 "${read1.baseName}.pG.fq.gz" --out2 "${read2.baseName}.pG.fq.gz" -A -g --poly_g_min_len 10 -Q -L -w ${task.cpus} --json "${read1.baseName}"_fastp.json

parameters to add:

params.complexity_filter = false
params.complexity_filter_poly_g_min = 10
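
As a rough sketch of how these two parameters could gate the fastp call, the Python helper below assembles the argument list for single-end or paired-end input. It uses fastp's long option names (`--in1`, `--trim_poly_g`, `--poly_g_min_len` and so on); the helper name, the `params` dictionary, and the file names are hypothetical and only mirror the notes above, not the actual pipeline code.

```python
import shlex
from typing import List, Optional


def fastp_poly_g_args(
    in1: str,
    out1: str,
    json_report: str,
    in2: Optional[str] = None,
    out2: Optional[str] = None,
    poly_g_min_len: int = 10,
    threads: int = 1,
) -> List[str]:
    """Build a fastp command that only trims poly-G tails."""
    if (in2 is None) != (out2 is None):
        raise ValueError("in2 and out2 must be given together")
    args = ["fastp", "--in1", in1, "--out1", out1]
    if in2 is not None:
        # Paired-end: each mate gets its own output file.
        args += ["--in2", in2, "--out2", out2]
    args += [
        "--disable_adapter_trimming",   # -A: leave adapters for AdapterRemoval
        "--trim_poly_g",                # -g: force poly-G tail trimming
        "--poly_g_min_len", str(poly_g_min_len),
        "--disable_quality_filtering",  # -Q
        "--disable_length_filtering",   # -L
        "--thread", str(threads),       # -w
        "--json", json_report,
    ]
    return args


# Hypothetical stand-in for the pipeline parameters; the filter is off by default.
params = {"complexity_filter": False, "complexity_filter_poly_g_min": 10}

if params["complexity_filter"]:
    cmd = fastp_poly_g_args(
        "sample_R1.fq.gz", "sample_R1.pG.fq.gz", "sample_R1_fastp.json",
        poly_g_min_len=params["complexity_filter_poly_g_min"],
    )
    print(shlex.join(cmd))
else:
    print("complexity filter disabled, reads go straight to AdapterRemoval")
```

Returning a list rather than a single string keeps file names with spaces intact if the command is later passed to `subprocess.run`.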

apeltzer commented 6 years ago

As of commit 24c33290d9a1fe50ab45e71777dd3d7e6f512b24, this is implemented and covered by test cases for both single-end and paired-end data.