relipmoc / skewer

MIT License

N's don't get removed, trimming read quality #19

Open JustGitting opened 9 years ago

JustGitting commented 9 years ago

Hi,

I'm just trying out skewer on some 2 x 150 bp Nextera mate-pair reads (sequenced on a NextSeq) from an endophyte (fungus), so forgive me if I've set something up wrong.

In one of the raw read files, I found a large number of reads containing long runs of Ns.

$ zgrep NNNNNNNNNNNNNNNNNNNNNNN  S1_L001_R1_001.fastq.gz | wc -l
332915
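As an aside, `zgrep | wc -l` counts every matching line, which in principle could also match a quality string rather than a sequence. A sketch that counts only sequence lines (line 2 of each 4-line FASTQ record), demonstrated here on a tiny made-up inline FASTQ rather than the real files:

```shell
# Count only sequence lines (line 2 of each 4-line FASTQ record) containing
# a run of at least 10 Ns. Demo input: two records, one with an N run.
# For real data, replace the printf with: zcat S1_L001_R1_001.fastq.gz
printf '@r1\nACGTNNNNNNNNNNACGT\n+\nIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' \
  | awk 'NR % 4 == 2 && /NNNNNNNNNN/' \
  | wc -l          # -> 1
```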

So using mostly default settings, I ran skewer as follows:

ADAPTER_3PRIM="GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
ADAPTER_5PRIM="GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
ADAPTER_JUNCTION="CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"
OPTIONS="-m mp -k 9 -f sanger -Q 20 -n -t 4 -z"

skewer-0.1.123-linux-x86_64 -x ${ADAPTER_3PRIM} -y ${ADAPTER_5PRIM} -j ${ADAPTER_JUNCTION} ${OPTIONS} S1_L001_R1_001.fastq.gz S1_L001_R2_001.fastq.gz -o S1_L001_001_skewer

Checking the log, we have:

COMMAND LINE:   skewer-0.1.123-linux-x86_64 -x GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -y GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -j CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -m mp -k 9 -f sanger -Q 20 -n -t 4 -z S1_L001_R1_001.fastq.gz S1_L001_R2_001.fastq.gz -o S1_L001_001_skewer
Input file:     S1_L001_R1_001.fastq.gz
Paired file:    S1_L001_R2_001.fastq.gz
trimmed:        S1_L001_001_skewer-trimmed-pair1.fastq.gz, S1_L001_001_skewer-trimmed-pair2.fastq.gz

Parameters used:
-- 3' end adapter sequence (-x):        GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
-- junction adapter sequence (-j):      CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
-- maximum error ratio allowed (-r):    0.100
-- maximum indel error ratio allowed (-d):      0.030
-- mean quality threshold (-Q):         20
-- minimum read length allowed after trimming (-l):     18
-- file format (-f):            Sanger/Illumina 1.8+ FASTQ
-- minimum overlap length for junction adapter detection (-k):  9
-- number of concurrent threads (-t):   4
Thu May  7 09:23:41 2015 >> started

Thu May  7 09:26:44 2015 >> done (183.298s)
5780103 read pairs processed; of these:
  16577 ( 0.29%) degenerative read pairs filtered out
  13690 ( 0.24%) read pairs filtered out by quality control
 106181 ( 1.84%) non-junction read pairs filtered out by contaminant control
  14878 ( 0.26%) short read pairs filtered out after trimming by size control
 221913 ( 3.84%) empty read pairs filtered out after trimming by size control
5406864 (93.54%) read pairs available; of these:
  86780 ( 1.60%) trimmed read pairs available after processing
5320084 (98.40%) untrimmed read pairs available after processing
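For what it's worth, the filtered-out and available categories in the summary above should account for every pair processed; a quick arithmetic check on the reported counts:

```shell
# Sanity check: the five filtered categories plus the available pairs
# should sum to the total read pairs processed (counts from the log above).
awk 'BEGIN {
  total = 5780103
  sum = 16577 + 13690 + 106181 + 14878 + 221913 + 5406864
  printf "sum = %d, matches total: %s\n", sum, (sum == total ? "yes" : "no")
}'
# -> sum = 5780103, matches total: yes
```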

However, the resulting trimmed file still contains a large number of N runs.

$ zgrep NNNNNNNNNNNNNNNNNNNNNNNNN S1_L001_001_skewer-trimmed-pair1.fastq.gz | wc -l
218901

I've checked the documentation and it's not clear why there are large runs of N's.

Further, running FastQC (v0.11.3) shows virtually no difference in "per base sequence quality" between the raw reads and the trimmed reads (-Q 20). Raw reads graph:

[per_base_quality graph]

Skewer per base quality graph:

[per_base_quality graph]

Any advice much appreciated.

relipmoc commented 9 years ago

It seems that the "-Q 20" did not work as expected. Could you run it again without the parameter "-f sanger"?
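For context, skewer's -Q option thresholds on the mean Phred quality of a read, and FASTQ quality characters encode Phred scores as ASCII code minus an offset (33 for Sanger/Illumina 1.8+, 64 for older Illumina formats). A sketch, using a made-up quality string rather than anything from skewer's internals, of how the same string yields very different means under the two offsets, which is one way the assumed -f format can interact with a -Q threshold:

```shell
# Mean Phred quality of one quality string under offset 33 (Sanger) vs
# offset 64 (older Illumina). 'q' is an illustrative made-up string:
# four 'I' (ASCII 73) and four '#' (ASCII 35) characters.
q='IIII####'
for off in 33 64; do
  printf '%s' "$q" \
    | od -An -tu1 \
    | awk -v off="$off" '{ for (i = 1; i <= NF; i++) { s += $i - off; n++ } }
                         END { printf "offset %d: mean %.1f\n", off, s / n }'
done
# -> offset 33: mean 21.0
# -> offset 64: mean -10.0
```

A 31-point shift like this would move many reads across a -Q 20 cutoff, so the choice (or auto-detection) of quality format matters for mean-quality filtering.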

JustGitting commented 9 years ago

Hi Relipmoc,

So I re-ran skewer without the "-f sanger" option and there was a major improvement.

$ zgrep NNNNNNNNNNNNNNNNNNNNNNNNN S1_L001_001_skewer-trimmed-pair1.fastq.gz | wc -l
26

The FastQC per-base quality graph looks better too.

[per_base_quality graph]

What's the next step?