I will run these parameters for the first test.
I want to keep the reads as long as possible, which is why sliding-window trimming is not used in this run.
fastp parameters
# Discard reads below average quality Q22 or shorter than 50 bp
# Enable adapter auto-detection for PE reads; trim polyG tails
fastp \
--in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
--html "$sample-fastp.html" --json "$sample-fastp.json" \
--report_title "$sample fastp report" --thread 8 \
--average_qual 22 \
--length_required 50 \
--detect_adapter_for_pe --verbose \
--trim_poly_g \
--overrepresentation_analysis
Running with 8 cores for one hour, fastp managed to process 17 of the 20 files.
The duplication rate now ranges between 0.4-5.3% instead of ~20-30%. The fraction of reads that passed the thresholds ranged from 88.6% to 98.5%.
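These numbers come straight from the JSON reports; a loop like the one below (a sketch: the jq paths .duplication.rate, .filtering_result.passed_filter_reads, and .summary.before_filtering.total_reads assume fastp's standard JSON layout, and $fastp_dir is the report directory) tallies them across all samples.
# Per-sample duplication rate and % reads passing filters, from the fastp JSON reports
for json in "${fastp_dir}"/*-fastp.json; do
  jq -r --arg f "$json" '"\($f)\t\(.duplication.rate * 100)% duplication\t\(.filtering_result.passed_filter_reads / .summary.before_filtering.total_reads * 100)% passed"' "$json"
done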
The orange lines are P14052 and the blue are P12002. P14052 clearly has more stable quality, while P12002 varies more. Overall, the trim seems almost unnecessary in these plots, but we see more improvement in the other statistics.
We see that the trim smooths out the GC spikes.
We notice that the N content is not dealt with; it is, however, below 1%. It feels unnecessary to do a global trim of the first nucleotide in all reads (e.g. with --trim_front1 1).
Same setup, but with these parameters changed:
--average_qual 20
--length_required 100
The results were a bit better. It is clear that P12002 might need more stringent trimming compared to P14052.
# Sliding-window trim: scan with a 4 bp window and cut from the first
# window with mean quality < 15 to the read end
fastp \
--in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
--html "${fastp_dir}/$sample-fastp.html" \
--json "${fastp_dir}/$sample-fastp.json" \
--report_title "$sample fastp report" \
--average_qual 20 \
--length_required 100 \
--cut_right \
--cut_right_window_size 4 --cut_right_mean_quality 15 \
--detect_adapter_for_pe \
--trim_poly_g \
--dedup \
--thread 8 \
--verbose
Here we see a jump in the number of Q20 and Q30 bases. Before trimming, a sample could have 92% Q20 and 86% Q30 bases; after trimming, this jumped to 97% Q20 and 94% Q30 bases. The consequence is that we have sequences ranging from 100-151 bp. Although the peak (30 million reads) is still at 150 bp, the other lengths (100-148 bp) don't go above 200k reads each. Another issue is that we lose 16% of the data.
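The length distribution is easy to double-check directly on the trimmed output; a quick histogram (a sketch, assuming gzipped FASTQ and that $R1_out points at a trimmed file):
# Count reads at each length: line 2 of every 4-line FASTQ record is the sequence
zcat "$R1_out" | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' | sort -n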
The reads are already so good that this does not affect P14052 samples.
I will use stricter trimming for the P12002 samples; for P14052 I will focus on removing adapters and polyGs and on deduplication.
Numerous papers use different filtering. Some use sliding windows, some don't trim at all, some are strict and only keep reads of length 100, others go as low as 36. After information overload and multiple tests, I have decided to go with the more relaxed approach:
P14052's raw data has great quality, with a few whiskers dipping below Q28. However, we have adapters and duplication levels that need to be handled.
P12002's raw data is a little more varied. The boxes towards the end dip below Q30, and we have whiskers crossing Q20. But the overall quality is still good, as can be seen from the MultiQC plot.
As P12002 had more varied data, various parameter combinations were tested. As can be seen below, the number of Q30 bases does rise from ~89% to ~94%, but we lose ~17% of the data, compared to the relaxed settings where we only lose ~7%.
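The retained fraction can be read straight from the reports of the two runs; for example (a sketch: strict.json and relaxed.json are placeholder report names, and the .summary keys assume fastp's standard JSON layout):
# % of reads retained per run, from before/after totals in each report
for json in strict.json relaxed.json; do
  jq -r --arg f "$json" '"\($f): \(.summary.after_filtering.total_reads / .summary.before_filtering.total_reads * 100)% reads retained"' "$json"
done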
The parameter set:
# Store reads where either R1 or R2 does not meet thresholds
# Discard reads below average quality of Q20
# Discard reads shorter than the minimum length of 36
# Remove duplicates, adapters, and polyGs
fastp \
--in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
--html "${fastp_dir}/$sample-fastp.html" \
--json "${fastp_dir}/$sample-fastp.json" \
--report_title "$sample fastp report" \
--unpaired1 ${fastp_dir}/00.unpaired.fastq.gz \
--unpaired2 ${fastp_dir}/00.unpaired.fastq.gz \
--average_qual 20 \
--length_required 36 \
--detect_adapter_for_pe \
--trim_poly_g \
--dedup \
--thread 2 \
--verbose
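For context, a minimal driver for the test files could look like the sketch below. It is not the actual submission script: the ${sample}_R1/_R2.fastq.gz naming, the trimmed/ output directory, and reading sample names from _test_trimsamples.txt are all assumptions.
#!/bin/bash
set -euo pipefail

fastp_dir="trimmed"                  # hypothetical output directory
mkdir -p "$fastp_dir"

while read -r sample; do
  R1_in="${sample}_R1.fastq.gz"      # assumed naming convention
  R2_in="${sample}_R2.fastq.gz"
  R1_out="${fastp_dir}/${sample}_R1.trim.fastq.gz"
  R2_out="${fastp_dir}/${sample}_R2.trim.fastq.gz"
  # ...the fastp call above goes here, unchanged...
done < _test_trimsamples.txt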
To give an idea, we see how these parameters affect P12002_101.
The data does not improve that much, but then again, the data is already really good. What is important is that adapters and duplication are handled.
With the duplication resolved in #13, we can now move forward to trimming the raw data accordingly.
fastp
fastp is updated regularly, has good documentation, and is, on top of that, the most efficient quality processor of FASTQ files. fastp can automatically recognize adapters, trim by average quality or sliding window, and deduplicate, to mention a few of its features. Besides the trimmed reads, fastp outputs reports of the quality before and after in JSON or HTML. In the paper, the authors report that fastp 0.23.2 could process 9.4 GB files in 00:07:37, whereas Trimmomatic required over an hour.
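As a minimal illustration (placeholder file names, not one of the runs above), a bare paired-end call already applies the default quality filters, trims adapters found by read-overlap analysis, and writes both reports:
fastp \
--in1 reads_R1.fastq.gz --in2 reads_R2.fastq.gz \
--out1 trim_R1.fastq.gz --out2 trim_R2.fastq.gz \
--html report.html --json report.json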
Trimming strategy
The overall quality is great for both sequencing projects in both R1 and R2; what is needed is adapter removal, polyG trimming, and deduplication.
Test data
To get an idea, I have 10 P12002 and 10 P14052 forward and reverse .fastq files that I will use to test trimming parameters.
_test_trimsamples.txt