ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

Quality Control: Trimming with fastp v0.23.4 #19

Open ndreey opened 2 months ago

ndreey commented 2 months ago

With the duplication question resolved in #13, we can now move forward to trimming the raw data accordingly.

fastp

fastp is actively maintained, well documented, and among the most efficient quality processors for FASTQ files. fastp can automatically detect adapters, trim by average quality or sliding window, and de-duplicate, to mention a few of its features. Besides the trimmed reads, fastp outputs before/after quality reports in JSON and HTML. In the paper, the authors report that fastp 0.23.2 processed a 9.4 GB file in 00:07:37, whereas Trimmomatic required over an hour.

Trimming strategy

The overall quality is great for both sequencing projects in both R1 and R2; what is needed is:

Test data

To get an idea, I will use 10 P12002 and 10 P14052 samples (forward and reverse .fastq files) to test trimming parameters.

test_trim_samples.txt

P12002_101      P12002_101_R1.fastq.gz  P12002_101_R2.fastq.gz
P12002_102      P12002_102_R1.fastq.gz  P12002_102_R2.fastq.gz
P12002_103      P12002_103_R1.fastq.gz  P12002_103_R2.fastq.gz
P12002_104      P12002_104_R1.fastq.gz  P12002_104_R2.fastq.gz
P12002_105      P12002_105_R1.fastq.gz  P12002_105_R2.fastq.gz
P12002_106      P12002_106_R1.fastq.gz  P12002_106_R2.fastq.gz
P12002_107      P12002_107_R1.fastq.gz  P12002_107_R2.fastq.gz
P12002_108      P12002_108_R1.fastq.gz  P12002_108_R2.fastq.gz
P12002_109      P12002_109_R1.fastq.gz  P12002_109_R2.fastq.gz
P12002_110      P12002_110_R1.fastq.gz  P12002_110_R2.fastq.gz
P14052_118      P14052_118_R1.fastq.gz  P14052_118_R2.fastq.gz
P14052_119      P14052_119_R1.fastq.gz  P14052_119_R2.fastq.gz
P14052_120      P14052_120_R1.fastq.gz  P14052_120_R2.fastq.gz
P14052_121      P14052_121_R1.fastq.gz  P14052_121_R2.fastq.gz
P14052_122      P14052_122_R1.fastq.gz  P14052_122_R2.fastq.gz
P14052_123      P14052_123_R1.fastq.gz  P14052_123_R2.fastq.gz
P14052_124      P14052_124_R1.fastq.gz  P14052_124_R2.fastq.gz
P14052_125      P14052_125_R1.fastq.gz  P14052_125_R2.fastq.gz
P14052_126      P14052_126_R1.fastq.gz  P14052_126_R2.fastq.gz
P14052_127      P14052_127_R1.fastq.gz  P14052_127_R2.fastq.gz
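To run fastp over the whole test set, the sample sheet above can be consumed by a small driver loop. This is a sketch: the sheet name, the `data/raw`/`data/trim` layout, the fallback demo row, and the dry-run `echo` are all assumptions, not the project's actual script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of a driver loop over the sample sheet (columns: sample, R1, R2).
SHEET="test_trim_samples.txt"
RAW="data/raw"
OUT="data/trim"
mkdir -p "$OUT"

# Demo fallback so the sketch runs standalone without the real sheet.
if [ ! -f "$SHEET" ]; then
    printf 'P12002_101\tP12002_101_R1.fastq.gz\tP12002_101_R2.fastq.gz\n' > "$SHEET"
fi

# Dry run: write one fastp command per sample to a file for inspection.
while read -r sample r1 r2; do
    [ -n "$sample" ] || continue    # skip blank lines
    echo fastp \
        --in1 "$RAW/$r1" --in2 "$RAW/$r2" \
        --out1 "$OUT/${sample}_R1.trim.fastq.gz" \
        --out2 "$OUT/${sample}_R2.trim.fastq.gz" \
        --html "$OUT/$sample-fastp.html" --json "$OUT/$sample-fastp.json" \
        --report_title "$sample fastp report" --thread 8
done < "$SHEET" > fastp_commands.txt
```

Dropping the `echo` runs the commands for real (note that the echoed lines lose shell quoting, so they are for inspection only).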
ndreey commented 2 months ago

Initial test

I will run these parameters for the first test.

I want to keep the reads as long as possible, which is why sliding-window trimming is not used in this run.

fastp parameters

    fastp \
        --in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
        --html "$sample-fastp.html" --json "$sample-fastp.json" \
        --report_title "$sample fastp report" --thread 8 \
        --average_qual 22 \
        --length_required 50 \
        --detect_adapter_for_pe --verbose \
        --trim_poly_g \
        --overrepresentation_analysis

Results

Runtime

Running with 8 cores for one hour, fastp managed to process 17 of the 20 samples.

General statistics

The duplication rate now ranges between 0.4-5.3% instead of ~20-30%. The proportion of reads that passed the thresholds ranged from 88.6% to 98.5%.
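These per-sample numbers can be pulled straight out of the JSON reports instead of being read off each HTML page. A sketch, with a mock report standing in for a real one; the key names (`summary.before_filtering.total_reads`, `duplication.rate`, `filtering_result.passed_filter_reads`) match recent fastp versions but are worth verifying against an actual report.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mock fastp JSON report so the sketch runs standalone.
cat > P12002_101-fastp.json <<'EOF'
{
  "summary": {"before_filtering": {"total_reads": 1000000}},
  "duplication": {"rate": 0.053},
  "filtering_result": {"passed_filter_reads": 886000}
}
EOF

# One summary line per report: file, duplication %, % of reads passing filters.
for json in *-fastp.json; do
    python3 - "$json" <<'PY'
import json, sys

path = sys.argv[1]
with open(path) as fh:
    rep = json.load(fh)

total = rep["summary"]["before_filtering"]["total_reads"]
passed = rep["filtering_result"]["passed_filter_reads"]
dup = rep["duplication"]["rate"]
print(f"{path}\tdup={dup:.1%}\tpassed={passed/total:.1%}")
PY
done > fastp_summary.tsv
```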

Sequence quality

The orange lines are P14052 and the blue are P12002. We see that P14052 definitely has more stable quality, while P12002 varies more. Overall, trimming seems almost unnecessary in these plots, but the other statistics show clearer improvements.

Left: before trimming; right: after trimming

GC content

We see that the trim smooths out the GC spikes.

N content

We notice that the N content is not dealt with; it is, however, below 1%. It feels unnecessary to do a global trim of the first nucleotide in all reads.

Test 2

Same setup as the initial test, but with these parameters changed:

The results were a bit better. It is clear that P12002 might need more stringent trimming than P14052.

ndreey commented 2 months ago

Test 3

    fastp \
        --in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
        --html "${fastp_dir}/$sample-fastp.html" \
        --json "${fastp_dir}/$sample-fastp.json" \
        --report_title "$sample fastp report" \
        --average_qual 20 \
        --length_required 100 \
        --cut_right \
        --cut_right_window_size 4 --cut_right_mean_quality 15 \
        --detect_adapter_for_pe \
        --trim_poly_g \
        --dedup \
        --thread 8 \
        --verbose 

P12002

Here we see a jump in the number of Q20 and Q30 bases. Before trimming, a sample could have 92% Q20 and 86% Q30 bases; after trimming, this jumped to 97% Q20 and 94% Q30. The consequence is that read lengths now range from 100-151 bp. Although the peak (30 million reads) is still at 150 bp, the other lengths (100-148 bp) do not exceed 200k reads each. Another issue is that we lose 16% of the data.

image

image

P14052

The reads are already so good that this does not affect P14052 samples.

I will use stricter trimming for the P12002 samples; for P14052, I will focus on removing adapters and poly-G tails and on deduplication.
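Splitting parameters by sequencing project can be as simple as a `case` on the sample prefix. A sketch only: the helper name and the idea of reusing the test-3 sliding-window flags for P12002 (and no extras for P14052) are my assumptions, not a finalized scheme.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return project-specific extra fastp flags based on the sample name prefix.
# P12002: stricter sliding-window trim (as in test 3); P14052: no extras.
extra_args() {
    case "$1" in
        P12002_*) echo "--cut_right --cut_right_window_size 4 --cut_right_mean_quality 15" ;;
        *)        echo "" ;;
    esac
}

extra_args P12002_101
extra_args P14052_118
```

The returned string would then be appended to the shared fastp call for each sample.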

ndreey commented 2 months ago

Conclusion

Numerous papers use different filtering: some use sliding windows, some don't trim at all, some are strict and keep only reads of length 100, while others go as low as 36. After information overload and multiple tests, I have decided to go with the more relaxed approach:

P14052's raw data has great quality, with only a few whiskers dipping below Q28. However, there are adapters and duplication levels that need to be handled. image image

P12002's raw data is a little more varied. The boxes towards the end dip below Q30, and there are whiskers crossing Q20. But the overall quality is still good, as can be seen from the MultiQC plot. image image

As P12002 had more varied data, various parameter combinations were tested. As can be seen below, the number of Q30 bases does rise from ~89% to ~94%, but we lose ~17% of the data, compared to only ~7% with the relaxed settings. image

The parameters set

    # Store reads where either R1 or R2 does not meet the thresholds
    # Discard read pairs with an average quality below Q20
    # Discard reads shorter than the minimum length of 36 bp
    # Remove duplicates, adapters, and poly-G tails
    fastp \
        --in1 $R1_in --in2 $R2_in --out1 $R1_out --out2 $R2_out \
        --html "${fastp_dir}/$sample-fastp.html" \
        --json "${fastp_dir}/$sample-fastp.json" \
        --report_title "$sample fastp report" \
        --unpaired1 "${fastp_dir}/00.unpaired.fastq.gz" \
        --unpaired2 "${fastp_dir}/00.unpaired.fastq.gz" \
        --average_qual 20 \
        --length_required 36 \
        --detect_adapter_for_pe \
        --trim_poly_g \
        --dedup \
        --thread 2 \
        --verbose 
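The ~7% vs. ~17% data loss can be double-checked directly from read counts (a FASTQ record is 4 lines, so reads = lines / 4). A sketch, with small mock files standing in for a real raw/trimmed pair; the paths and file names are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mock data so the sketch runs standalone: 10 raw reads, 9 "trimmed" reads.
mkdir -p data/raw data/trim
for i in $(seq 1 10); do printf '@r%s\nACGT\n+\nIIII\n' "$i"; done \
    | gzip > data/raw/P12002_101_R1.fastq.gz
for i in $(seq 1 9); do printf '@r%s\nACGT\n+\nIIII\n' "$i"; done \
    | gzip > data/trim/P12002_101_R1.trim.fastq.gz

# A FASTQ record is 4 lines, so reads = lines / 4.
reads() { gzip -cd "$1" | wc -l | awk '{print $1 / 4}'; }

before=$(reads data/raw/P12002_101_R1.fastq.gz)
after=$(reads data/trim/P12002_101_R1.trim.fastq.gz)
awk -v b="$before" -v a="$after" \
    'BEGIN { printf "P12002_101_R1: kept %.1f%% of reads\n", 100 * a / b }' \
    > kept.txt
cat kept.txt
```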

To give an idea, we see how these parameters affect P12002_101. image image

The data does not improve that much, but then again, it was already really good. What matters is that adapters and duplication are handled. image