ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)
0 stars 0 forks source link

Quality Control: Duplication bug, BAD MERGE #13

Open ndreey opened 2 months ago

ndreey commented 2 months ago

P12002 Duplicated Reads

Inspecting the FastQC and MultiQC reports.

Overrepresented and duplicated sequences

There are four samples (**_117, _145, _177, _187) that have high sequence duplication levels (60 % duplication levels). The remaining are around 19 %- 24 %.

The overrepresented sequence for _177 is: ATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATG which might have the possible source of TruSeq Adapter, Index 2 (97% over 36bp). Taking it through a blastn gives results showings 100% identity with a Fraxinus pennsylvanica (Ash) as more significant hits with corona viruses and phages (default values were used to blastn).

These are the overrepresented sequences

_117: ATCGGAAGAGCACACGTCTGAACTCCAGTCACGAATTCGTATCTCGTATG 
_145: ATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATG
_177: ATCGGAAGAGCACACGTCTGAACTCCAGTCACTCCGCGAAATCTCGTATG
_187: ATCGGAAGAGCACACGTCTGAACTCCAGTCACTCTCGCGCATCTCGTATG
rnd:  ATCGGAAGAGCACACGTCTGAACTCCAGTCACTAATGCGCATCTCGTATG
      ATCGGAAGAGCACACGTCTGAACTCCAGTCACTCTCGCGCATCTCGTATG

We expect to have the peak at ONE, indicating that each fragment was sequence once then for the plot to fall off. What we see in these four samples is that there is a peak at 2 and 4, the peak at 10 seems to fit with the rest of the data. Because it is metagenomics and the microbes can have varying abundances, it is not odd to see some sequences being more "duplicated" but not at these percentages.

image

Per Base Sequence content

The GC% and AT% is similar to the non-duplicated reads.

GC %

From the MultiQC we get a warning for P12002_109_R1 at Per Sequence GC Content. Overall, the range is between 35 % and 36 % for the rest.

The left graph is from _109 (FastQC) and right is from _108. We see that the distribution is similar with a small peak at low GC % and large peak around 36 %. The difference is the number of reads. Probably due to the adapters that have not been removed.

Conclusion

When comparing the FastQC results of OK samples (_101, _156) with the four duplicated samples it seems that the reason for the duplication is due to a technical aspect. I don't think it is PCR as then the peak would not be at x2/x4 but at higher levels like x10k. The most likely reason for the duplication levels for _117, _145, _177 and _187 is due to low DNA complexity compared to the others. All samples are sequenced with the same depth, if the DNA preparation for the high duplicated samples yielded low amount of DNA or low DNA complexity, then those sequences present will be over sequenced, leading to the odd duplication levels. When comparing with the other statistics (FastQC and MultiQC), they do not differ in Per Base Sequence Conent or Per Sequence Quality. As can bee seen in the MultiQC plot above, the green samples have some x2 duplication and x4 duplication. It is almost as if the red lines has been frame shifted, where x1 in green has become x2 in red.

However, i will need to double check my lane merging, perhaps they got written twice. When looking at the MultiQC from the 00-Reports folder these duplications are not found, all samples follow the same pattern. Hence, there is a good chance that the merger caused this (i hope it is, because biologically this duplication does not make sense).

In other words, if the merger was ok, then the troublesome samples will be de-duplicated or discarded.

ndreey commented 2 months ago

Merger was not ok

As i had tested my merge_samples_across_lanes.sh script on _145 (and evidently _117, _177 and _185, but not documented in #3) there existed merged files already for the samples. In my script i had used touch "R1_out" "R2_out" to overwrite existing samples. However, touch does not overwrite but re-creates the files including the information they contained, meaning it keeps the contents of existing files.

I have now updated my overwriting part of the code to truncate -s 0 "$R1_out" "$R2_out". truncate will generate the files and set the byte size to 0 (-s 0), hence if they exists they will become empty.

To make sure no other files were affected by this mishap i will re-run everything.

ndreey commented 2 months ago

Issue resolved

It was indeed the touch command that caused the problem. The MultiQC plot now shows all reads following the sample duplication distributions.

Jobstats P12002 QC_RAW

Jobstats P14052 QC_RAW