SNP frequencies' results interpretation

s-andrews / BamQC

Mapped QC analysis program

GNU General Public License v3.0

SNP frequencies' results interpretation #23

Closed: mariels closed this issue 7 years ago

mariels commented 8 years ago

Hello,

I am working with whole-genome re-sequencing Illumina X-ten paired-end (PE) data. I ran BamQC on my filtered BAM files and got warning messages for SNP frequencies and SNP frequencies per type.

For the SNP frequency along the read, how is the total SNP frequency calculated?

I have attached the results I got, but I am not sure how to interpret the plots. Could you please give me some advice? I am wondering whether anything could be wrong with the data and whether further filtering is required.

Thank you, Best regards,

Marie

BamQC_SNFfreq.pdf

pdp10 commented 7 years ago

Dear Marie,

Thanks for your message. BamQC has not yet been officially released, and one aspect that has not yet been well tested is the threshold level for warning and error messages (see Issue #5).

Note that the warning and error levels in BamQC should be read as suggestions for general cases; they are not meant to replace a careful examination of the analysis results.

Kind regards, Piero

mariels commented 7 years ago

Dear Piero,

Thanks for your answer. I have another question. After indel realignment, I get a peak in deletion frequency at the end of the read (at 151 bp, although the reads should be 150 bp; please see the attached PDF). Should I interpret this as the last base being counted as a deletion, the last base being cut, or the reads having been realigned so that the indel lies after the end of the read? I would like to know how it is calculated so that I can understand the results.

Best regards,

Marie

indel_realignment.pdf

pdp10 commented 7 years ago

Dear Marie,

Just to be clear, BamQC does not do any read realignment. For each read position, it counts the number of indels and the total number of reads (reads may have different lengths), so the plots show the percentage of indels at each position. If first and second reads are present, these are calculated and shown separately. For instance, at position 70 bp you have about 0.12% deletions for the first read, meaning that 0.12% of those reads contain a deletion at that position. The last base shown in the plot is the last position of the longest read. If some reads have deletions at that position, they are shown as described above.
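
As a rough illustration of the counting described above, here is a minimal Python sketch. It is not BamQC's actual code (BamQC is a Java program). The function name `deletion_percent_by_position` is hypothetical, and so is the rule used to place a deletion on the read: it is attributed to the last read base before the gap, which is an assumption made only for this example. The input is a list of CIGAR strings, and first and second reads are not separated here.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deletion_percent_by_position(cigars):
    """Return {read position (1-based): % of reads covering it with a deletion}.

    Illustrative only: the choice of which read position a deletion
    belongs to is an assumption, not taken from BamQC.
    """
    deletions = Counter()  # reads with a deletion at each position
    coverage = Counter()   # reads long enough to reach each position

    for cigar in cigars:
        pos = 0  # read bases consumed so far
        for length, op in CIGAR_OP.findall(cigar):
            if op in "MIS=X":      # operations that consume read bases
                pos += int(length)
            elif op == "D":        # deletion relative to the reference
                deletions[pos] += 1
        # The read contributes to the denominator of every position it spans.
        for p in range(1, pos + 1):
            coverage[p] += 1

    return {p: 100.0 * deletions[p] / coverage[p] for p in sorted(coverage)}


if __name__ == "__main__":
    reads = ["5M1D5M", "10M", "5M2D3M", "4M1D4M"]
    for position, percent in deletion_percent_by_position(reads).items():
        print(position, f"{percent:.1f}%")
```

Because each position is divided by the number of reads that reach it, positions near the end of the longest reads have a smaller denominator, so a few events there can produce a large percentage.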

Hope this helps, Kind regards, Piero