jts / ncov-tools

Small collection of tools for performing quality control on coronavirus sequencing data and genomes
MIT License

Question about ambiguity threshold #97

Closed ChadFibke closed 2 years ago

ChadFibke commented 2 years ago

Hi @jts,

I hope all is going well. I'm currently reading through ncov-tools and had a question about the qc ambiguity threshold in: make_sample_qc_summary

Based on the information in get_qc.py, it appears you count the number of ambiguous IUPAC characters in the consensus genome using count_iupac_in_fasta, and if more than 5 ambiguous codes are found across the genome you assign EXCESS_AMBIGUITY to the sample. Does this mean a sample is flagged with EXCESS_AMBIGUITY if, say, 99% of the genome is covered but more than 5 positions are ambiguous? If so, wouldn't that be too conservative? Are there empirical data supporting this cut-off as an indicator of contamination, or am I off on the interpretation?
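For readers following along, the check described above can be sketched roughly as follows. This is a minimal illustration and not the code in get_qc.py: the function names `count_ambiguity_codes` and `has_excess_ambiguity` are hypothetical, and it assumes that `N` (no call) is not counted as an ambiguity code, which the thread does not confirm.

```python
# Hypothetical sketch of the ambiguity check discussed above.
# Assumption: N is treated as "no call" and is NOT counted as ambiguous.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHV")


def count_ambiguity_codes(fasta_text):
    """Count IUPAC ambiguity codes across all sequence lines of a FASTA string."""
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        count += sum(1 for base in line.strip().upper() if base in IUPAC_AMBIGUITY_CODES)
    return count


def has_excess_ambiguity(fasta_text, threshold=5):
    """Flag a consensus when it has more than `threshold` ambiguity codes."""
    return count_ambiguity_codes(fasta_text) > threshold
```

Under this reading, a consensus with exactly 5 ambiguous positions would not be flagged; the sixth one triggers the flag, regardless of how much of the genome is covered.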

Best, Chad

rdeborja commented 2 years ago

Hi Chad,

The flags in the qc_pass column of the qc_reports/_summary_qc.tsv file were intended as an indicator for further review and not as a hard cut-off for a sample to fail. We use it in conjunction with the qc_reports/*_ambiguous_position_report.tsv which provides details on the position, the total number of samples with the ambiguous base and the IUPAC code.

Cheers, Richard

ChadFibke commented 2 years ago

Hi @rdeborja,

Thanks for your response. Is there any information you use to carry-forward/discount a sample once flagged with excess ambiguity (I would imagine any position overlapping positions-of-interest)?

Best, Chad

rdeborja commented 2 years ago

Yes to the overlapping position of interest. The number of samples with the same ambiguity code at the same position can be found in the qc_reports/*ambiguous_position_report.tsv. I'll follow up by looking into the qc_reports/*mixture_report.tsv file which provides allele counts and proportions. Never hurts to look at the BAMs in IGV too. Note that the mixture and ambiguous base reports apply to the Illumina platform only at this time. We're still working with the error rates in Oxford Nanopore data to get the same type of reports.
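As a rough illustration of the "same ambiguity code at the same position" idea, the tally could be sketched like this. It is not how ncov-tools builds its report and does not reproduce the report's format; the function name `shared_ambiguous_positions` is hypothetical, and the sketch assumes all consensus sequences share the same coordinate system (for example, they are already aligned to the reference) and that `N` is not counted.

```python
from collections import Counter

# Assumption: N is treated as "no call" and is NOT counted as ambiguous.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def shared_ambiguous_positions(consensus_by_sample):
    """Tally how many samples carry each (1-based position, IUPAC code) pair.

    `consensus_by_sample` maps a sample name to its consensus sequence;
    sequences are assumed to be in the same coordinate space.
    """
    tally = Counter()
    for sequence in consensus_by_sample.values():
        for position, base in enumerate(sequence.upper(), start=1):
            if base in AMBIGUITY_CODES:
                tally[(position, base)] += 1
    return tally
```

A (position, code) pair seen in many samples of a run points toward a systematic cause worth inspecting, while a pair seen in a single sample looks more sporadic.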

jts commented 2 years ago

Just to chime in with a few thoughts. Ambiguous positions can happen for quite a few reasons:

- RT/amplification artifacts for low quality/quantity samples. These should be relatively sporadic, so not consistent across samples
- contamination. Depending on how bad the contamination is, this can lead to a few samples with the same artifact (or many if the contamination is particularly bad)
- incorrect primer trimming
- alignment artifacts
- true intra-host variation/co-infections (rare but not unheard of)

So, as @rdeborja said, the interpretation is situational, and this is a flag for follow-up/inspection. In general, though, if the sample has high Ct and/or low completeness, the ambiguous bases are probably caused by RT/amplification issues.

ChadFibke commented 2 years ago

Thanks for all the input, I appreciate it!