tkonopka / GeneticThesaurus

Thesaurus for genetic variants
Apache License 2.0

Definition of the output BAF file columns #1

Closed qsonehara closed 2 years ago

qsonehara commented 2 years ago

Dear @tkonopka ,

I'm trying to apply GeneticThesaurus to our WGS data. Where can I find definitions of the columns in the output BAF files, especially sample.thesaurus.BAF, sample.naive.cov, and sample.thesaurus.cov? My understanding is that sample.thesaurus.BAF is a BAF estimate that takes the alternate sites into account, and that it ranges from 0 to 1. However, some variants in my output BAF files have sample.thesaurus.BAF values >1, which I did not expect. Is my understanding incorrect?

tkonopka commented 2 years ago

Thanks for your interest in this software, @qsonehara. Yes, your impressions are correct. I'll try to write out some details here at length, to serve as documentation.

Just to get on the same page: files with names ending with baf.tsv are tables where each row carries information about a mutation at a genomic locus/site. The sites are defined by coordinates in columns chr and position. But when mutations occur in genomic regions that are similar to other regions, the software creates links to other genomic sites (the links are enumerated in other files). For a given row in the table, some of the columns contain information from alignments at just the coordinates chr and position. Other columns contain data aggregated from all the linked sites.

When processing a single sample, the columns will be:

Allelic frequencies at a single site are calculated through simple ratios, so, as you noted, they take values in [0, 1]. The adjusted estimates, however, use information from multiple genomic sites. The formula is thesaurus_baf = (thesaurus_synonyms + 1) * thesaurus_alt_count / thesaurus_coverage, where thesaurus_synonyms and thesaurus_coverage are values from two columns in the table. The quantity thesaurus_alt_count is not in the table; it represents the total number of reads that carry the alternate allele across the linked sites.
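
As an illustration only, the formula above can be written as a small Python function. This is a sketch, not code from GeneticThesaurus: the function name `adjusted_baf` is hypothetical, the juxtaposition in the formula is read as multiplication, and returning `None` for zero coverage is an assumption made here to avoid a division by zero.

```python
from typing import Optional


def adjusted_baf(thesaurus_synonyms: int,
                 thesaurus_alt_count: int,
                 thesaurus_coverage: int) -> Optional[float]:
    """Adjusted BAF as described in the comment above (hypothetical helper).

    thesaurus_synonyms:  number of linked sites (a column in the table)
    thesaurus_alt_count: reads carrying the alternate allele across the
                         linked sites (not a column in the table)
    thesaurus_coverage:  aggregated coverage (a column in the table)
    """
    if thesaurus_coverage <= 0:
        # Assumption: treat zero coverage as "no estimate available".
        return None
    return (thesaurus_synonyms + 1) * thesaurus_alt_count / thesaurus_coverage


# One linked site, 10 alternate reads out of 40 aggregated reads.
print(adjusted_baf(1, 10, 40))
# Same site count, but 30 alternate reads: the estimate exceeds 1.
print(adjusted_baf(1, 30, 40))
```

The second call shows how the scaling factor (thesaurus_synonyms + 1) can push the estimate above 1 whenever the alternate allele is present on more of the aggregated reads than the simple scenario assumes.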

This formula is a crude attempt to counteract dilution effects due to mapping ambiguities. The adjusted baf works like an allelic frequency in some simple scenarios, but can give values >1. Consider some scenarios with two genomic regions that are very similar on diploid chromosomes. Because of the sequence similarity, consider that a position X in one region can be confused with position Y in the other region.

In summary, the adjustment works as a canonical allelic frequency only in simple scenarios. When it works, it is arguably more informative than the naive estimate. In more complex scenarios, the interpretation is more challenging. Values of the adjusted baf >1 are one of the signals that there may be more going on than can be captured by the baf.tsv table, and by this software. Including values >1 in the output was a deliberate choice to convey this signal to downstream analysis.

Could the baf adjustment be improved? Yes. Tackling the second scenario above would require some assumptions about read-depth variability, but it is doable and would be useful. Tackling the issues with structural variants is more complicated and would require thought. You are welcome to work in this direction if it interests you!

What to do about baf estimates >1? This depends on your data and what you would like to achieve in downstream analysis. One approach could be to cap the estimates at unity (replace values >1 by 1). Or, you might want to analyze cases with adjusted baf >1 separately. Sorry I can't be more concrete here. This is largely an unexplored area still.
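
Both options can be sketched in a few lines of Python. This is only an illustration under assumptions: rows are taken to be dictionaries already read from a baf.tsv table, the column name `sample.thesaurus.BAF` is borrowed from the question above (the real prefix depends on the sample name), and the helper `cap_and_flag` is hypothetical rather than part of the software.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def cap_and_flag(rows: List[Row],
                 column: str = "sample.thesaurus.BAF") -> Tuple[List[Row], List[Row]]:
    """Return (capped, flagged) without modifying the input rows.

    capped:  copies of all rows with the adjusted BAF limited to at most 1
    flagged: the original rows whose adjusted BAF was above 1
    """
    capped: List[Row] = []
    flagged: List[Row] = []
    for row in rows:
        value = float(row[column])
        if value > 1.0:
            # Keep the untouched row for separate analysis or QC.
            flagged.append(row)
        capped.append({**row, column: min(value, 1.0)})
    return capped, flagged


example_rows = [
    {"chr": "chr1", "position": 100, "sample.thesaurus.BAF": 0.4},
    {"chr": "chr1", "position": 200, "sample.thesaurus.BAF": 1.5},
]
capped_rows, flagged_rows = cap_and_flag(example_rows)
print(capped_rows)
print(flagged_rows)
```

Keeping the flagged rows alongside the capped table preserves the signal that something more complex may be happening at those sites.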

Hope this helps.

qsonehara commented 2 years ago

Thank you for your clear explanation! It is really helpful. I will use thesaurus_baf values >1 (especially very high ones) as a quality control criterion for the thesaurus sites.