Definition of the output BAF file columns

qsonehara commented 2 years ago

Dear @tkonopka ,

I'm trying to apply GeneticThesaurus to our WGS data. Where could I look to know the definition of the output BAF files columns, especially sample.thesaurus.BAF, sample.naive.cov, and sample.thesaurus.cov? My understanding is that sample.thesaurus.BAF is a BAF estimate when the alternate sites are considered and that the frequency ranges from 0 to 1. However, it seems that some variants in my output BAF files have sample.thesaurus.BAF of >1, which I did not expect. Is my understanding incorrect?

tkonopka commented 2 years ago

Thanks for your interest in this software, @qsonehara. Yes, your impressions are correct. I'll try to write some details here in longform as a form of documentation.

Just to get on the same page: files with names ending with baf.tsv are tables where each row carries information about a mutation at a genomic locus/site. The sites are defined by coordinates in columns chr and position. But when mutations occur in genomic regions that are similar to other regions, the software creates links to other genomic sites (the links are enumerated in other files). For a given row in the table, some of the columns contain information from alignments at just the coordinates chr and position. Other columns contain data aggregated from all the linked sites.

When processing a single sample, the columns will be:

chr - string, chromosome
position - integer, placement along the chromosome
ref - character, reference base at the site
alt - character, mutation
thesaurus.synonyms - integer, number of links to alternative genomic sites
sample.naive.BAF - float, estimate of the allelic frequency of the mutation in the sample. This is computed from reads aligned only to the given chromosome and position. It should correspond to the number of reads carrying the mutation divided by the number of reads aligned at the site
sample.thesaurus.BAF - float, adjusted estimate of the allelic frequency of the mutation. This estimate is based on data from the given chromosome and position plus data from the alternative genomic sites.
sample.naive.cov - integer, number of reads aligned to the site
sample.thesaurus.cov - integer, combined number of reads covering the site and the alternative sites

Allelic frequencies at a single site are calculated through simple ratios, so, as you noted, they take values in [0, 1]. The adjusted estimates, however, use information from multiple genomic sites. The formula is thesaurus_baf = (thesaurus_synonyms+1) thesaurus_alt_count / thesaurus_coverage, where thesaurus_synonyms and thesaurus_coverage are values from two columns in the table. The quantity thesaurus_alt_count is not in the table; it represents the total number of reads that carry the alternate allele across the linked sites.

This formula is a crude attempt to counteract dilution effects due to mapping ambiguities. The adjusted baf works like an allelic frequency in some simple scenarios, but can give values >1. Consider some scenarios with two genomic regions that are very similar on diploid chromosomes. Because of the sequence similarity, consider that a position X in one region can be confused with position Y in the other region.

Suppose the ground truth is that there is a heterozygous substitution at X (expected baf 0.5), no mutations at Y, and sequencing depth is around 40. Keeping numbers round for the example, or thinking in terms of expected values, a dataset might then have 20 reads that carry the mutation and 60 reads that do not. Upon alignment, the reads would be split onto the two genomic regions. Position X would receive half of the mutated reads and half of the non-mutated reads, so the naive baf would be ~10/40=0.25. The adjustment procedure would instead link X to one additional site (1 synonym), and use the total mutation count (20) and total coverage (80). The adjusted baf would be ~(1+1)*20/80=0.5. The adjustment thus brings the estimate closer to the ground truth.
Suppose the ground truth is that a mutation is homozygous at X (expected baf 1.0), there is no mutation at Y, and sequencing depth is 40. Suppose a dataset has 40 reads with mutations and 40 reads without, aligned and spread over the two sites. The naive baf at each site would be ~20/40=0.5, and the adjusted baf would be ~(1+1)*40/80=1.0. The adjusted baf again works well. But if we imagine that a dataset has 42 reads with the mutation and 38 without, just by chance, then the adjusted baf can go slightly above 1.

Suppose that there is a homozygous substitution at X, there is no mutation at Y, and sequencing depth is 40. Suppose that the region holding Y is missing or has very low coverage, and the aligner places all reads correctly at position X and none at Y. This can happen for several reasons, e.g. when there are other substitutions or structural variants in the neighborhood, when there are biases in library capture due to the wider neighborhoods around X and Y, if the alignment pipeline used masking, or when the link-making algorithm produced too many links. Then the adjusted baf may actually become as high as 2, 3, even higher, and it becomes hard to interpret it as a canonical allelic frequency.
Suppose that there is a homozygous substitutions at both X and Y, and sequencing depth is 40. (This can happen in a highly mutated sample, or it can be indicative of issues with the reference sequence). A dataset would then have 80 reads with the mutation, 40 at each X and Y. The naive baf would be ~40/40=1. The adjusted baf at X would be ~(1+1)*80/80=2. The adjusted baf conveys that there are two mutations in the X-Y cluster. So while the value is unconventional as an allelic frequency, it is interpretable.

In summary, the adjustment works as a canonical allelic frequency only in simple scenarios. When it works, it is arguably more informative than the naive estimate. In more complex scenarios, the interpretation is more challenging. Values of the adjusted baf >1 are one of the signals that there may be more going on than can be captured by the baf.tsv table, and by this software. Including values >1 in the output was a deliberate choice to convey this signal to downstream analysis.

Could the baf adjustment be improved? Yes. Tackling the second scenario above would require some assumptions about read-depth variability, but it is doable and would be useful. Tackling the issues with structural variants is more complicated and would require thought. You are welcome to work in this direction if it interests you!

What to do about baf estimates >1? This depends on your data and what you would like to achieve in downstream analysis. One approach could be to cap the estimates at unity (replace values >1 by 1). Or, you might want to analyze cases with adjusted baf >1 separately. Sorry I can't be more concrete here. This is largely an unexplored area still.

Hope this helps.

qsonehara commented 2 years ago

Thank you for your clear explanation! It is really helpful. I will utilize the thesaurus_baf values with >1 (especially very high ones) as quality control criteria for the thesaurus sites.

tkonopka / GeneticThesaurus

Definition of the output BAF file columns #1