Closed bwalsh closed 1 week ago
Was thinking through this work last week and forgot to follow up. First addressing...
"does the participant have the allele?" Last week, Kyle explained that as a first pass, a participant can be considered to have a variant if they have at least 1 matching allele (i.e. a GT of 0/1, 1/0, or 1/1); in later passes we can worry about filtering on the other FORMAT fields. So for a variant ID expressed on the same assembly as the original VCF (HGVS, gnomAD GRCh37, gnomAD GRCh38, etc.), using tabix with the gsutil indexing you found is sufficient to determine a participant-allele match. However, if the assembly of the VCF differs from that of the user-specified variant of interest, we will need to use the VRS Start and Stop attributes (normalized across assemblies) rather than just the coordinates in the VCF.
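A minimal sketch of the first-pass rule described above, assuming the GT string has already been pulled out of the sample column. The function name `has_alt_allele` and its `alt_index` parameter are hypothetical, not taken from the existing code; phased genotypes (`|`) are treated the same as unphased ones.

```python
import re


def has_alt_allele(gt: str, alt_index: int = 1) -> bool:
    """Return True if the genotype carries at least one copy of the ALT allele.

    gt        -- raw GT value from the VCF sample column, e.g. "0/1" or "1|1"
    alt_index -- 1-based position of the allele of interest in the ALT column
    """
    # Split on either separator; missing calls (".") never match an index.
    alleles = re.split(r"[/|]", gt)
    return str(alt_index) in alleles


if __name__ == "__main__":
    for gt in ("0/1", "1/0", "1/1", "0/0", "./."):
        print(gt, has_alt_allele(gt))
```

Filtering on the other FORMAT fields (AD, GQ, etc.) is deliberately left out here, matching the "first pass" scope.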
Given this, there are a few things I could focus on without overlapping with James' parquet work...
Thoughts?
Moving documentation out of the code to enable pytests
"""
documentation for values
e.g.: GT:AD:GQ:RGQ, GT:AD:FT:GQ:RGQ
`0/0:.:40:.` vs `1/1:0,16:45:123`
from header
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
# see https://samtools.github.io/hts-specs/VCFv4.1.pdf
In Variant Call Format (VCF) files, GT stands for genotype, which is encoded as allele indices separated by a slash (/) or vertical pipe (|):
0: The reference allele (the REF column)
1: The first allele listed in ALT
2: The second allele listed in ALT
Forward slash (/): Indicates that the genotype is unphased
Vertical pipe (|): Indicates that the genotype is phased
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
see https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format#:~:text=Allele%20depth%20(AD)%20and%20depth,each%20of%20the%20reported%20alleles.
In a Variant Call Format (VCF) file, AD stands for Allele Depth,
which is the number of reads that support each allele.
The AD field is an array that includes the reference allele as the first entry,
and the remaining entries are for each alternate allele at that locus
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
see https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format#:~:text=PL%20is%20calculated.-,GQ,how%20they%20should%20be%20used.
In Variant Call Format (VCF), GQ stands for Genotype Quality and is a measure of how confident a genotype assignment is. GQ is calculated by taking the difference between the PL of the second most likely genotype and the PL of the most likely genotype. The PLs are normalized so that the most likely PL is always 0, so the GQ is usually equal to the second smallest PL. However, the GQ is capped at 99, so if the second most likely PL is greater than 99, the GQ will be 99.
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
see https://support.researchallofus.org/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized
Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
"""
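To make the field documentation above concrete, here is a rough sketch of how a sample column could be paired with its FORMAT keys, and how GQ relates to the PLs. The helper names (`parse_sample`, `gq_from_pl`, `phred_to_error_probability`) are hypothetical and not part of the existing code; the GQ rule follows the GATK description quoted above, and the phred conversion follows the `-10*log10 p` definition in the RGQ header line.

```python
def parse_sample(format_keys: str, sample_value: str) -> dict:
    """Pair FORMAT keys (e.g. "GT:AD:GQ:RGQ") with one sample's values."""
    record = {}
    for key, raw in zip(format_keys.split(":"), sample_value.split(":")):
        if raw == ".":
            record[key] = None  # missing value
        elif key == "AD":
            # One depth per allele: reference first, then each ALT allele.
            record[key] = [None if v == "." else int(v) for v in raw.split(",")]
        elif key in ("GQ", "RGQ"):
            record[key] = int(raw)
        else:
            record[key] = raw  # GT, FT and anything else stay as strings
    return record


def gq_from_pl(pls, cap: int = 99) -> int:
    """GQ = second-smallest PL minus smallest PL, capped at 99."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(p) to recover p, the probability the call is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    print(parse_sample("GT:AD:GQ:RGQ", "0/0:.:40:."))
    print(parse_sample("GT:AD:GQ:RGQ", "1/1:0,16:45:123"))
    print(gq_from_pl([1200, 150, 0]))
```

Keeping the parsing separate from the quality helpers should make each piece easy to cover with its own pytest.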
Created tests to
The next step is to combine this into a single function that takes as parameters...
and returns CAF dicts
Acceptance criteria
User Story
As a GREGoR analyst, I want to be able to get allele counts and any aggregated phenotype information for a specified variant in the GREGoR cohort so that I can use it in downstream analyses.
A proof of concept flow will be done that generates a cohort allele frequency object given a variant of interest, VCF, phenotype of interest (optional), and any precomputed indices
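As an illustration of the counting step in that flow, the sketch below tallies alleles across a cohort from per-participant GT strings. It is only a sketch under assumptions: the function name, the `include` parameter (standing in for an optional phenotype filter), and the dictionary keys are hypothetical and are not claimed to match the actual CAF schema.

```python
import re
from typing import Dict, Iterable, Optional


def cohort_allele_counts(
    genotypes: Dict[str, str],
    alt_index: int = 1,
    include: Optional[Iterable[str]] = None,
) -> dict:
    """Count the allele of interest across participants.

    genotypes -- participant ID -> GT string (e.g. "0/1")
    alt_index -- 1-based index of the allele of interest in ALT
    include   -- optional participant IDs to restrict to (e.g. those with
                 the phenotype of interest); None means everyone
    """
    keep = None if include is None else set(include)
    focus = 0  # copies of the allele of interest
    locus = 0  # all called alleles at this site
    for participant, gt in genotypes.items():
        if keep is not None and participant not in keep:
            continue
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing calls do not count toward the denominator
            locus += 1
            if allele == str(alt_index):
                focus += 1
    return {
        "focus_allele_count": focus,
        "locus_allele_count": locus,
        "focus_allele_frequency": focus / locus if locus else None,
    }


if __name__ == "__main__":
    cohort = {"p1": "0/1", "p2": "1/1", "p3": "0/0", "p4": "./."}
    print(cohort_allele_counts(cohort))
    print(cohort_allele_counts(cohort, include=["p1", "p3"]))
```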
In-scope / Design
Out of scope / Future Work
This PR: