Question on the bed files

Bin-Guan commented 3 years ago

Hi, I used the following variants in KCNQ2 to test the CCR bed file for annotation, and found that the bed regions do not make sense. The tested variants are the three overlapping a nucleotide with CCR percentile 0 according to the browser (https://s3.us-east-2.amazonaws.com/ccrs/ccr.html). I expect the first line 20-62076114-C-A to be 0 CCR percentile, since it's synonymous. I suspect the bed file coordinates are off by 1? Thanks. 20 62076114 KCNQ2:p.A196A C A . . ccr_pct=99.7099;fathmm_xf_coding=0.008712 20 62076115 KCNQ2:p.A196V G A . . ccr_pct=0;ClinPred_Score=0.99;fathmm_xf_coding=0.9454 20 62076116 KCNQ2:p.A196S C A . . ccr_pct=67.1065;ClinPred_Score=0.9841;fathmm_xf_coding=0.9445

The CCR bed file looks like this: 20 62076011 62076114 99.709937837 KCNQ2 62073863-62073884,62076011-62076114 .... 20 62076114 62076115 0.000000000 KCNQ2 62076114-62076115 ... 20 62076115 62076130 67.106472851 KCNQ2 62076115-62076130 ...

jimhavrilla commented 3 years ago

Annotation looks good to me. Did you make sure the overlapping variant passes all filters? Also CCRs did not use synonymous variants in their construction. Not sure what your issue is from how this question is phrased. The values you have for ccr_pct for each are correct according to the CCR file coordinates you provided.

Jim Havrilla

On Sat, Apr 3, 2021, 10:45 PM Bin-Guan @.***> wrote:

Hi, I used the following variants in KCNQ2 to test the CCR bed file for annotation, and found that the bed regions do not make sense. The tested variants are the three overlapping a nucleotide with CCR percentile 0 according to the browser (https://s3.us-east-2.amazonaws.com/ccrs/ccr.html). I expect the first line 20-62076114-C-A to be 0 CCR percentile, since it's synonymous. I suspect the bed file coordinates are off by 1? Thanks. 20 62076114 KCNQ2:p.A196A C A . . ccr_pct=99.7099;fathmm_xf_coding=0.008712 20 62076115 KCNQ2:p.A196V G A . . ccr_pct=0;ClinPred_Score=0.99;fathmm_xf_coding=0.9454 20 62076116 KCNQ2:p.A196S C A . . ccr_pct=67.1065;ClinPred_Score=0.9841;fathmm_xf_coding=0.9445

The CCR bed file looks like this: 20 62076011 62076114 99.709937837 KCNQ2 62073863-62073884,62076011-62076114 .... 20 62076114 62076115 0.000000000 KCNQ2 62076114-62076115 ... 20 62076115 62076130 67.106472851 KCNQ2 62076115-62076130 ...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/quinlan-lab/ccrhtml/issues/13, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSDYBGUAN3OH2WWYDV6JKLTG7HEBANCNFSM42K4UAVA .

Bin-Guan commented 3 years ago

Thanks for your reply. My question is whether the coordinates in the bed file is off by 1. In the examples I gave above, if there is a nucleotide to be ccr_pct = 0 among the three nucleotides chr20:62076113-62076115, it has to be 62076114 because this is the 3rd nucleotide in the codon 196. Thus, I think the bed file should be like this: 20 62076113 62076114 0.000000000 KCNQ2 62076114-62076115 ...

jimhavrilla commented 3 years ago

No, the coordinates you gave above show the variant at 114 should be 99.7. CCR interval starts at 0-based open coordinate 111 goes to closed coordinate 114. 115 should be 0 since the next interval is 114-115. The nucleotides in whatever transcript's codon you are using are irrelevant. These are flattened representations of the genome across several transcripts.

On Mon, Apr 5, 2021 at 5:44 PM Bin-Guan @.***> wrote:

Thanks for your reply. My question is whether the coordinates in the bed file is off by 1. In the examples I gave above, if there is a nucleotide to be ccr_pct = 0 among the three nucleotides chr20:62076113-62076115, it has to be 62076114 because this is the 3rd nucleotide in the codon 196. Thus, I think the bed file should be like this: 20 62076113 62076114 0.000000000 KCNQ2 62076114-62076115 ...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/quinlan-lab/ccrhtml/issues/13#issuecomment-813668654, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSDYBGQSUGZ6HU3PJF75V3THIVMVANCNFSM42K4UAVA .

Bin-Guan commented 3 years ago

Even you flatten all KCNQ2 transcripts, coordinate 114's ccr_pct should still be 0, which would also match GERP+ as shown in your CCR browser https://s3.us-east-2.amazonaws.com/ccrs/ccr.html. The GERP++ score at 114 is negative, and GERP++ score at 115 is high. If you scroll to other regions, I bet you'll also find that ccr_pcts are off 1 nt comparing to GERP++.

jimhavrilla commented 3 years ago

I am not sure I see what you are trying to say anymore. What makes you say it would be 0? The CCR is correct. It goes from 012-114 as seen here below: CCR at 115 corresponds with a passing gnomAD variant (you can check that on the browser) which is in 1-based VCF format which gives it a 0: The GERP file is base-by-base with each single base having a value. Not sure what relevance GERP has though...as it has nothing to do with CCR or its calculation. They do not agree in all bases, as GERP is interspecies conservation-based and CCRs are based on human-specific variant constraint.

Bin-Guan commented 3 years ago

If 115 is 0, then 114 cannot be 99.7. I've been trying to remind you the same thing using different reasoning: a nucleotide position at the 3rd position in a codon, GERP score, or another predictor such as ClinPred. I have no doubt that the ccr matches constraint regions when you look from 1000 ft away. But I'm certain that the boundary is off 1 nt. The boundary has to be in agreement with basic biology concept and with other tools.

jimhavrilla commented 3 years ago

Did you look at any of the numbers in the images above? The indication of the gnomAD variant? The only thing that defines the boundaries of the CCRs as described in the paper?

jimhavrilla commented 3 years ago

The basic biological concept of a protein-altering variant within the coding sequence of several KCNQ2 transcripts is shown below... These are the only things that define a CCR boundary, the beginning or end of a gene's coding sequence from GRCh37, or a protein-altering variant in gnomAD version 2.0.1. CCR ends where that variant begins, at the BED file coordinate of 114-115. It is not off by 1 nt, that is where the 0 pct CCR and variant is, 114-115 with the variant being a simple missense variant that you can check here https://www.ncbi.nlm.nih.gov/snp/rs118192199. The next CCR starts right after this variant at 115 (BED coordinate) ending at 130, because the next variant is at 130-131 in BED format, 131 in VCF format. This is why I keep trying to explain that important distinction between 0-based half-open BED vs 1-based VCF format. This is not 1000ft away, these are precise, single nucleotide boundaries.

davemcg commented 3 years ago

Bin, I think you may be getting confused by the different ways of representing positions. As Jim mentioned BED is "0-based half-open." For further reading see:

https://www.biostars.org/p/7041/ http://genome.ucsc.edu/FAQ/FAQformat.html#format1

(I think this is how python, for example, indexes a vector or list)

VCF is 1-based:

http://samtools.github.io/hts-specs/VCFv4.2.pdf (see 1.4.1 "POS").

(This how R indexes a vector)

If you interconvert between BED and VCF you need to transpose the start position by one.

Bin-Guan commented 3 years ago

I actually studied this cheat sheet on 0- or 1- before asking the question: https://www.biostars.org/p/84686/. So I don't think the question is 0-based or 1-based. Here is a probable explanation: CCR only looks at missense variant and totally ignores synonymous variant, thus CCR considers position 114 (1-based) ccr_pct to be 99.7. Thus, position 114 is "constrained" based on gnomAD missense dataset, although gnomAD has a synonymous variant at this position and it has a low GERP++. Position 114 is not constrained in light of being not conserved during evolution and the synonymous change in gnomAD. This would happen to almost all positions when transitioning between the 3rd and the 2nd nt in a codon. Thus, I saw the 1-shift in the ccr browser. A better constrained matrix probably should consider both missense and synonymous changes, and only regions void of synonymous changes is bona fide constrained regions. In practice, the shift probably will not matter because very few people are interested in synonymous changes. Thanks Jim and David for the discussions.

jimhavrilla commented 3 years ago

The discussion is not a problem. If there's nothing else, feel free to close the issue, as this is a repository for the browser, not the model's design.

quinlan-lab / ccrhtml

Question on the bed files #13