kircherlab / CADD-scripts

CADD scripts release for offline scoring. For more information about CADD, please visit our website
http://cadd.gs.washington.edu

Splice-AI NA values #39

Closed aardes closed 1 year ago

aardes commented 1 year ago

Dear,

Could you please point me to the definition of the NA values in the SpliceAI columns?

Some variants have duplicate records because they are associated with more than one gene or transcript.

Sometimes there are scores for one of the genes and NA for the other. I am a bit confused and would like to know the logic behind this.

Here is an example.

10 73483791 T A NA ENSR00000980981 NA NA NA NA NA
10 73483791 T A ENSG00000214688 ENST00000441508 C10orf105 NA NA NA NA
10 73483791 T A ENSG00000107736 ENST00000224721 CDH23 0.07 0.04 0 0
10 73483791 T C NA ENSR00000980981 NA NA NA NA NA
10 73483791 T C ENSG00000107736 ENST00000224721 CDH23 0.02 0.02 0 0
10 73483791 T C ENSG00000214688 ENST00000441508 C10orf105 NA NA NA NA

Thanks in advance

aerval commented 1 year ago

Hej,

Thank you for your question. Gene specificity is a property of SpliceAI that CADD simply inherits. SpliceAI is a deep neural network that predicts splicing effects from a 10,000 bp sequence window around the position, but only if a gene is annotated at that site, and the prediction is limited to that gene. CADD annotates a number of properties (e.g. all the different amino-acid consequences) on a per-gene basis so that information from multiple genes is not mixed (e.g. an amino-acid alteration in one gene and modulation of a regulatory element of another). The final CADD score is calculated per gene, but the value annotated at a position is the maximum across all genes.
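To illustrate the last point, here is a minimal sketch of the "maximum across all genes" step. It is not the actual CADD code: the function name, the row layout, the placeholder gene names and the scores are all made up for illustration.

```python
def max_score_per_variant(rows):
    """Collapse per-gene rows to one score per variant by keeping the maximum.

    Each row is (chrom, pos, ref, alt, gene, score); this layout is an
    assumption for the sketch, not the CADD file format.
    """
    best = {}
    for chrom, pos, ref, alt, gene, score in rows:
        key = (chrom, pos, ref, alt)
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Hypothetical per-gene scores for a single variant (illustrative values only).
rows = [
    ("1", 100, "A", "G", "GENE_A", 12.5),
    ("1", 100, "A", "G", "GENE_B", 3.1),
]
collapsed = max_score_per_variant(rows)
print(collapsed)
```

The variant keeps one score even though it has one row per gene, which is why the per-gene annotation columns can differ between rows while the final score does not.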

Note that in your example, the variant lies within the gene CDH23 (in an intron, within 11 bp of a known splice site). The gene C10orf105 is located upstream; that annotation seems to refer to the canonical transcript as defined by Ensembl VEP, which is more than 3,000 bp away. Hence, SpliceAI (and likewise MMSplice) was calculated for one gene but not the other.
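If you want to keep only the rows for which SpliceAI was actually calculated, something along these lines may help. It is only a sketch: it assumes whitespace-separated rows with the same eleven fields as in your example (chromosome, position, ref, alt, gene ID, feature ID, gene name, then four SpliceAI scores), and the function names are my own.

```python
# The six records from the question, one per line.
ROWS = """\
10 73483791 T A NA ENSR00000980981 NA NA NA NA NA
10 73483791 T A ENSG00000214688 ENST00000441508 C10orf105 NA NA NA NA
10 73483791 T A ENSG00000107736 ENST00000224721 CDH23 0.07 0.04 0 0
10 73483791 T C NA ENSR00000980981 NA NA NA NA NA
10 73483791 T C ENSG00000107736 ENST00000224721 CDH23 0.02 0.02 0 0
10 73483791 T C ENSG00000214688 ENST00000441508 C10orf105 NA NA NA NA
""".splitlines()


def parse_score(field):
    """Turn a score field into a float, or None when it is NA."""
    return None if field == "NA" else float(field)


def spliceai_by_variant(lines):
    """Map each variant to {gene name: scores}, skipping rows that are all NA."""
    result = {}
    for line in lines:
        chrom, pos, ref, alt, gene_id, feature_id, gene_name, *scores = line.split()
        values = [parse_score(s) for s in scores]
        if all(v is None for v in values):
            continue  # SpliceAI was not calculated for this gene/feature
        result.setdefault((chrom, int(pos), ref, alt), {})[gene_name] = values
    return result


scored = spliceai_by_variant(ROWS)
for variant, genes in scored.items():
    print(variant, genes)
```

For your example this leaves only the CDH23 rows for both alternative alleles; the C10orf105 and regulatory-feature rows are dropped because all of their SpliceAI fields are NA.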

Technically, gene annotations from SpliceAI are also sometimes slightly different from the Ensembl ones (as we are only repurposing the genome-wide files as provided by the SpliceAI authors). It could hence also be the case that a barely described open-reading-frame based gene is not included in the SpliceAI gene list. However, this does not seem to be the case here.

I hope this answers your question.

aardes commented 1 year ago

Hi,

Thanks a lot.

Now it's clear to me. I really appreciate it.