macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Added inheritance_modes and other new columns. Closes #11. #12

Closed by bw2 8 years ago

bw2 commented 8 years ago

Added new fields from ClinVarFullRelease_00-latest.xml.gz:

inheritance_modes
age_of_onset
prevalence
disease_mechanism
xrefs

Also removed the uncompressed version of clinvar.tsv, since it now exceeds GitHub's 100 MB file size limit.
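With only the compressed file available, the table can be read straight from gzip without unpacking it first. The following is a minimal sketch, assuming the compressed file is a gzip archive with a header row; the function name `iter_clinvar_rows` and the file name used in the note below are hypothetical and not taken from the repo.

```python
import csv
import gzip


def iter_clinvar_rows(path):
    """Yield one dict per variant row, keyed by the header column names."""
    # "rt" decodes the gzip stream as text; newline="" leaves line-ending
    # handling to the csv module.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in free-text fields literally.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

For example, `for row in iter_clinvar_rows("clinvar.tsv.gz"): print(row["inheritance_modes"])` would stream the new column one row at a time, assuming the file carries that name.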

Examples:

Example rows with the new columns.

Example 1:

$1               chrom : 7
$2                 pos : 65425984
$3                 ref : G
$4                 alt : A
$5                 mut : ALT
$6       measureset_id : 893
$7      all_submitters : OMIM
$8          all_traits : Mucopolysaccharidosis type VII;MUCOPOLYSACCHARIDOSIS, TYPE VII
$9           all_pmids : 1702266
$10  inheritance_modes : Autosomal recessive inheritance
$11       age_of_onset : Childhood
$12         prevalence : <1 / 1 000 000
$13  disease_mechanism :
$14              xrefs : Genetic Alliance:Mucopolysaccharidosis+type+VII/4922;Genetic Testing Registry (GTR):GTR000502907;Genetic Testing Registry (GTR):GTR000506496;Genetic Testing Registry (GTR):GTR000519366;Genetic Testing Registry (GTR):GTR000519384;Genetic Testing Registry (GTR):GTR000528277;Genetic Testing Registry (GTR):GTR000551442;MedGen:C0085132;OMIM:253220;Office of Rare Diseases:7096;Orphanet:584

Example 2:

$1               chrom : 16
$2                 pos : 3293424
$3                 ref : T
$4                 alt : C
$5                 mut : ALT
$6       measureset_id : 97483
$7      all_submitters : Unité médicale des maladies autoinflammatoires, CHRU Montpellier
$8          all_traits : Familial Mediterranean fever;Familial Mediterranean fever
$9           all_pmids : 20301405,23742958,25628446
$10  inheritance_modes : Autosomal recessive inheritance
$11       age_of_onset : Adolescent
$12         prevalence : >1 / 1000
$13  disease_mechanism : gain of function
$14              xrefs : GeneReviews:NBK1227;Genetic Alliance:Familial+Mediterranean+fever/2756;Genetic Testing Registry (GTR):GTR000317682;Genetic Testing Registry (GTR):GTR000320963;Genetic Testing Registry (GTR):GTR000327767;Genetic Testing Registry (GTR):GTR000501207;Genetic Testing Registry (GTR):GTR000501486;Genetic Testing Registry (GTR):GTR000506386;Genetic Testing Registry (GTR):GTR000507864;Genetic Testing Registry (GTR):GTR000508733;Genetic Testing Registry (GTR):GTR000508985;Genetic Testing Registry (GTR):GTR000523787;Genetic Testing Registry (GTR):GTR000528905;Genetic Testing Registry (GTR):GTR000529138;Genetic Testing Registry (GTR):GTR000530037;MedGen:C0031069;OMIM:249100;Office of Rare Diseases:6421;Orphanet:342;SNOMED CT:12579009
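In both examples the xrefs value is a semicolon-separated list of `database:identifier` entries. A minimal sketch of splitting it, assuming database names never contain a colon (identifiers may); the helper name `parse_xrefs` is hypothetical:

```python
def parse_xrefs(field):
    """Split an xrefs string into (database, identifier) pairs."""
    pairs = []
    for entry in field.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # empty field or stray separator
        # Split on the first colon only, so identifiers may contain colons.
        database, _, identifier = entry.partition(":")
        pairs.append((database, identifier))
    return pairs
```

Applied to the tail of Example 1, `parse_xrefs("MedGen:C0085132;OMIM:253220;Orphanet:584")` gives three pairs, one per database.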

Distributions of values:

Each value is listed with the number of times it occurs in the Sept. 1, 2016 release.

inheritance_modes: 19% of variants have this populated

44775 Autosomal dominant inheritance
10392 Autosomal recessive inheritance
 2082 X-linked inheritance
 1191 Somatic mutation
  630 X-linked recessive inheritance
  573 X-linked dominant inheritance
  202 Sporadic
  112 Mitochondrial inheritance
   75 Other
   37 Codominant
   30 Autosomal unknown
   14 Sex-limited autosomal dominant

age_of_onset:

12299 All ages
10197 Infancy
 9736 Adult
 7479 Childhood
 3582 Adolescent
 2809 Antenatal
 2541 Neonatal
  413 Neonatal/infancy
   17 Variable
    1 Adolescence / Young adulthood

prevalence:

19521 1-9 / 100 000
 9139 <1 / 1 000 000
 7402 1-9 / 1 000 000
 5625 Hereditary breast and ovarian cancer (HBOC) resulting from mutations in BRCA1 and BRCA2 is the most common form of both hereditary breast and ovarian cancers and occurs in all ethnic and racial populations. The overall prevalence of BRCA1/2 mutations is estimated to be from 1:400 to 1:800 [Ford et al 1994, Claus et al 1996, Whittemore et al 1997], but varies depending on ethnicity.
 4084 1-5 / 10 000
 1981 http://www.ncbi.nlm.nih.gov/books/NBK1247/
 1184 1:3200
 1184 1 in 2000-4000 depending on the population studied.
  598 2.29 to 3.2 per 100,000 individuals
  457 1:3,000
  382 >1 / 1000
  335 Rett syndrome is an X linked condition that occurs in 1 in 10,000 to 1 in 15,000 live births.
  273 Cornelia de Lange syndrome occurs in 1 in 10-100,000 live births.
  273 CdLS occurs in 1 in 10-100,000 live births.
  226 1 per million
  211 Sotos syndrome occurs in 1 in 14,000 live births.
  211 Sotos syndrome is an autosomal dominant condition that occurs in 1 in 14,000 live births.
  198 1 in 5,000 male live births
  188 The prevalence of MEN 2 has been estimated at 1:35,000 [DeLellis et al 2004].
  188 1/35000
  174 The prevalence of AS is one in 12,000-20,000 population.
  174 The prevalence of AS is one in 12,000-20,000 population
  150 Occurs in 1 in approximately 32,000 live births
...
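The prevalence column mixes short numeric classes such as `1-9 / 100 000` with free-text sentences and URLs. A rough sketch of telling the two apart is given here; the pattern and the name `is_ratio_class` are assumptions made for illustration, not part of the conversion scripts.

```python
import re

# Matches values such as "<1 / 1 000 000", "1-9 / 100 000" or ">1 / 1000":
# an optional < or >, a number or range, a slash, then a denominator whose
# digits may be grouped with spaces. Compact forms like "1/35000" also match.
RATIO_CLASS = re.compile(r"[<>]?\s*\d+(?:-\d+)?\s*/\s*\d[\d ]*")


def is_ratio_class(value):
    """Return True if the whole value looks like a numeric 'N / M' class."""
    return RATIO_CLASS.fullmatch(value.strip()) is not None
```

Values written with a colon (`1:3200`), in words (`1 per million`), or as full sentences are deliberately left unmatched and would need separate handling.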

disease_mechanism: 13% of variants have this populated. 287 genes have 'loss of function' variants and 42 genes have 'gain of function' variants.

18052 loss of function
  762 Disease mechanisms vary by gene.
  681 gain of function
  155 Fabry disease is due to inactivating mutations in the X-linked GLA gene resulting in deficiency of the enzyme Alpha Galactosidase-A.;loss of function
   34 Affects gamma-sarcoglycan and also disrupts the integrity of the entire sarcoglycan complex.
   18 May be benign
   17 unknown
   12 Other
    2 gain of function;loss of function
    2 Disease mechanisms vary by gene.;loss of function
    1 Dominant Negative
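Counts and "populated" percentages like the ones above could be recomputed along these lines. This is a sketch under stated assumptions, not the method actually used for this PR: rows are dicts keyed by column name, an empty string means "not populated", and the helper name `summarize_column` is hypothetical.

```python
from collections import Counter


def summarize_column(rows, column):
    """Count each non-empty value of `column` and the fraction of rows populated."""
    counts = Counter()
    total = 0
    for row in rows:
        total += 1
        value = (row.get(column) or "").strip()
        if value:
            # The whole field is counted as one value, so a multi-valued entry
            # like "gain of function;loss of function" gets its own line.
            counts[value] += 1
    populated = sum(counts.values())
    fraction = populated / total if total else 0.0
    return counts, fraction
```

`counts.most_common()` then yields the values in descending order of frequency, matching the layout of the lists above.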