monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add omim #18

nlwashington closed 8 years ago

nlwashington commented 9 years ago

add omim data source, esp to get classes, labels, and descriptions.

nlwashington commented 9 years ago

initial commit (just gets files) with 0ae696e

nlwashington commented 9 years ago

i've made an initial commit for this source, here: c863aa9. it will dump id, label, and description, as well as mark equivalent (moved) and deprecated (removed) classes.

nlwashington commented 9 years ago

some examples of things i saw, but didn’t do yet are:

  1. reformat the label… this might include removing the screaming ALL CAPS titles to be more like title case
  2. add synonyms

nlwashington commented 9 years ago

processing this source requires some REST calls to their server. at the moment we do not cache the raw data (json) from those calls, but we should. (see related ticket #28.)
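A minimal sketch of what caching the raw JSON could look like, assuming one file per request URL in a local directory. The function name `fetch_json_cached`, the `cache_dir` layout, and the injectable `fetch` callable are all hypothetical; this is not dipper's actual code, and it does not model the OMIM API itself.

```python
import hashlib
import json
import os
import urllib.request


def fetch_json_cached(url, cache_dir, fetch=None):
    """Return parsed JSON for url, reusing a copy in cache_dir when one exists.

    fetch is an optional callable (url -> str), a hypothetical hook that
    lets callers or tests replace the real HTTP request.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so query strings become safe, fixed-length file names.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")

    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)

    if fetch is None:
        with urllib.request.urlopen(url) as response:
            raw = response.read().decode("utf-8")
    else:
        raw = fetch(url)

    # Parse before writing so an invalid response is never cached.
    data = json.loads(raw)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(raw)
    return data
```

Keeping the raw response text on disk (rather than the parsed object) means a later change to the parsing code can be re-run without calling the server again.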

nlwashington commented 9 years ago

as suggested by @mellybelly, @mbrush can use the output of this file to load into DO for definitions of the omim diseases. please update this ticket with changes that you'd like on a file-wide basis.

nlwashington commented 9 years ago

@cmungall here do you want the pithy disease description to be a dc:description (that's what i have them now), or do you want them to be iao:0000115 definition or http://www.w3.org/2009/08/skos-reference/skos.html#definition? right now scigraph only makes the 'definitions' available in the vocab services, but they ought to be in the graph service. or should it be both a description and definition?

for example:

    OBO:OMIM_105150 a owl:Class ;
        rdfs:label "CEREBRAL AMYLOID ANGIOPATHY, CST3-RELATED" ;
        dc:description "Cerebral amyloid angiopathy (CAA), defined by the deposition of congophilic material in the vessels of the cortex and leptomeninges, is a major cause of intracerebral hemorrhage in the elderly ..." .

cmungall commented 9 years ago

Let's go with definition

Even though it's more of a description than a definition, various points in the consumer chain like definitions (SG, DO editors viewing in Protege).

cmungall commented 9 years ago

Some of the descriptions seem to have excessive quoting e.g. OBO:OMIM_101200; not a big deal, FYI

And can we reinsert the code to make the labels a bit friendlier?

nlwashington commented 9 years ago

@cmungall or @pnrobinson, for something that is a susceptibility locus, like: http://omim.org/entry/607339 my automated pipeline dumbly creates, OMIM:607339 has_phenotype OMIM:607339 because this is what the morbidmap file says.

should this omim identifier just be considered a locus, and then assume that it will be mapped to the proper HPO phenotype elsewhere, or should it be considered a "disease" too, and listed as-is, where it can be defined as both a disease and a genomic locus? it would be easy for me to filter out those items where gene_id == phenotype_id when processing the morbidmap, if it is confusing.
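The filter mentioned above might look roughly like this. It assumes the morbidmap rows have already been parsed into `(gene_id, phenotype_id)` pairs; that tuple shape and the function name `split_self_associations` are hypothetical, not the real file layout or dipper's code.

```python
def split_self_associations(rows):
    """Separate (gene_id, phenotype_id) pairs whose two ids are identical.

    Returns (kept, skipped) so the skipped loci can still be reviewed
    instead of silently disappearing.
    """
    kept, skipped = [], []
    for gene_id, phenotype_id in rows:
        if gene_id == phenotype_id:
            skipped.append((gene_id, phenotype_id))
        else:
            kept.append((gene_id, phenotype_id))
    return kept, skipped


# "607339" is the susceptibility locus from this comment; the other ids
# are placeholders, not real OMIM numbers.
rows = [("607339", "607339"), ("GENE_A", "PHENO_B")]
kept, skipped = split_self_associations(rows)
```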

pnrobinson commented 9 years ago

I think we should skype about this, it is too complicated for email, but I think there is a solution! I am available any time next week for instance. -Peter

nlwashington commented 9 years ago

for the labels, i'm making the following modifications:

  1. remove the abbreviation suffixes, make these synonyms
  2. convert the roman numerals to integer numbers
  3. make the text title case, except for conjunctions/prepositions/articles

here's some examples:

    MUCOPOLYSACCHARIDOSIS, TYPE IIIA; MPS3A --> Mucopolysaccharidosis, Type 3A
    MUCOPOLYSACCHARIDOSIS, TYPE VII; MPS7 --> Mucopolysaccharidosis, Type 7
    MOYAMOYA DISEASE 1; MYMY1 --> Moyamoya Disease 1
    MUCOLIPIDOSIS III GAMMA --> Mucolipidosis 3 Gamma
    MUCOPOLYSACCHARIDOSES, UNCLASSIFIED TYPES --> Mucopolysaccharidoses, Unclassified Types
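The three label steps can be sketched as below. This is an illustration only, not the code in the commit: the names `normalize_label`, `SMALL_WORDS`, and `ROMAN_TOKEN` are hypothetical, the small-word list is not exhaustive, and the roman-numeral pattern assumes numerals up to 39 with at most one trailing subtype letter A to H.

```python
import re

# Words left lowercase unless they start the label (illustrative list).
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in",
               "of", "on", "or", "the", "to", "with", "without"}

# Roman numeral (1-39), optionally followed by one subtype letter, e.g. IIIA.
ROMAN_TOKEN = re.compile(r"^(X{0,3})(IX|IV|V?I{0,3})([A-H]?)$")
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral):
    total = 0
    for char, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[char]
        # Subtractive notation: a smaller value before a larger one (IV, IX).
        if nxt in ROMAN_VALUES and ROMAN_VALUES[nxt] > value:
            total -= value
        else:
            total += value
    return total


def format_token(token, is_first):
    core = token.rstrip(",.")
    tail = token[len(core):]
    match = ROMAN_TOKEN.match(core)
    # Require a non-empty numeral so a lone letter such as "B" is untouched.
    if match and (match.group(1) or match.group(2)):
        number = roman_to_int(match.group(1) + match.group(2))
        return str(number) + match.group(3) + tail
    if core.lower() in SMALL_WORDS and not is_first:
        return core.lower() + tail
    return core.capitalize() + tail


def normalize_label(raw_title):
    """Return (label, synonyms) for a raw 'TITLE; ABBREV' style string."""
    parts = [part.strip() for part in raw_title.split(";") if part.strip()]
    if not parts:
        return "", []
    tokens = parts[0].split()
    label = " ".join(format_token(tok, i == 0) for i, tok in enumerate(tokens))
    # Everything after the first ';' is kept as a synonym, unchanged.
    return label, parts[1:]
```

One known weakness of a pattern like this: a standalone `X` or `I` that is not a numeral (a chromosome name, for instance) would also be converted, so a real implementation would likely need an exception list.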

nlwashington commented 9 years ago

@cmungall, the turtle syntax definition is: "Literals are written either using double-quotes when they do not contain linebreaks like "simple literal" or """long literal""" when they may contain linebreaks. "

so the extra quoting in the definitions reflects this. do you want me to remove the linebreaks instead (and replace them with some kind of separator)?
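To make the two options concrete, here is a small sketch that writes a string either as a short or a long Turtle literal depending on whether it contains linebreaks, with an optional flag that collapses the linebreaks first. The function name `turtle_literal` and its `flatten`/`separator` parameters are hypothetical; the actual serialization in the pipeline may be handled by a library instead.

```python
def turtle_literal(text, flatten=False, separator=" "):
    """Quote text as a Turtle string literal.

    With flatten=True, linebreaks are replaced by separator so the
    short "..." form can always be used.
    """
    if flatten:
        lines = [line.strip() for line in text.splitlines()]
        text = separator.join(line for line in lines if line)
    # Escape backslashes first, then double quotes; both escapes are
    # valid in the short and the long form.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    if "\n" in escaped or "\r" in escaped:
        return '"""' + escaped + '"""'
    return '"' + escaped + '"'
```

For instance, `turtle_literal("line one\nline two")` keeps the linebreak and uses triple quotes, while passing `flatten=True` yields a single-line literal.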

nlwashington commented 9 years ago

@pnrobinson i've added a request for a new relationship type that we'd be able to use here (and for gwas data, etc.). see: https://code.google.com/p/obo-relations/issues/detail?id=31 please comment with any additional requirements, use cases, disambiguations, etc.

nlwashington commented 9 years ago

@cmungall for omim variants, they are usually referred to in descriptions with an id like 157140.0009, but resolve to http://omim.org/entry/157140#0009. for omim disease URI, we actually use the OMIM purl. what URI should i use for the variants?
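For reference, the resolution pattern described above (dotted id to entry page plus fragment) can be written as a small helper. This only reproduces the omim.org URL form quoted in this comment; it does not settle which URI should be used for variants. The name `omim_variant_url` and the assumption that the suffix is always four digits are mine.

```python
import re

# Assumed shape: entry number, a dot, then a four-digit variant suffix.
VARIANT_ID = re.compile(r"^(\d+)\.(\d{4})$")


def omim_variant_url(variant_id):
    """Turn an id like '157140.0009' into the omim.org entry URL with fragment."""
    match = VARIANT_ID.match(variant_id)
    if match is None:
        raise ValueError("not an OMIM-style variant id: %r" % variant_id)
    return "http://omim.org/entry/{}#{}".format(match.group(1), match.group(2))
```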

cmungall commented 9 years ago

.

nlwashington commented 9 years ago

we can get the omim variants straight from clinvar, so i will punt on pulling the variant info here. edit: we have to at least get the variant labels here for the omim-style variants as they are the authoritative source.

nlwashington commented 9 years ago

Association structure along with variants and links to publications are drawn:

[image: screen shot 2015-06-26 at 4 36 49 pm]

nlwashington commented 9 years ago

@mbrush please review and close if satisfied.