add common disease-phenotype associations

nlwashington commented 9 years ago

we need to load in the common disease associations.

the repo is here: https://github.com/monarch-initiative/hpo-annotation-data

@drseb this seems to be a mixture of all the diseases. will you be splitting these up between rare and common subdirs?

shall i go straight from these files, or will there be a downstream job (as there is/was for the rare-diseases), that aggregates all the annotations into a single file? how does this align/integrate with the hudson job?

pnrobinson commented 9 years ago

Hi everybody, I think we should definitely keep the rare and common separate for now, at least until we have improved the common disease annotations to the level of the rare ones!


nlwashington commented 8 years ago

do we want this for the TG release, @cmungall ?

jmcmurry commented 8 years ago

When this ticket is closed, we should revisit the (provisionally-closed) https://github.com/monarch-initiative/monarch-app/issues/294 to make sure that changes propagate all the way to UI.

cmungall commented 8 years ago

Yes

nlwashington commented 8 years ago

if this is going to go into the TG release, i need this data to be available in a reasonably-stable format (as in num columns/file format) so I can write a parser. I suggest I need to have it by the end of October. i personally don't care if the data itself is changing. can @tudorgroza or @drseb give me an estimated date to get started working on it?

drseb commented 8 years ago

For now I would suggest to simply parse the annotation-data as it is.

When Tudor re-generates the annotation-data we can use it as suggestions for the manual annotation. I don't have the feeling (correct me if I'm wrong @tudorgroza ) that Tudor will soon provide us an annotation-dataset that can completely replace the old annotation-dataset.

The hudson-job for the rare-diseases has several tasks:

map alt_ids to primary_ids
remove duplicated annotations (the definition of duplicated is a bit fuzzy)
throw an exception if obsoleted classes have been used
...

I can generate a similar job for the common diseases, but I have to teach this week and next.
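
The job tasks listed above can be sketched roughly as follows. This is only an illustration, not the actual Hudson job: the function name, the `(disease_id, phenotype_id)` row shape, and the `alt_to_primary` / `obsolete_ids` inputs are all assumptions, and "duplicate" is taken here to mean an exact repeat of the same pair after ID mapping, which is narrower than the fuzzy definition mentioned.

```python
class ObsoleteClassError(ValueError):
    """Raised when an annotation refers to an obsoleted ontology class."""


def clean_annotations(rows, alt_to_primary, obsolete_ids):
    """Clean (disease_id, phenotype_id) pairs; all names here are hypothetical.

    rows           -- iterable of (disease_id, phenotype_id) tuples
    alt_to_primary -- dict mapping alternative IDs to their primary IDs
    obsolete_ids   -- set of IDs that belong to obsoleted classes
    """
    seen = set()
    cleaned = []
    for disease_id, phenotype_id in rows:
        # Task 1: map alt_ids to primary ids (IDs without a mapping stay as-is).
        phenotype_id = alt_to_primary.get(phenotype_id, phenotype_id)

        # Task 3: fail loudly if an obsoleted class has been used.
        if phenotype_id in obsolete_ids:
            raise ObsoleteClassError(
                f"{disease_id} is annotated with obsolete class {phenotype_id}"
            )

        # Task 2: drop duplicates (assumed: exact repeats after ID mapping).
        key = (disease_id, phenotype_id)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

Mapping IDs before the duplicate check matters: two rows that differ only in using an alt_id versus the primary ID collapse into one annotation.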

tudorgroza commented 8 years ago

My understanding - from the last email thread about this - was that we go ahead and ingest the old data till I get my head around what's wrong with the new data. Ideally, it should be done in a way in which we can easily replace it.

The format is the one we've used to publish the data on the pubmed browser site - unless you want me to re-format it. Currently the columns are:

MeSH ID
MeSH Label
DOID
HPO ID
HPO Label
TF-IDF Score
Total number of publications
5 random PMIDS containing the association.

Let me know if you want me to re-format it / split it / do something else with it.
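
For anyone writing a parser against the column list above, a minimal sketch might look like this. It assumes a tab-separated file without a header row, and it assumes the PMIDs arrive as the trailing fields after the seventh column; neither detail is stated above, so adjust as needed. `Association` and `parse_associations` are made-up names.

```python
import csv
from typing import Iterator, NamedTuple, Tuple


class Association(NamedTuple):
    """One row of the file, with field names taken from the column list."""
    mesh_id: str
    mesh_label: str
    doid: str
    hpo_id: str
    hpo_label: str
    tfidf_score: float
    publication_count: int
    pmids: Tuple[str, ...]


def parse_associations(handle) -> Iterator[Association]:
    """Yield one Association per data row of an open text file.

    Assumption: tab-delimited, no header, PMIDs in the fields after column 7.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        yield Association(
            mesh_id=fields[0],
            mesh_label=fields[1],
            doid=fields[2],
            hpo_id=fields[3],
            hpo_label=fields[4],
            tfidf_score=float(fields[5]),
            publication_count=int(fields[6]),
            pmids=tuple(f for f in fields[7:] if f),
        )
```

Keeping the column names in one `NamedTuple` means a later change in column order or count only touches this one place.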

drseb commented 8 years ago

Ok. Then let's decide now that we take the old data and start re-curating it.

The annotation-files in hpo-annotation-data/common-diseases are in Phenote-format. This is the same as the rare-diseases. Should we transfer this to the HPO-format by means of a Hudson/Jenkins-job?

mellybelly commented 8 years ago

not to complicate things, but would like to converge on common format for all our annotations that can easily be ingested into PhenoTua. @cmungall please comment

drseb commented 8 years ago

Not sure what you mean @mellybelly : common and rare diseases annotations are already in the same format.

cmungall commented 8 years ago

Multiple formats are fine if they conform to the same model. Sometimes TSVs are the most convenient, but they are inherently the most limited.

nlwashington commented 8 years ago

@tudorgroza, i notice that in all files the following columns are empty: Gene ID, Gene Name, Frequency, Sex ID, Sex Name, Negation ID, Negation Name. Is that expected?

nlwashington commented 8 years ago

i will clean it up in my parser, but we notice that all the doids are zero-padded (which is not actually correct; only some should be), and they are prefixed with DOID-DOID: instead of just DOID:. best if they're cleaned upstream (see https://github.com/monarch-initiative/hpo-annotation-data/issues/84 and https://github.com/monarch-initiative/hpo-annotation-data/issues/85).
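
a rough sketch of what that parser-side cleanup could look like (not the actual dipper code; `normalize_doid` and `known_ids` are hypothetical names). the doubled prefix can be fixed mechanically, but whether the zero-padding is right can't be decided from the string alone, so this assumes a set of valid DOIDs loaded from the ontology is available to check against:

```python
import re

# Matches "DOID:123" as well as the doubled "DOID-DOID:123" form.
_DOID_PATTERN = re.compile(r"(?:DOID-)?DOID:(\d+)")


def normalize_doid(raw, known_ids=None):
    """Return a cleaned DOID CURIE; anything unrecognized is returned unchanged.

    known_ids is an optional set of valid DOIDs (assumed to come from the
    ontology). Without it, only the prefix is repaired.
    """
    match = _DOID_PATTERN.fullmatch(raw.strip())
    if match is None:
        return raw
    digits = match.group(1)
    padded = "DOID:" + digits
    if known_ids is None or padded in known_ids:
        return padded
    # The padded form is unknown: try the form without leading zeros.
    unpadded = "DOID:" + (digits.lstrip("0") or "0")
    return unpadded if unpadded in known_ids else padded
```

when neither form is in `known_ids`, the padded ID is kept rather than guessed at, so such rows can still be flagged downstream.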

drseb commented 8 years ago

I put the data there. They are in the same format as the rare disease files, so that they can be edited in Phenote. Tudor's format can be downloaded from the pubmed-browser.

nlwashington commented 8 years ago

is the data in http://pubmed-browser.human-phenotype-ontology.org/ the same as in https://github.com/monarch-initiative/hpo-annotation-data/common-diseases ?

nlwashington commented 8 years ago

this will be deployed on bamboo as soon as the GitPython library is added (requires crbs support, and is thus blocked [JIRA ticket]).

pnrobinson commented 8 years ago

It should be the same...

nlwashington commented 8 years ago

i've now finished adding the common annotations. we should be able to view them after the next graph load. because these are both part of the "HPO" source, they will show up with the HPO icon. if that isn't desirable, please open a new ticket to let me know what the correct source name and icon should be.

monarch-initiative / dipper

add common disease-phenotype associations #116