monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Update HPOA ingest for new file format #573

Closed kshefchek closed 5 years ago

kshefchek commented 6 years ago

When the new HPO file format goes live we will need to update our parser.

pnrobinson commented 6 years ago

It is live; see phenotype.hpoa at http://compbio.charite.de/jenkins/job/hpo.annotations.2018/

pnrobinson commented 6 years ago

Note that there is also an improved parser for this format in phenol

kshefchek commented 6 years ago

It would be great to reuse phenol here. Is it too much to ask for the parser to output the RDF model needed for our pipeline? I could send over a sample that approximates the modeling, but it might not include every edge case. The full file is here: https://data.monarchinitiative.org/ttl/hpoa.ttl cc @yy20716

cmungall commented 6 years ago

There is a lot of framework in place for doing this in ontobio, but the advantage of doing it in phenol is that it keeps the RDF modeling well-coupled to the reference Java object model.

pnrobinson commented 6 years ago

@yy20716 @cmungall Is this something we want to do within phenol? Should there be a module of phenol that would use Jena (or something like that) as an adapter? Or should this be a separate app? It should not be hard to do. Note also, @kshefchek, that the improved annotation format carries additional information, so we should also update the schema.

cmungall commented 6 years ago

I think a separate module makes sense. This could be phenol-rdf in the main phenol repo - or an entirely separate library. There may be some advantages in having it more closely integrated. For example, it would be possible to synchronize the in-memory associations with a jena model and then do useful SPARQL queries. It might also be useful to have a little SPARQL form embedded in tools like PhenoteFX for power users.

yy20716 commented 6 years ago

@kshefchek Sorry for the late response. I plan to extend phenol's io package as requested, so that it reads the phenotype.hpoa file and produces hpoa.ttl (like the one in https://data.monarchinitiative.org/ttl). The problem is that it is still not clear to me how the internal graph in phenol should be mapped to hpoa.ttl. I asked this question in Gitter and Tom suggested that I check

but I guess the problem may be that I am not very familiar with Dipper's internals and overall flow. In README.md, I see a link to "best-practices documentation for details on writing new Source parsers using Dipper code ...", but the link seems to be dead. If you don't mind, could you please point me to some documents that I can check for this task? If possible, sharing any sample files would be helpful as well. Thank you.
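
For orientation, the conversion discussed here (one disease-to-phenotype annotation in, RDF out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an OBAN-style reified association, assumes RO:0002200 ("has phenotype") as the relation, assumes the IRI prefixes shown, and the helper names `curie_to_iri` and `association_triples` are hypothetical. It is not Dipper's or phenol's actual code, and it does not reproduce the full hpoa.ttl modeling.

```python
# Sketch only: emit N-Triples for one disease-phenotype association using
# an OBAN-style reified association. All names and IRIs below are
# illustrative assumptions, not Dipper's or phenol's API.
import hashlib

OBAN = "http://purl.org/oban/"
OBO = "http://purl.obolibrary.org/obo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HAS_PHENOTYPE = OBO + "RO_0002200"  # assumed relation: "has phenotype"

# Assumed CURIE prefix expansions; a real pipeline would load a curie map.
PREFIXES = {"HP": OBO + "HP_", "OMIM": "https://omim.org/entry/"}


def curie_to_iri(curie):
    """Expand a CURIE such as 'HP:0001166' into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIXES[prefix] + local_id


def association_triples(disease_curie, phenotype_curie):
    """Return N-Triples lines for one association plus the direct edge."""
    subj = curie_to_iri(disease_curie)
    obj = curie_to_iri(phenotype_curie)
    # Deterministic blank-node label so repeated runs give the same output.
    key = (subj + HAS_PHENOTYPE + obj).encode("utf-8")
    assoc = "_:assoc" + hashlib.sha1(key).hexdigest()[:16]
    return [
        f"{assoc} <{RDF_TYPE}> <{OBAN}association> .",
        f"{assoc} <{OBAN}association_has_subject> <{subj}> .",
        f"{assoc} <{OBAN}association_has_predicate> <{HAS_PHENOTYPE}> .",
        f"{assoc} <{OBAN}association_has_object> <{obj}> .",
        f"<{subj}> <{HAS_PHENOTYPE}> <{obj}> .",
    ]
```

Calling `association_triples("OMIM:154700", "HP:0001166")` returns five lines that can be written to a `.nt` file. Evidence codes, references, frequency and onset would need further triples attached to the association node, which this sketch leaves out.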

TomConlin commented 6 years ago

To be clear, Matt's ingest artifacts repo is https://github.com/monarch-initiative/ingest-artifacts and contains concept maps of ontological intent.

The graphviz reports generated by me at https://data.monarchinitiative.org/*/dot/*gv
are derived from the dipper RDF output and represent a slice of observed reality.

yy20716 commented 6 years ago

Tom helped me understand the overall flow of Dipper. Since then, I have been checking the HPO annotation file format manuals and HPOAnnotations.py for this issue. I have some questions and would greatly appreciate it if you could clarify them.

  1. If I understood correctly, Dipper's code in HPOAnnotations.py does not produce RDF triples directly but through the functions under Dipper/Model, which generate the triples based on OBAN's model. In phenol, HpoDiseaseAnnotationParser's parse method returns maps that contain HpoDisease instances, but these instances are not currently used inside phenol (except in the code for parsing phenotype.hpoa). My understanding is that, as in Dipper, these instances eventually need to be transformed into triples based on OBAN's model for this task. I guess we have a number of options here:
  2. In the HPO annotation file formats, there are four possible values for the aspect, but only the description of M differs: M means Mortality/Aging in the previous version, while M means the Clinical Modifier subontology in the current version. In Dipper's code, each aspect is currently mapped differently, so I wonder whether the two M values are different things. @pnrobinson, @TomConlin, @kshefchek, if they are different, can I ask your opinion on how the new M should be handled (or mapped into RDF graphs)?

  3. I would also like to know how to map the newly added field 'Sex'. It seems that phenol currently has no code that handles this field in HpoDiseaseAnnotationParser.java. Is it okay to ignore this field for now?
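
To make the aspect and Sex questions above concrete, here is a minimal Python sketch that reads rows in the new format, branches on the aspect column, and carries the Sex value along instead of dropping it. Assumptions: the 12-column order and the aspect meanings for P, I and C are taken from the published format description and should be verified against the current spec (only the new meaning of M is stated in this thread); `COLUMNS`, `ASPECT_LABELS` and `parse_hpoa` are hypothetical names, not part of Dipper or phenol.

```python
# Sketch only: parse phenotype.hpoa-style rows with the standard library.
import csv

# Assumed column order of the new format (verify against the spec).
COLUMNS = [
    "DatabaseID", "DiseaseName", "Qualifier", "HPO_ID", "Reference",
    "Evidence", "Onset", "Frequency", "Sex", "Modifier", "Aspect",
    "Biocuration",
]

# Assumed aspect codes; M is the Clinical Modifier subontology in the
# new format rather than Mortality/Aging.
ASPECT_LABELS = {
    "P": "phenotypic abnormality",
    "I": "inheritance",
    "C": "clinical course",
    "M": "clinical modifier",
}


def parse_hpoa(lines):
    """Yield one dict per annotation row from an iterable of text lines.

    Lines starting with '#' (assumed to be comments or the header) and
    blank lines are skipped.
    """
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        if row["Aspect"] not in ASPECT_LABELS:
            raise ValueError(f"unknown aspect: {row['Aspect']!r}")
        row["AspectLabel"] = ASPECT_LABELS[row["Aspect"]]
        # Keep Sex even if nothing downstream models it yet; empty -> None.
        row["Sex"] = row["Sex"].upper() or None
        yield row
```

Passing an open file handle works, for example `for row in parse_hpoa(open("phenotype.hpoa", encoding="utf-8"))`. Failing loudly on an unknown aspect, rather than silently reusing an old mapping, is one way to surface the change in the meaning of M during an ingest.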

Thank you.

pnrobinson commented 6 years ago

@yy20716 afaik the Dipper code is out of date with respect to the new annotation model. It would be good to make things consistent across Monarch, and so maybe we can meet with the Dipper team and go through the new data model. We will need to implement some additional items in the Java code to take the improvements of the recent switch into account, so unfortunately, even the Java code is out of date...

kshefchek commented 5 years ago

Fixed with https://github.com/monarch-initiative/dipper/pull/671