monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

need pubmed in SciGraph? #274

Open mellybelly opened 8 years ago

mellybelly commented 8 years ago

We need to be able to query all titles and abstracts in PubMed using an ontology. @tudorgroza has something similar working for phenotype profiles.

I know PubMed is too big to put into our existing SciGraph instance. Perhaps we should have a second instance just for these sorts of things?

@jnguyenx @nlwashington @DavidEichmann @cmungall

The ontologies can be NCBITaxon as per recent discussions, but this would really be useful for many use cases (e.g. find me all papers about neural crest (Uberon) and coat color (OBA)).
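
The kind of query described here (papers that mention two ontology terms at once) can be sketched as a set intersection over paper-to-term annotations. This is only an illustration: the function name `papers_mentioning_all` is hypothetical, every identifier in the sample data is a placeholder rather than a real PMID or term ID, and a real ontology query would also expand each term to its subclasses, which is omitted here.

```python
from collections import defaultdict

# Hypothetical annotations: (paper, ontology term) pairs.
# All identifiers are placeholders, not real PMIDs or term IDs.
annotations = [
    ("PMID:0000001", "UBERON:0000001"),  # stands in for "neural crest"
    ("PMID:0000001", "OBA:0000001"),     # stands in for "coat color"
    ("PMID:0000002", "UBERON:0000001"),
    ("PMID:0000003", "OBA:0000001"),
]


def papers_mentioning_all(annotations, terms):
    """Return the set of papers annotated with every term in `terms`."""
    # Build an inverted index: term -> set of papers that mention it.
    index = defaultdict(set)
    for paper, term in annotations:
        index[term].add(paper)

    term_sets = [index.get(term, set()) for term in terms]
    if not term_sets:
        return set()
    # A paper qualifies only if it appears under every requested term.
    return set.intersection(*term_sets)


hits = papers_mentioning_all(annotations, ["UBERON:0000001", "OBA:0000001"])
print(sorted(hits))
```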

cmungall commented 8 years ago

Maybe warehousing in our own system is not the way to go. Let's list the kinds of queries we'd want to run on this, then figure out the data architecture.

kshefchek commented 8 years ago

Per discussion on dev call:

  1. Idea would be to run the SciGraph annotation service on the FTP dump of titles and abstracts, using the triple: PMID:1234 IAO_0000142 UBERON:1234
  2. Load these into a SciGraph instance along with the ontologies
  3. May be faster to write this in Java/Scala and connect directly to SciGraph (using the Java API) rather than going through the HTTP REST annotation service; will benchmark.
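
A minimal sketch of steps 1 and 2, assuming the annotator is supplied by the caller as a plain function from text to a list of term CURIEs. The `annotate` callable, the helper names, and the prefix expansions in `PREFIXES` are assumptions made for illustration; nothing here calls the actual SciGraph annotation service or its Java API.

```python
# Assumed CURIE prefix expansions; adjust to the curie map actually in use.
PREFIXES = {
    "PMID": "http://www.ncbi.nlm.nih.gov/pubmed/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "IAO": "http://purl.obolibrary.org/obo/IAO_",
}

# The predicate from the triple pattern above, written as a CURIE.
MENTIONS = "IAO:0000142"


def expand(curie):
    """Expand a CURIE such as 'UBERON:1234' into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIXES[prefix] + local_id


def mention_triple(pmid, term):
    """Format one 'paper, predicate, term' statement as an N-Triples line."""
    return "<{}> <{}> <{}> .".format(expand(pmid), expand(MENTIONS), expand(term))


def annotate_to_ntriples(records, annotate):
    """Yield one N-Triples line per (paper, term) annotation.

    `records` is an iterable of (pmid, text) pairs; `annotate` is a
    hypothetical stand-in for whichever annotator ends up being used.
    """
    for pmid, text in records:
        for term in annotate(text):
            yield mention_triple(pmid, term)


if __name__ == "__main__":
    # Toy annotator that always returns the placeholder term from the thread.
    fake_annotate = lambda text: ["UBERON:1234"]
    for line in annotate_to_ntriples([("PMID:1234", "some abstract")], fake_annotate):
        print(line)
```

Keeping the annotator as a parameter means the same loop could be pointed at either the REST service or a direct API binding once the benchmark is done.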

cmungall commented 8 years ago

On 29 Feb 2016, at 11:31, Kent Shefchek wrote:

Per discussion on dev call:

  1. Idea would be to run the SciGraph annotation service on the FTP dump of titles and abstracts, using the triple: PMID:1234 IAO_0000142 UBERON:1234

You may want to model the full span.

There are some RDF standards for this.

But we don't want to over-model. The simple triple model above is probably fine for a first pass. Being able to detect which groups of terms are near each other (implicit post-compositions) will be useful further down the line.
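
As a rough illustration of what modeling the span buys, the sketch below keeps character offsets for each annotation and reports pairs of different terms that occur close together in the same paper. The `Mention` class, the `nearby_pairs` function, the `max_gap` threshold, and all sample identifiers and offsets are hypothetical; this does not follow any particular RDF annotation standard.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One annotation with its character span in the title or abstract."""
    pmid: str
    term: str
    start: int  # offset of the first character of the matched text
    end: int    # offset just past the last character


def nearby_pairs(mentions, max_gap=20):
    """Return (term_a, term_b) pairs found within `max_gap` characters.

    Only mentions from the same paper are compared; overlapping spans
    give a negative gap and therefore also count as near.
    """
    ordered = sorted(mentions, key=lambda m: (m.pmid, m.start))
    pairs = []
    for a, b in combinations(ordered, 2):
        if a.pmid != b.pmid or a.term == b.term:
            continue
        # `a` starts no later than `b`, so this is the gap between them.
        if b.start - a.end <= max_gap:
            pairs.append((a.term, b.term))
    return pairs


mentions = [
    Mention("PMID:1234", "UBERON:1234", 0, 12),
    Mention("PMID:1234", "OBA:1234", 20, 30),
    Mention("PMID:1234", "UBERON:5678", 200, 210),
]
print(nearby_pairs(mentions))
```

The pairwise comparison is quadratic in the number of mentions per run, which is fine for a single abstract but would need a sliding window for anything larger.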

  2. Load these into a SciGraph instance along with the ontologies

I had imagined two passes. First, set up a pure ontology instance that you can hammer for text-mining (TM) purposes. Maybe even load some ontologies we don't want in the main SG-ont instance, e.g. SNOMED.

The resulting triples would go onto disk, where they could be analyzed by themselves or loaded into the main SG-data instance as Just Another annotation source.

  3. May be faster to write this in Java/Scala and connect directly to SciGraph (using the Java API) rather than going through the HTTP REST annotation service; will benchmark.

Yep

