monarch-initiative / helpdesk

The Monarch Initiative Helpdesk

Ontology for Gene Expression #63

Closed ahosseinian closed 2 years ago

ahosseinian commented 2 years ago

Hello,

I want to develop a GDBMS for gene expression using in-house data. I want to make it compatible with Monarch to be able to use Monarch's API to pull more data. I have some questions and I'd really appreciate it if you can answer them for me.

  1. What ontology does Monarch use for gene expression data?
  2. Do you know of any previous work on this that you can refer me to?
  3. Is there a tutorial for setting up Monarch's ontology on a local system?

Thanks!

matentzn commented 2 years ago

Hey @ahosseinian! Thank you for your interest.

I am from the ontology side of things at Monarch and not the greatest expert in gene expression data (I am more involved in the gene-disease, gene-phenotype, disease-phenotype branch of Monarch), but while we get others to chime in here (@kevinschaper?), I have a few answers/questions for you:

  1. There are many aspects to every association type in Monarch. While I do not know the extent of all ontologies needed for modelling gene expression data, here are some important ones:
  2. There is a lot, but I am not the right person to answer that question.
  3. Can you explain what you mean here? Do you mean a tutorial on
    • How to integrate Phenio in a local tool stack (which python libraries to use?)
    • How to link your locally available data to Phenio?
    • Opening and browsing Phenio?

Can you explain a bit more about what kind of data you have (which species, what the data looks like) and the use cases you wish to support?

ahosseinian commented 2 years ago

Hi @matentzn, thank you for your reply!

I appreciate the resources you mentioned. Fortunately, I was able to find a solution to my problem. Another question came up though, and I hope you can answer it for me. I want to add all human/mouse/etc. genes to my database. I have worked with Monarch's website and know Monarch can give me that information. I was wondering if there is a tutorial on how to use Monarch's API quickly and efficiently. Is it better to clone Monarch's Biolink API or to use the API directly? Also, what's the difference between the Biolink and SciGraph APIs?

Thank you so much and sorry if I bothered you with my questions.

matentzn commented 2 years ago

@ahosseinian No bother at all, I have forwarded your question to the respective teams :)

kevinschaper commented 2 years ago

Hi @ahosseinian!

I've been working on rebuilding Monarch's ingest pipeline from the ground up, and unfortunately I have somewhat limited experience with the existing API.

There are some caveats to my recommendation, but I feel like this might be one of the most convenient ways to get genes from Monarch.

curl https://storage.googleapis.com/monarch-ingest/latest/monarch-kg.tar.gz | tar -xzO  monarch-kg_nodes.tsv | grep "^id\|biolink:Gene" > genes.tsv
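
For anyone who prefers to do the filtering in Python, here is a minimal sketch of the same idea. It assumes the nodes file is a tab-separated KGX file with `id` and `category` columns and that multi-valued fields are joined with `|`; the function name and the sample rows are made up for illustration. Unlike the `grep` above, which matches the text anywhere on a line, this checks the category column only.

```python
import csv
import io


def filter_nodes_by_category(lines, category="biolink:Gene"):
    """Yield node rows (as dicts) whose category column lists `category`.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes a tab-separated file with a header row containing "category".
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: multiple categories, if present, are joined with "|".
        categories = (row.get("category") or "").split("|")
        if category in categories:
            yield row


# Tiny made-up sample standing in for monarch-kg_nodes.tsv.
SAMPLE_NODES = (
    "id\tcategory\tname\n"
    "HGNC:0001\tbiolink:Gene\tgeneA\n"
    "MONDO:0001\tbiolink:NamedThing\tdiseaseA\n"
    "MGI:0002\tbiolink:Gene\tgeneB\n"
)

gene_rows = list(filter_nodes_by_category(io.StringIO(SAMPLE_NODES)))
gene_ids = [row["id"] for row in gene_rows]
print(gene_ids)
```

With the real file, passing `open("monarch-kg_nodes.tsv", newline="")` instead of the `io.StringIO` sample should work the same way, provided the column names match the assumption.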

We're moving to the KGX format, following the Biolink Model.

There are plenty of caveats: It's not yet served from a data.monarchinitiative.org URL, which means it isn't an ideal stable URL yet, and that could be an issue in the future. The ID choices may differ and not necessarily line up with the data in the current Monarch API. It's also a work in progress, but of anything in this graph, I'd say I'm most confident about the gene nodes.

I just looked over the API doc, and I don't see a way to paginate over all genes via the existing API. I thought that the search endpoint set to the gene category might work, but it looks like it requires a search term.

--

edit to add: you may also want to look at the edges file for gene expression rows

ahosseinian commented 2 years ago

Hi @kevinschaper,

Thank you for your reply. Would you please let me know how I can access the edge file as well as other files (like variants, phenotype, etc)? Thanks!

kevinschaper commented 2 years ago

The tar.gz file (https://storage.googleapis.com/monarch-ingest/latest/monarch-kg.tar.gz) has a nodes file containing all of the entities (genes, pathways, diseases, phenotypic features, etc.; no variants yet), and it also has an edges file that contains relationships between the nodes (phenotype, expression, gene interaction, GO annotation).
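
As a rough sketch of how the archive could be inspected from Python once it has been downloaded, the snippet below lists the members of a tar.gz and opens one of them as text using the standard `tarfile` module. The demo builds a tiny in-memory archive so it runs without a download; the helper names are hypothetical, and the edges member name `monarch-kg_edges.tsv` is an assumption made by analogy with `monarch-kg_nodes.tsv`.

```python
import io
import tarfile


def open_member_as_text(archive, member_name):
    """Return a text-mode file object for one member of an open tar archive."""
    binary = archive.extractfile(member_name)
    if binary is None:
        raise ValueError(f"{member_name} is not a regular file in the archive")
    return io.TextIOWrapper(binary, encoding="utf-8", newline="")


def build_demo_archive():
    """Build a small in-memory .tar.gz that stands in for the real download."""
    files = {
        "monarch-kg_nodes.tsv": "id\tcategory\nHGNC:0001\tbiolink:Gene\n",
        # Assumed member name; check archive.getnames() on the real file.
        "monarch-kg_edges.tsv": "subject\tpredicate\tobject\n",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as archive:
    member_names = archive.getnames()
    with open_member_as_text(archive, "monarch-kg_nodes.tsv") as handle:
        header = handle.readline().rstrip("\n").split("\t")

print(member_names)
print(header)
```

For the downloaded file, `tarfile.open("monarch-kg.tar.gz", mode="r:gz")` would replace the in-memory demo archive, and `getnames()` shows which files are actually present.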

Some aspects are definitely a bit more "under construction" than others. For example, we just changed the ontology import process, and right now all of the disease, anatomy, and phenotype ontology terms are just labelled with the broadest possible category of biolink:NamedThing - but depending on how you choose to subset, that might not be an issue, since they still have their correct ID namespace.
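
To illustrate subsetting by ID namespace rather than by category, here is a small sketch that groups node IDs by their CURIE prefix (the part before the first colon). The function names are hypothetical and the IDs are made up; which prefixes actually appear in the file is something to check against the data.

```python
from collections import defaultdict


def curie_prefix(curie):
    """Return the namespace part of a CURIE, e.g. "HP" for "HP:0000001"."""
    return curie.split(":", 1)[0]


def group_ids_by_prefix(node_ids):
    """Group node IDs by CURIE prefix, preserving input order within a group."""
    groups = defaultdict(list)
    for node_id in node_ids:
        groups[curie_prefix(node_id)].append(node_id)
    return dict(groups)


# Made-up IDs; all of these could carry the broad biolink:NamedThing category.
sample_ids = ["MONDO:0001", "HP:0001", "UBERON:0001", "HP:0002"]
groups = group_ids_by_prefix(sample_ids)
print(groups)
```

Picking out one group (for example `groups["HP"]`) then gives the terms from a single namespace even when the category column is not informative.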

There are just under a million gene expression edges (category: biolink:GeneToExpressionSiteAssociation) in the edges file, which are all imported from the Alliance of Genome Resources.
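
A minimal sketch of pulling those expression edges out of the edges file and collecting the expression sites per gene is shown below. It assumes the edges file is tab-separated with `subject`, `object`, and `category` columns (the usual KGX layout); the function name and sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

EXPRESSION_CATEGORY = "biolink:GeneToExpressionSiteAssociation"


def expression_sites_by_gene(lines):
    """Map each gene ID (subject) to the set of expression sites (objects).

    Only rows whose category equals EXPRESSION_CATEGORY are used.
    Assumes "subject", "object" and "category" columns in the header.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    sites = defaultdict(set)
    for row in reader:
        if row.get("category") == EXPRESSION_CATEGORY:
            sites[row["subject"]].add(row["object"])
    return dict(sites)


# Made-up rows standing in for the real edges file.
SAMPLE_EDGES = (
    "subject\tpredicate\tobject\tcategory\n"
    "MGI:0001\tbiolink:expressed_in\tUBERON:0001\tbiolink:GeneToExpressionSiteAssociation\n"
    "MGI:0001\tbiolink:expressed_in\tUBERON:0002\tbiolink:GeneToExpressionSiteAssociation\n"
    "MGI:0001\tbiolink:interacts_with\tMGI:0002\tbiolink:PairwiseGeneToGeneInteraction\n"
    "ZFIN:0003\tbiolink:expressed_in\tZFA:0001\tbiolink:GeneToExpressionSiteAssociation\n"
)

sites = expression_sites_by_gene(io.StringIO(SAMPLE_EDGES))
print(sites)
```

Because the rows are read one at a time, the same function can be handed an open file handle for the full edges file without loading it all into memory first; only the resulting gene-to-sites mapping is kept.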

ahosseinian commented 2 years ago

Found them, thanks @kevinschaper. Do you have any estimate of when you'll add variants?

kevinschaper commented 2 years ago

I'm not 100% sure, but I don't think they'll be coming in any time soon. Right now we're focused on a somewhat tighter data model, and a rewrite of our whole software stack against it.

ahosseinian commented 2 years ago

I see. Thanks!