jhu-bids / TermHub

Web app and CLI tools for working with biomedical terminologies. https://github.com/orgs/jhu-bids/projects/9/views/7
https://bit.ly/termhub
GNU General Public License v3.0

Graph DB: OAK: Satisfy unmet requirements #516

Closed: Sigfried closed this issue 4 months ago

Sigfried commented 1 year ago

Overview

This is a discussion to determine whether OAK meets TermHub/TIMS requirements.

Pros/cons and other issues

In addition to whether OAK meets specific requirements, here's a general pros/cons list, as well as some issues we encountered in TermHub during a test run (I would bet that some of these issues are already resolved).

General pros/cons

#### Pros
1. Community support: Though the community is smaller, we have direct access to it
2. Hiring: Though the pool is limited, we can borrow FTEs from our group(s)
3. Promotes use and development of our group(s)' software

#### Cons
1. Community support: Resources not as widespread, e.g. articles, StackOverflow, LLM support
2. Hiring: More limited pool of talent with experience, though that can be hard with other options as well
3. Development: Some needs are not ready yet or are still in development
4. Documentation: I don't think there are docs that clearly address a lot of the use cases identified here, though OAK otherwise has good documentation and could be further updated
5. Build time: I believe it takes 2-3 hours to build SemSql. Perhaps this will be much shorter if we skip the unnecessary robot normalization step. I wonder how long loading would take in Neo4J/JanusGraph; if similar, then this is a non-issue. It's not the hugest problem though.

TermHub OAK test run issues

Requirements list

Requirements 1-3 and 5 were originally posted in TIMS dev meeting notes Aug 14. Siggie added '4' here.

If a box is checked, we feel that OAK adequately addresses the requirement. It might also be nice if we as a team could give a rating 1-10 on how well we think OAK addresses it.

Requirements details

1. Transactionality

Solved when we get a Postgres adapter.

Details

#### OAK options
i. PostgreSQL

What is the primary advantage over sqlite? Sqlite is ACID compliant. Answer: PostgreSQL supports more complex transactions and provides a more robust concurrency model ([source](https://www.geeksforgeeks.org/difference-between-sqlite-and-postgresql/)).

**Problem**: Adapter doesn't exist yet. But this may not be necessary anyway.
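For context on what the transactional piece would buy us, here's a minimal sketch of atomic writes against a SemanticSQL-style `statements` table in Postgres, whether that ends up behind an OAK adapter or our own plumbing. The connection string, helper name, and exact column usage below are assumptions, not working TermHub code:

```python
"""Minimal sketch: transactional writes to a SemanticSQL-style `statements`
table in Postgres. The DSN and column usage are assumptions; the point is only
that Postgres gives us multi-statement ACID transactions under concurrent
writers, which is harder with sqlite."""
import psycopg2

DSN = "dbname=termhub user=termhub host=localhost"  # hypothetical connection string

def replace_label(curie: str, new_label: str) -> None:
    """Delete the old rdfs:label row and insert the new one atomically."""
    conn = psycopg2.connect(DSN)
    try:
        with conn:  # psycopg2: commit on success, rollback on exception
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM statements WHERE subject = %s AND predicate = 'rdfs:label'",
                    (curie,),
                )
                cur.execute(
                    "INSERT INTO statements (subject, predicate, value) VALUES (%s, 'rdfs:label', %s)",
                    (curie, new_label),
                )
    finally:
        conn.close()

# replace_label("OMOP:1123893", "bisoprolol and acetylsalicylic acid; systemic")
```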


2. CRUD

Solved?: We just write directly to OAK in Postgres or sqlite?

Details

Need to be able to do CRUD (Creates, Reads, Updates, Deletes). _Siggie also wrote:_ The value set data changes frequently -- by the minute or hour -- and needs to be up-to-date in the graph database.

#### OAK options
i. KGCL

**Problems**: Joe: Not sure yet. Need to understand this better to see if it meets our needs. Not familiar w/ KGCL, nor how it might be used w/ OAK / whether it addresses our needs.

ii. Write directly to DB

If using Postgres or Sqlite, we can write to the db (I believe only the `statements` table).

**Implementation problems**: Not a huge problem, but on our end we have to think about how we want to do this. We could have (a) 2 separate DBs and use OAK only for certain things, or (b) two linked DBs (e.g. TermHub's main DB, and OAK). Options to keep them in sync: (a) write some code that will periodically sync the two, (b) wrap all of our write functions such that they update both databases (rough sketch below), or (c) some Postgres routines that automatically run when certain tables are updated.

**Problems that apply to any option:** Per my comments under _"4. Query complexity" > "OAK options" > "iii. Integration w/ `cx`/`nx` packages"_. If we need `cx`/`nx`, I'm worried we'd have to re-render these additional structure(s) on every create, update, or delete.
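A rough sketch of sync option (b), wrapping our write path so TermHub's Postgres and the OAK/SemSQL database stay in step. The function names, file path, and `statements` columns are all hypothetical:

```python
"""Sketch of sync option (b): every TermHub write also updates the OAK/SemSQL
statements table. Function and table names here are illustrative only."""
import sqlite3

SEMSQL_DB = "n3c.db"  # assumed path to the SemanticSQL sqlite build

def write_to_termhub_postgres(curie: str, label: str, parent_curie: str | None) -> None:
    """Placeholder for TermHub's existing Postgres write logic."""
    ...

def add_concept(curie: str, label: str, parent_curie: str | None = None) -> None:
    write_to_termhub_postgres(curie, label, parent_curie)    # existing write path
    with sqlite3.connect(SEMSQL_DB) as con:                   # keep OAK's view in sync
        con.execute(
            "INSERT INTO statements (subject, predicate, value) VALUES (?, 'rdfs:label', ?)",
            (curie, label),
        )
        if parent_curie:
            con.execute(
                "INSERT INTO statements (subject, predicate, object) VALUES (?, 'rdfs:subClassOf', ?)",
                (curie, parent_curie),
            )
```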


3. Web API

Details

TIMS and TermHub aim to use the same database, so it needs to be deployed independently of each project and be internet accessible.

#### OAK options
i. Can create one

**Problem**: Doesn't exist yet. We can also create our own; OAK doesn't need to do so. But that's additional work.
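If we end up building the API ourselves, the thinnest version is probably a small HTTP layer over an OAK adapter. A sketch assuming FastAPI and a SemanticSQL build at `n3c.db` (both assumptions; the OAK calls are from its basic interfaces):

```python
"""Minimal sketch of a read-only web API over an OAK adapter, so TIMS and
TermHub could hit the same deployment. FastAPI and the n3c.db path are
assumptions."""
from fastapi import FastAPI
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

app = FastAPI()
adapter = get_adapter("sqlite:n3c.db")  # SemanticSQL build of N3C OMOP (assumed path)

@app.get("/label/{curie}")
def label(curie: str) -> dict:
    return {"id": curie, "label": adapter.label(curie)}

@app.get("/ancestors/{curie}")
def ancestors(curie: str) -> list[str]:
    # is-a closure only; add more predicates as needed
    return list(adapter.ancestors(curie, predicates=[IS_A]))
```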


4. Query complexity

Details

I (Joe) think that OAK mainly operates on pre-built functions, so it's possible there will be queries we want to do that OAK doesn't support.

#### Examples of problems
Currently we do our graph analysis using a graph tool -- NetworkX, which could be replaced by OAK -- but it forces our application to do most data retrieval in Postgres, including a lot of graph-like analysis, with only a small part of the complex stuff in the graph tool. _(Joe: Not sure what the issue is here, per OAK or otherwise.)_

#### OAK options
i. Pseudo query language

[Chris wrote](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684036077):

> powerful queries that combine lexical search with graph traversal

**Problem**: This is probably different from the power of a full query language (though IDK if we're sure what all of our use cases for that will be). I'm not seeing anything in the docs close to a cypher/gremlin/graphql query. Maybe the closest is [this page](https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html), but these are still method calls and Python manipulations (see the sketch below for what that looks like).

ii. OAK likely has what we need

[Chris wrote](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684698031):

> OAK is based on 25 years of experience working with a diverse range of use cases for ontologies from basic science to clinical research, I would be surprised if there was something that was not either immediately possible or easy to implement using OAK as a building block.

**Problem**: Chris may be right, and I often trust his instincts. But this is somewhat going on faith. I should say that what we're doing is more in the realm of web applications and backend infrastructure than "basic science and clinical research", so there may well be different use cases. Additionally, "OAK as a building block" may be true, but I wonder whether some solution we would need to implement that includes OAK + other things could simply be done by one single graph DB, which would be easier to implement and less fragile.

iii. Integration w/ `cx`/`nx` packages

[Chris said](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684698031):

> ...you can export subgraphs to nx and do all your graph theoretic work there. We also have bridges to cx, so you can easily upload your graphs to ndex...

**Problems**: First we need to learn more about `cx`/`nx` in order to evaluate. If one/both of these turns out to be a need, some additional concerns are: (i) it adds an extra step to our setup pipeline (not a big problem); (ii) web hosting: I worry that this could introduce memory issues or other implementation/stability problems; (iii) 2 APIs: one where we use OAK normally and another for this makes for more fragile/confusing code (not huge); (iv) CRUD: I fear `cx`/`nx` would not work well, because if we have users making small updates to our database, my guess is that we would have to do another full re-render in `cx`/`nx`. This problem scales with the number of users.
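To make this concrete, here's roughly what "method calls and Python manipulations" means in practice: chaining lexical search into graph traversal, then hand-exporting relationships to networkx for anything else. The db path, search term, and predicate filter are just examples, not a vetted recipe:

```python
"""Sketch of OAK's 'pseudo query language' style (method calls rather than
cypher/gremlin), plus exporting relationships to networkx for graph-theoretic
work. The db path and search term are assumptions."""
import networkx as nx
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")  # assumed SemanticSQL build

# 1. Lexical search -> graph traversal, chained in Python
hits = list(adapter.basic_search("diabetes"))
ancestors = set()
for curie in hits:
    ancestors.update(adapter.ancestors(curie, predicates=[IS_A]))

# 2. Re-render (a subset of) the edges as a networkx DiGraph;
#    note this is the full re-render I worry about having to repeat after CRUD
g = nx.DiGraph()
for subj, pred, obj in adapter.relationships(predicates=[IS_A]):
    g.add_edge(subj, obj, predicate=pred)
print(g.number_of_nodes(), g.number_of_edges())
```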


5.1 Data variety: Additional structures

Details

In addition to (i) edges for vocabulary structures, we need (ii) concept metadata, (iii) relationship metadata, (iv) full N3C/OMOP vocabulary tables, and (v) versions of these tables.

#### OAK options
- edges for vocabulary structures (i): OAK has
- concept metadata (ii): ? (see sketch below)
- relationship metadata (iii): ?
- full N3C/OMOP vocabulary tables (iv): Can get via solutions for _"6. Solution for SemanticSQL memory issues"_
- versions (v): ?
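For (ii), a quick way to check what concept metadata survives the OWL-to-SemanticSQL round trip; a sketch, where the db path and CURIE form are assumptions and `entity_metadata_map` is the OAK call I believe applies here:

```python
"""Sketch: pull concept metadata for one OMOP term via OAK and see whether the
OMOP-specific annotation properties (concept_class_id, domain_id, ...) are
preserved. The db path and CURIE are assumptions."""
from oaklib import get_adapter

adapter = get_adapter("sqlite:n3c.db")
curie = "OMOP:1123893"  # bisoprolol and acetylsalicylic acid; systemic (assumed CURIE form)
print(adapter.label(curie))
for prop, values in adapter.entity_metadata_map(curie).items():
    print(prop, values)
```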

5.2. Data variety: ValueSets

Details

We need value sets (OMOP/N3C, eventually ATLAS and VSAC) in the graph database.

#### OAK options
i. LinkML?

LinkML enumerations: https://linkml.io/linkml/intro/tutorial06.html (rough sketch below)

[Chris said](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684710711):

> OAK delegates anything with value sets modeling to LinkML, which handles both intensional and extensional value sets. ...These don't get stored in the sqlite along with the ontology... shouldn't be an obstacle, once I understand the use case more I can provide more examples.

**Problems**: This means we have to use multiple interfaces to execute queries. It's not something we can't do, but it may be less optimal / harder to maintain than whatever might be done with a single graph DB.
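For reference, the "dynamic enum" pattern from the LinkML tutorial looks roughly like the fragment below (YAML embedded in a Python string just for illustration). The enum names, ontology selector, and concept IDs are made up, not real TermHub artifacts:

```python
"""Sketch of how an intensional vs. extensional value set might be modeled as
LinkML enums (dynamic enums per the LinkML tutorial). The schema fragment is
illustrative only; names and IDs are not real TermHub artifacts."""
import yaml

schema_fragment = yaml.safe_load("""
enums:
  Type2DiabetesIntensional:          # intensional: defined by a query
    reachable_from:
      source_ontology: sqlite:n3c.db
      source_nodes:
        - OMOP:201826                # hypothetical root concept
      relationship_types:
        - rdfs:subClassOf
  Type2DiabetesExtensional:          # extensional: explicit member list
    permissible_values:
      OMOP:201826: {}
      OMOP:4193704: {}
""")
print(list(schema_fragment["enums"]))
```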


6. Solution for SemanticSQL memory issues

Solved?: Moot if we go the Postgres route.

Details

Couldn't create a SemSQL DB for N3C OMOP. Ran out of memory even with 100GB, even when selecting only ~10% of relationship types. (We want a database with which we can do graph analysis on really large data. Siggie: Couldn't get the support we needed to get it done. Joe: I think this one was more technical limitations than support.)

#### OAK options
i. Skip robot normalization step

Chris mentioned this in [this thread](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684698031) and [on Slack](https://monarchinitiative.slack.com/archives/C05CJ1M9W73/p1686600919465219). Joe highlights [on Slack](https://monarchinitiative.slack.com/archives/C05CJ1M9W73/p1686606626411909?thread_ts=1686600919.465219&cid=C05CJ1M9W73) where to do this in the SemSql codebase.

**Problems**:
(i) **Memory still high**: Still required 48GB on Chris' machine, even with just ~10% of relationship types. **Possible solutions**: Joe: Add an option to SemSql to just set up empty database tables and stream / load in data using the same methods we'd be using for _"1. Transactionality"_ and _"2. CRUD"_?
(ii) **No CLI option**: Until SemSql adds this option (or programmatically determines whether normalization is necessary), this requires more than simply calling the package; it requires a custom development setup.

ii. Write own loaders using TSV

[Chris said](https://github.com/jhu-bids/TermHub/issues/516#issuecomment-1684698031): you can write your own loaders by literally just preparing a few TSVs to bulk upload into sqlite/pg.

Joe: Need to learn how. Somehow read the ontology (using which tool?) and make a TSV in the format of the `statements` table? Then load it into sqlite somehow? (Rough sketch below.)
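My rough understanding of option ii: build the `statements` rows straight from the OMOP CSVs and bulk-load them, skipping OWL/robot (and the memory blow-up) entirely. A sketch; the `statements` column layout below is what I believe rdftab/semsql uses, and the CSV column names are the standard OMOP ones, so treat the details as assumptions:

```python
"""Sketch of a hand-rolled SemanticSQL-style loader: stream OMOP concept.csv
and concept_relationship.csv into a `statements` table without going through
OWL/robot. Column layout of `statements` is assumed from rdftab; adjust to
whatever semsql actually expects."""
import csv
import sqlite3

con = sqlite3.connect("n3c.db")
con.execute("""CREATE TABLE IF NOT EXISTS statements
               (stanza TEXT, subject TEXT, predicate TEXT,
                object TEXT, value TEXT, datatype TEXT, language TEXT)""")

def rows_from_concept(path="concept.csv"):
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            curie = f"OMOP:{r['concept_id']}"
            yield (curie, curie, "rdfs:label", None, r["concept_name"], None, None)

def rows_from_relationships(path="concept_relationship.csv"):
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            if r["relationship_id"] != "Is a":   # keep the sketch small
                continue
            s, o = f"OMOP:{r['concept_id_1']}", f"OMOP:{r['concept_id_2']}"
            yield (s, s, "rdfs:subClassOf", o, None, None, None)

for gen in (rows_from_concept(), rows_from_relationships()):
    con.executemany("INSERT INTO statements VALUES (?,?,?,?,?,?,?)", gen)
con.commit()
```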

Related

cmungall commented 1 year ago

Can you post (or slack) a link to the ontology you are having issues loading?

joeflack4 commented 1 year ago

@cmungall It's not an ontology per se, but all of N3C's OMOP, represented as a single ontology: https://github.com/HOT-Ecosystem/n3c-owl-ingest/

The reason we need a single DB is because it needs to be able to walk across mappings between the various OMOP vocabs in a single call.

cmungall commented 1 year ago

can you slack or post a link to the .owl file?

cmungall commented 1 year ago

I think I have an older version you sent to me, this had a syntax error:

        <omoprel:Maps_to rdf:datatype="http://www.w3.org/2001/XMLSchema#string">OMOP:1000560</omoprel:Maps_to>

where no omoprel prefix was declared - not sure if this was fixed

joeflack4 commented 1 year ago

can you slack or post a link to the .owl file?

@cmungall Sure; there's a link in the release, but it was broken anyway. Here: download link (5GB)

I think we fixed the omoprel issue. AFAIK just a memory issue now.

cmungall commented 1 year ago

I don't have access to that google drive link.

the assets are much smaller:

_[screenshot of the release assets, showing much smaller file sizes]_
Sigfried commented 1 year ago

@cmungall, you can see the raw (csv) data that TermHub is based on here, and here are the file sizes in kilobytes:

   19,312   code_sets.csv
1,858,392   concept.csv
  909,912   concept_ancestor.csv
1,994,360   concept_relationship.csv
    1,640   concept_set_container.csv
      240   concept_set_counts_clamped.csv
2,057,520   concept_set_members.csv
  463,624   concept_set_version_item.csv
   12,480   deidentified_term_usage_by_domain_clamped.csv
      104   relationship.csv
7,317,584   total

We retrieve updates to the frequently updated data through APIs rather than csvs, so a data management solution (even for read-only) would preferably allow API-based inserts/updates rather than (only) a data pipeline from csv to database. We will also be doing more active CRUD originating from our web app, but have delayed building that, in part, until we decide on a long-term data management solution with better graph data support than Postgres provides.

Also, our data-management solution will need to provide high-performance graph queries to support a data-dense, highly-interactive visualization interface to combined vocabulary and value set data.

joeflack4 commented 1 year ago

@cmungall Aw, sorry, I really thought this was shared openly.

I just looked, though, and it shows that you have access under your @lbl.gov account:

_[screenshot of the Google Drive sharing settings]_

I was unable to change sharing settings in this particular folder. Maybe Shahim just gave you access?

If for some reason it still won't let you download it, let me know and I'll upload another copy somewhere else for you.

cmungall commented 1 year ago

Thanks for the link - I assume n3c.owl was the intended ontology. This one didn't have any syntax errors - thanks! I made a sqlite db of this - shall I deposit it somewhere for you (13G)? Is it OK to deposit in a public s3 bucket?

I did this on my mac laptop. I had to increase memory to 48G for relation-graph to finish, but even this step is optional.

My next step is to make a jupyter notebook demonstrating the functionality you feel is missing (for example I am not sure where you got the idea that OAK can only do a fixed set of queries, one of the main use cases is being able to quickly construct powerful queries that combine lexical search with graph traversal). Where shall we start?

cmungall commented 1 year ago

The structure of the ontology is unusually flat, with the majority of terms being singletons:

sqlite> select count(*) from edge;
889853
sqlite> select count(*) from class_node ;
6902294

for example:

    <!-- https://athena.ohdsi.org/search-terms/terms/1123893 -->

    <owl:Class rdf:about="https://athena.ohdsi.org/search-terms/terms/1123893">
        <rdfs:label>bisoprolol and acetylsalicylic acid; systemic</rdfs:label>
        <terms:concept_class_id>ATC 5th</terms:concept_class_id>
        <terms:concept_code>C07FX04</terms:concept_code>
        <terms:domain_id>Drug</terms:domain_id>
        <terms:standard_concept>C</terms:standard_concept>
        <terms:valid_end_date>2099-12-31</terms:valid_end_date>
        <terms:valid_start_date>1970-01-01</terms:valid_start_date>
        <terms:vocabulary_id>ATC</terms:vocabulary_id>
    </owl:Class>

Before we progress much further, does this reflect what you would expect? Note there is no logical axiom connecting this concept to any other, which seems very odd to me.

If we look at https://athena.ohdsi.org/search-terms/terms/1123893

We see

_[screenshot of the Athena page showing the term's parent hierarchy]_

I would expect that to be mapped to a subclass axiom.

Note that OAK is only as good as the input that is provided to it; OAK is probably not going to be very useful on a largely flat ontology of singletons.

I think it may be better to add OAK loaders directly from the raw csvs (do these conform to the OMOP standard tables?) - this should be easy but I don't know when we would get to it.

joeflack4 commented 1 year ago

Successful conversion

Yes, n3c.owl is correct; that's the one I shared in the download link earlier; not sure why it was inaccessible before.

Converted? That's super appreciated and also I'm surprised. On @ShahimEssaid's Windows PC, he ran out of memory with even ~100GB allocated. That makes me wonder if this n3c.owl is truly the right one, but it is in the Aug 1st folder, so I feel like it should be correct.

I made a folder where you should have edit permissions to upload it: here

Analysis of n3c.owl

Your inquiry here is also very appreciated. We're OK w/ singletons. We originally left out singletons per your recommendation but @Sigfried thinks we do have need for them. I do agree that ideally we want to use 1 API to access information about any node. We could have logic such that "if it's not a singleton, look it up in OAK, else, look it up in Postgres", but ideally not.

As for there being so many singletons, I think this is for 2 reasons: (i) Missing rels: some of these concepts are connected via one of the ~90% relationships we had to leave out in order to try to get the SemSql conversion to succeed; and (ii) an 'Is a' bug: there is a bug that you found, and I found out why. At first, we only wanted "Is a", and we already had a handy filtered table that only included "Subsumes". Siggie and I thought that these Subsumes rels were a direct 1:1 inversion of "Is a", so we were just using that table. I've checked now and see that they are not 1:1! I was already planning on removing this inversion and adding "Is a" edges directly; now I know that I need to do that to fix this bug.
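For the fix, the plan is basically to take "Is a" rows straight from concept_relationship rather than inverting the pre-filtered "Subsumes" table; something like this (pandas sketch, file paths assumed):

```python
"""Sketch of the fix for the 'Is a' bug: take 'Is a' edges directly from
concept_relationship.csv rather than flipping the pre-filtered 'Subsumes'
table, since the two are not a 1:1 inversion. Paths are assumptions."""
import pandas as pd

rels = pd.read_csv("concept_relationship.csv",
                   usecols=["concept_id_1", "concept_id_2", "relationship_id"])
is_a = rels[rels["relationship_id"] == "Is a"]   # child -> parent, as asserted
print(f"{len(is_a):,} 'Is a' edges")
is_a.to_csv("is_a_edges.csv", index=False)
```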

Jupyter notebook

This is appreciated if you do it, but I'm not sure this is the best use of your time (see below section).

I also think we cannot currently anticipate all of the future graph queries that we would need to do if we were to try to enumerate them for you now.

Double-checking OAK meets / should meet requirements

Before you think about Jupyter examples, I think we should be sure that OAK really can address all of the requirements outlined in the OP.

I just edited the OP to better label and enumerate the issues / requirements.

There's some scope creep because I think we just needed more time to think about what we need.

I don't think that OAK can address all of these. Do you think so? Even if it can't yet and you do want OAK to do this in the future, I would wonder why. I think OAK is a great tool for ontology access. It provides things that no other tool is providing. But I don't see why it should become a full-fledged graph database; that sounds like a mistake. Is that really the intent?

I just want to highlight a clarification regarding requirement/issue 1, "SemanticSQL memory issues". The file you converted only used ~10% of the OMOP relationship types. There's a critical subset of relationship types we'd like to convert that is maybe 4-5 times larger, but ideally we want all of them. It's hard to anticipate in advance when we will have some future need for any of these other relationship types.

cmungall commented 1 year ago

Converted? That's super appreciated and also I'm surprised. On @ShahimEssaid's Windows PC, he ran out of memory with even ~100GB allocated.

Yes, I think I mentioned before that you can skip the robot step that normalizes to RDF/XML, because it's already normalized.

I made a folder where you should have edit permissions to upload it: here

Thx! Uploaded

I also think we cannot currently anticipate all of the future graph queries that we would need to do if we were to try to enumerate them for you now.

OAK is based on 25 years of experience working with a diverse range of use cases for ontologies from basic science to clinical research, I would be surprised if there was something that was not either immediately possible or easy to implement using OAK as a building block.

i. Missing rels: some of these concepts are connected via one of the ~90% relationships we had to leave out in order to try to get the SemSql conversion to succeed,

Let's see if we can do this with the complete ontology!

I don't think that OAK can address all of these. Do you think so? Even if it can't yet and you do want OAK to do this in the future, I would wonder why. I think OAK is a great tool for ontology access. It provides things that no other tool is providing. But I don't see why it should become a full-fledged graph database; that sounds like a mistake. Is that really the intent?

It's not intended to become a database - in fact it wraps databases. However, graphs are a central data structure in bioinformatics and central to everything we do in Monarch, so it's not surprising OAK should be able to do a lot of this - as I say I don't think the requirements between your monarch world hats and your TIMS/TH hats are so different, which is why I am not convinced we need different technology stacks.

Having said that, I often see a naive tendency for computer scientists to approach ontologies through a lens of graph theory and start producing network centrality statistics and what not, not realizing that an ontology graph is fundamentally structured differently than say a PPI. But if folks want to do that kind of thing, there is a bridge between OAK and nx, you can export subgraphs to nx and do all your graph theoretic work there. We also have bridges to cx, so you can easily upload your graphs to ndex, view in cytoscape... we have your graph needs covered!

You might find this useful reading (for thinking about OAK in TH but also for Mondo): https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html

I just want to highlight a clarification regarding requirement/issue 1, "SemanticSQL memory issues". The file you converted only used ~10% of the OMOP relationship types. There's a critical subset of relationship types we'd like to convert that is maybe 4-5 times larger, but ideally we want all of them. It's hard to anticipate in advance when we will have some future need for any of these other relationship types.

I want to emphasize that semantic-sql is just a layer over relational databases (with most support currently for sqlite but likely pg soon), and RDBMSs are very good for being performant within whatever memory you can give them. The issues you have been having have been loader issues. While I appreciate these are frustrating (I know you also encountered frustrating issues when loading the edit version of mondo), the good news is (a) it's on the roadmap to make the loaders more straightforward, bypassing dependence on robot/owlapi (this has always been the memory issue you faced, it's not in semsql itself) (b) you can write your own loaders by literally just preparing a few TSVs to bulk upload into sqlite/pg.

cmungall commented 1 year ago

I'll address some of other use cases gradually. But I think the most fundamental conceptual mismatch is that OAK delegates anything with value sets modeling to LinkML, which handles both intensional and extensional value sets. So right now these don't get stored in the sqlite along with the ontology (generally these are pretty small) but this shouldn't be an obstacle, once I understand the use case more I can provide more examples.

As far as transactions, as I mentioned the OAK sql adapter can be layered on Pg which has ACID transactions. But I think you mean basic CRUD operations? This is of course handled, and in fact we have a complete change model (KGCL) which I think you should be familiar with from Mondo...

joeflack4 commented 1 year ago

@Sigfried Sorry for editing your post/title. It seemed like the discussion quickly became OAK-centric, so I created a separate issue just for Graph DB requirements based on what you had here and elsewhere. I updated your post to be more OAK specific. I also looked at all the currently open graph db related issues, added missing ones to the milestone, and created a grouping issue. The advantage of the grouping issue in this situation is that it has subgroups.

cmungall commented 1 year ago

Do you have any examples of value sets I could add to the notebook (both intensional and extensional)?

joeflack4 commented 1 year ago

Hey @cmungall other than grabbing this example from the FHIR docs, I don't have anything handy right now. We could maybe pull something off of VSAC at some point for a better example.

LOINC Serum Cholesterol

I'm not 100% sure that the extensional one is really an expansion of the intensional one. I'd expect to see the intensional parent in the expansion JSON somewhere, but I don't see it.


In any case, if you look at the original post of this issue, you'll see that I organized it in a systematic way. I wanted to get at each requirement/issue, what your proposals were, and any problems identified with each proposal.

I discussed with Lisa and Tim briefly at the TIMS meeting today, but Shahim is out for 2 weeks, and we want to discuss with him before we get back to you more on this.

Jupyter examples would be great, but I would feel bad if you spent much time on it and ultimately determined that the burden of pros/cons and requirements favors another option over OAK.

cmungall commented 1 year ago

Are there any value sets that use the OMOP IDs directly? Ideally I would come up with a coherent example that demonstrates use of the ontology and value sets together. Or if you like, we can just use the mappings to go from the OMOP IDs to LOINC IDs?

I would have thought there was a place or service that would allow downloading value sets in bulk...

cmungall commented 1 year ago

Transactionality / CRUD

It seems that this is primarily for value sets - for example

rather than changes in the ontology itself. In that case you can disregard my comment on KGCL. As you know from your work on Mondo, KGCL is for describing changes to an ontology.

Note that with OAK, value sets are managed externally, they aren't stored in the same database as the ontology. The default way to handle CRUD is trivial updates on the objects and serialization to YAML. However, if you wanted value sets managed in the same SQL database this is just mere plumbing.
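In other words, the default flow is roughly: load the value-set YAML, mutate the object, write it back. A tiny sketch; the file name and structure below are illustrative, not an OAK/LinkML artifact:

```python
"""Sketch of 'trivial updates on the objects and serialization to YAML' for an
externally managed value set. File name and structure are illustrative only."""
import yaml

with open("dm_type2_v1.yaml") as f:            # hypothetical value-set file
    vs = yaml.safe_load(f)

vs["concepts"].append("OMOP:4193704")          # add a member (extensional edit)
vs["version"] = vs.get("version", 1) + 1

with open("dm_type2_v1.yaml", "w") as f:
    yaml.safe_dump(vs, f, sort_keys=False)
```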

[more later...]

joeflack4 commented 1 year ago

Value sets examples

@cmungall Actually, I forgot earlier that I could have just used TermHub to get you some examples, and they use N3C OMOP as well. For bulk download of value sets, VSAC, as I mentioned earlier, works as well, but it's been a long time since I've used it. You can also sort of do that via the N3C Enclave APIs, which is what we use to populate TermHub.

Here's the value set / concept set I chose, "[DM]Type2 Diabetes Mellitus (v1)" (view on TermHub):

Value set functionality

@hlehmann17 wrote something in an email that I'll copy/paste here. We'll want to do these operations, and I wonder if they are best done in a graph DB, OAK, or just as well done in a traditional DB or other option (e.g. in memory Python):

  • Compare value sets
  • Find neighborhoods of value sets
  • Operate on value sets (union, intersection, subtraction)

CRUD of value sets

I hear you and TBH I don't know what @Sigfried's plan is regarding integrating value sets into the same graph or graphs as the N3C OMOP versions (and possibly other ontologies) that we want in our graph DB. They're on vacation until Wednesday but maybe they can elaborate at some point.

@Sigfried It could be that the CRUD Siggie is thinking of might not (only) be for value sets. TBH, for value sets my intuition is that a relational model is fine and we don't strictly need them in the graph DB. @Sigfried, can you confirm that the only writes we need to do are on concept sets, not on the ontologies themselves? I suppose that makes sense. I mean, we will need to add new ontology versions, but we can do this in the build process rather than in a live environment. Further, @Sigfried, can you elaborate: what % certain would you say you are that we need concept sets in the graph DB as well? If not, is there really a need for transactions / writes on the graph DB?

KGCL

Depending on how Siggie responds above, perhaps you are right and we don't need it. Also I don't know anything about KGCL other than what it's for and what the acronym stands for.

cmungall commented 1 year ago

We'll want to do these operations, and I wonder if they are best done in a graph DB, OAK, or just as well done in a traditional DB or other option (e.g. in memory Python):

  • Compare value sets
  • Find neighborhoods of value sets
  • Operate on value sets (union, intersection, subtraction)

This is great! This is very much the bread and butter of every bioinformatics ontology library written in the last few decades. I will add some examples to the notebook once I have time to find some suitable value sets. As you may be aware, the focus in Monarch is on phenotypic profile comparison - e.g. given a mondo disease, find similar diseases or genes based on properties in common. This is implemented in OAK completely generically so it works for any term set.
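Until the notebook exists, here's a rough sketch of those three operations with OAK; the db path and the example member sets below are made up:

```python
"""Sketch of the three value-set operations from the email: compare, find
neighborhoods, and union/intersection/subtraction. Value sets are treated as
plain sets of CURIEs; db path and example IDs are assumptions."""
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")

vs_a = {"OMOP:201826", "OMOP:4193704"}   # illustrative members, not a real concept set
vs_b = {"OMOP:201826", "OMOP:443238"}

# Operate: union, intersection, subtraction
print(vs_a | vs_b, vs_a & vs_b, vs_a - vs_b)

# Compare: Jaccard similarity of the two member sets
print(len(vs_a & vs_b) / len(vs_a | vs_b))

# Neighborhood: expand each member to its is-a ancestors
neighborhood = set()
for curie in vs_a:
    neighborhood.update(adapter.ancestors(curie, predicates=[IS_A]))
print(len(neighborhood))
```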

I can also demonstrate some things that might not be on your radar yet...

Sigfried commented 1 year ago

Sorry, I've been on vacation. Back now. Immediate thoughts:

joeflack4 commented 11 months ago

Updates from today's meetings.

LinkML enumerations

Melissa (edit: reading previously, looks like Chris agrees) thinks that perhaps for us to use OAK to work with value sets, we may need to utilize LinkML enumerations: https://linkml.io/linkml/intro/tutorial06.html

TIMS-specific requirements

I think that everything Shahim mentioned is covered in the OP of this issue, except for a Java interface. Personally I think this is a plus, not a requirement. The interface should be HTTP since this will sit in between TermHub/HAPI.

joeflack4 commented 9 months ago

Update from today's meeting with Chris M & Shahim:

The plan is that TIMS will not use OAK and will not create an API for TermHub. It will stick to more FHIR-specific use cases and will try to implement using HAPI's relational DB.

So now the plan is to continue using networkx until we need to switch to OAK, but at that time we will need to work with Chris to satisfy unmet requirements. Hopefully Chris can help out, but it's possible we may be on our own.

The OP hasn't been updated to reflect the latest changes in understanding for each requirement, but my current thinking on each is:

joeflack4 commented 4 months ago

OAK not currently planned.