monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
6 stars 3 forks source link

QC check for: Set of Mondo terms in `mondo.owl` & `mondo.sssom.tsv` differ #384

Open joeflack4 opened 10 months ago

joeflack4 commented 10 months ago

Overview

I was working on https://github.com/monarch-initiative/mondo-ingest/pull/363#discussion_r1411556110 today. I had just gotten my new computer, so I was finally able to build mondo.db, so I ran the goal for that as well as mondo.sssom.tsv to make sure that both were up-to-date. I expected then that the list of terms would be the same in each (I also accounted for obsolete terms). But that is not the case.

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv. I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

out_of_sync_mondo_ids.tsv.zip (Google Sheet)

Update: A. Non-problematic: in ontology only B. Problematic: in SSSOM only

matentzn commented 10 months ago

See my comment here: https://github.com/monarch-initiative/mondo-ingest/pull/363#discussion_r1411817373

joeflack4 commented 10 months ago

Copy/pasting our discussion from aforementioned comment.

Joe wrote:

As written in the issue:

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv. I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

The implementation for this changed. Right now I'm just doing a static check of all of the Mondo IDs in mondo.db and mondo.sssom.tsv.

The implementation I had before gave more information based on which terms were missing by source, but I didn't think that information was important, so I changed it.

Before, I was printing a warning for sources where the number of out of sync terms were greater than 200. Now I'm saving a report and running the same check every time (source agnostic), but if it's greater than 200 I'm still warning. But now I'm also always saving this report: out_of_sync_mondo_ids.tsv.zip

Questions

  1. @matentzn If this is a static QC check that does not depend on any source nor on this synchronization work, should I make a separate PR for this? Should I add this to some kind of QC command or a test (maybe eventually QC checks will be run as Python tests?)?
  2. @matentzn Do you want me to change this back to how I had it before? Some kind of breakdown by source, rather than simply checking mondo.db and mondo.sssom.tsv against each other? Or should I do both--both checking those against each other, and some kind of breakdown by source?

Nico wrote:

For reports that require discussion, its better to have them on Google sheets so I can click immediately;

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv.

For now, its best to look at once example and then decide if it is an issue. I would have expected around 1000K terms in Mondo that do not have any associated mappings, and therefore do not show up in the mondo.sssom.tsv. So this one seems ok to me.

I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

Again it would be good to look at some examples. Out-of-syncness is only 1 hypothesis; another one is that the sssom file contains 2000 obsolete classes that are not present in mondo.db (which is interesting to know, as I would not agree with the oak design in this case).

I fully agree with a simple QC check that ensures a minimum of X classes are being processed, ideally on a per resource basis; but for now I suspect the cases you found have good explanations.

joeflack4 commented 10 months ago

@matentzn If you want we can discuss at meeting instead of here.

Decided to move conversation here, as (@matentzn correct me if I'm wrong) it sounds like your answers to questions (1) and (2) are 'yes' and 'yes'. Basically, there I can add a QC check and do some breakdown by source, and that should be done in a separate PR.

Some more follow-up questions:

  1. QC: Is there a standard way you have in mind of doing it in mondo-ingest? Maybe this should be a separate issue?
  2. I was thinking of adding this to #319, but I don't know of any failure conditions for this if I turned it into a test. So I suppose right now it's better as a report; part of some separate QC command. So I make an independent PR for this?
  3. Do you want me to keep the warning in the synchronization pipeline? Where the source has more than 200 direct SCRs in Mondo (change it to terms? I had been warning about edges) where at least 1 of the terms in the SCRs is not in mondo.sssom.tsv? It feels redundant to me to keep the warning in the synchronization pipeline if there's to be a standalone QC check.

its better to have them on Google sheets so I can click immediately;

I attached out_of_sync_mondo_ids.tsv.zip yesterday. IDK if you didn't see it, or are just saying you prefer a Google Sheets, or it is possible you noticed it was erroneous! I had it correct yesterday, then tweaked the logic and got bad results. I just fixed that now and re-uploaded. Here's also the Google Sheets version for you.

I see you are saying this difference between the files may not necessarily be problematic. So I just updated the title of the issue. FYI, OAK by default, when using the .entities() method to get list of terms, filters out obsoletes. But for this check, I included the obsoletes. I did see that there are some obsoletes in the mondo.owl/mondo.db and mondo.sssom.tsv, but I didn't inspect beyond that.

matentzn commented 10 months ago

I attached out_of_sync_mondo_ids.tsv.zip yesterday. IDK if you didn't see it, or are just saying you prefer a Google Sheets, or it is possible you noticed it was erroneous!

No I do a lot of my work on the phone and would like to be able to see things immediately rather than having to unzip etc to safe time.

I see you are saying this difference between the files may not necessarily be problematic. So I just updated the title of the issue. FYI, OAK by default, when using the .entities() method to get list of terms, filters out obsoleted.

There seems to be a real issue: I checked for example "MONDO:0029465" and it absolutely should be in both! No idea what could cause it to not be in one of the other. Is the mondo.owl that is at the bottom of mondo.db the latest release version?

  1. QC: Is there a standard way you have in mind of doing it in mondo-ingest? Maybe this should be a separate issue?

Nothing in mind yet!

  1. I was thinking of adding this to https://github.com/monarch-initiative/mondo-ingest/pull/319, but I don't know of any failure conditions for this if I turned it into a test. So I suppose right now it's better as a report; part of some separate QC command. So I make an independent PR for this?

I would first make sure we have an issue that clearly defines the QC check. Then a separate PR.

  1. Do you want me to keep the warning in the synchronization pipeline? Where the source has more than 200 direct SCRs in Mondo (change it to terms? I had been warning about edges) where at least 1 of the terms in the SCRs is not in mondo.sssom.tsv? It feels redundant to me to keep the warning in the synchronization pipeline if there's to be a standalone QC check.

TMD. I would like to ensure that the pipeline breaks if major issues occur, but I neither know what I mean by "major issues" exactly, nor how to recognise these; at least not without thinking. Lets develop some ideas and discuss at a call.

joeflack4 commented 10 months ago

Understood on Google Sheets now!

Ok, we'll discuss how to implement these checks at the meeting.

As for MONDO:0029465 I just discovered that my mondo.owl did not get updated as I had suspected! Sorry about that. I'm re-running and will update the _out_of_sync_mondoids Google Sheet with a new report shortly.

Edit: Good news! MONDO:0029465 no longer appears in the report. Also great news--now there are 0 entries that are "in SSSOM but not in mondo.owl". There are now just 3,499 entires that are "in mondo.owl but not in SSSOM".

matentzn commented 10 months ago

Awesome, I randomly checked 10, can you remove "obsolete classes" from this list as well? all 10 were obsolete.

joeflack4 commented 10 months ago

No problem! I intentionally included the obsoletes to be most thorough. But especially now since the problem is no longer bidirectional, I suppose there is no longer any need.

I just updated: _out_of_sync_mondoids Google Sheet

I also added a labels column.

If you want the obsoletes back, now that I have added the labels, let me know. I could also add an is_obsolete column and add sorting for that.

matentzn commented 10 months ago

Fantastic. In this case, I checked 10 randomly, and they all have no mappings associated with them, so everything seems correct! THANKS for drilling in!

joeflack4 commented 10 months ago

Very good! I updated the issue / project tracker so that this issue is for a QC check and not a bug report.