monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
6 stars 3 forks source link

Multiple different `mondo.owl` etc usage / refreshes per build #592

Open joeflack4 opened 3 months ago

joeflack4 commented 3 months ago

Overview

This is minor because this is not likely to happen very often, and when it does happen, I'm not sure how much of a major issue it would be. But it could create very confusing scenarios if it does happen.

Since #551, now whenever a goal is run which has requires mondo.owl or mondo.ssom.tsv (or possibly in the future any other output which requires the refresh of a cloned mondo repo to generate such a file as with these files), it checks if there has been any updates to main in the mondo repo. If so, it refreshes the repo, pulling down that most recent commit, and rebuilds these files. Between commits, it is possible that there will be variations in these outputs. Since these files mondo.owl and mondo.sssom.tsv are needed at multiple points in the pipeline, it's possible that a new commit could occur mid-build, such that the versions of these files used in some goals earlier in the build are different than later in the build.

Possible solutions

Perhaps one solution would be, at the start of build-mondo-ingest, to create a file (delete it first if it already exist) or an environmental variable that stores a boolean about whether or not the "refresh mondo repo" goal has been ran. So at first, we would set this to false, or maybe we just wouldn't set the variable / create the file. If the refresh mondo goal is run, we then create the file / set the variable to true. And, every time that the refresh mondo goal is run, it checks if this file / variable is true, and if so, it skips. I suppose an environmental variable is better, since we would have to set this to false / delete at the end of the build, and if there was an error during the build, this would be easier to do if an environmental variable (automatic) than a file, which would require fancy error handling that I'm not sure is possible.

Related

Other intelligent refresh related issues:

matentzn commented 3 months ago

If mondo ingest refresh is not phony, i.e. generates a file during the process, a make run will not try to run it multiple times. So once it checked once if there was something to updated, and a file is written, make will know its looked at the goal, and wont look again for the duration of the entire pipeline.

joeflack4 commented 3 months ago

I can't put much thought into this right now, but that's incorrect, I believe, based on the very idiosyncratic way that this new "intelligent refresh" works for these goals.

The dependencies are no longer in the 'prereqs' section of the goal. They're called in the body. A check happens, and if an update to mondo has been made, then these files get rebuilt. So it is theoretically possible for this to happen multiple times during a build if a commit occurs mid-build. That's what this issue is about.

I wouldn't worry about it too much; it's a low priority / edge case.

joeflack4 commented 1 month ago

@twhetzel Just noticing that some days main in mondo does indeed get updated often. IDK how much of a problem this can cause for us.

What the issue is: During the build, tmp/mondo/ and its artefacts can refresh multiple times.

Causes problems:

  1. Performance: This takes like 15min~ each time
  2. Inconsistency: Multiple versions of these files could be used by different parts of the pipeline
matentzn commented 1 month ago

I agree, I was bugged by this as well. This is the week the curators do a lot of work on Mondo, so basically we had 3-5 new commits each day. For testing this is annoying. I added a feature to simply skip refreshing during development, which does not affect the default behaviour:

https://github.com/monarch-initiative/mondo-ingest/commit/822becde7eb2af1d29a1836f1f18a9b2b4e4c84a