monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
Intelligently refresh `mondo.sssom.tsv` & `mondo.owl` #538

Closed joeflack4 closed 1 week ago

joeflack4 commented 1 month ago

Overview

In non-build situations, goals that use tmp/mondo.sssom.tsv will often use an outdated version. This is because it depends on tmp/mondo_repo_built, which is not triggered in the normal make way. Normally a make goal runs if its target hasn't been created yet, or if any of its prerequisites are newer than the target. In this case, one of the prerequisites is the mondo repo itself, and make has no way to see whether that repo has changed upstream.

During a fresh build, this will not be a problem. It'll only be a problem in development situations.

Possible solutions

Originally we were considering other solutions, but we chose to go with (b).

Both options involve at least this change: Move dependency into the goal's body. In this example, that would change it to:

reports/ordo-subsets.robot.template.tsv: tmp/ordo-subsets.tsv 
    $(MAKE) tmp/mondo.sssom.tsv
    python3 $(SCRIPTSDIR)/ordo_subsets.py \
    ...

a. ~Always run a full refresh of tmp/mondo.sssom.tsv~

Details

https://github.com/monarch-initiative/mondo-ingest/issues/529#issuecomment-2123315334

Always run a refresh each time a goal that depends on it is run. This involves nothing beyond the change I just described above. This could be a real damper on local development. Not always terribly so, though. When I'm developing a goal that uses Python, I usually do all of my iteration in the debugger, and then run the make goal once at the end just to make sure it works.

b. Intelligently refresh tmp/mondo.sssom.tsv when needed

This would be cool to do at some point. Basically, in addition to the change I described above, we'd have an if statement at the top of the goal for tmp/mondo.sssom.tsv or tmp/mondo_repo_built. It would compare the latest git commit hashes on master for the cloned mondo dir in tmp/ and the actual repo on GitHub. If they're the same, do nothing. Else, it will re-run tmp/mondo_repo_built --> tmp/mondo.sssom.tsv.
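
The check described in (b) could be sketched in shell roughly as follows. This is only an illustration under stated assumptions, not what was actually implemented: the function names (`local_head`, `remote_head`, `needs_refresh`) are hypothetical, and the clone directory, remote URL, and branch are passed in as arguments rather than taken from this repo's makefile. It also assumes that a missing clone or an unreachable remote should count as "refresh needed".

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical.

# Print the commit hash currently checked out in a local clone.
local_head() {
    git -C "$1" rev-parse HEAD
}

# Print the hash at the tip of a branch on a remote, without fetching.
remote_head() {
    git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# Exit status 0 means "refresh needed", non-zero means "up to date".
# Arguments: <clone_dir> <remote_url> <branch>
needs_refresh() {
    clone_dir=$1
    remote_url=$2
    branch=$3
    # No clone yet: a refresh (i.e. a first checkout) is needed.
    [ -d "$clone_dir/.git" ] || return 0
    remote_hash=$(remote_head "$remote_url" "$branch")
    # Assumption: if the remote hash cannot be determined, refresh anyway.
    [ -n "$remote_hash" ] || return 0
    local_hash=$(local_head "$clone_dir")
    [ "$local_hash" != "$remote_hash" ]
}
```

A goal body could then wrap the expensive steps in something like `if needs_refresh <clone_dir> <remote_url> master; then ...; fi`, so the rebuild of tmp/mondo_repo_built and tmp/mondo.sssom.tsv is skipped when the hashes match.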

Additional info

Original context: https://github.com/monarch-initiative/mondo-ingest/pull/531#discussion_r1609048542

joeflack4 commented 1 month ago

I'm in favor of (b).

matentzn commented 1 month ago

(B) is indeed better. It could be realised by creating a make goal "xyzhash" that writes the commit hash to a temporary file and, only if that temporary file differs from the existing "xyzhash" file, copies it over "xyzhash". That way the timestamp of "xyzhash" changes only when the hash itself changes.

The repo checkout goal simply depends on xyzhash and it's done! This pattern could be widely useful.
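
The write-if-changed step in that pattern might look something like the shell sketch below. The function name `update_if_changed` and the `.tmp` suffix are assumptions made for illustration. The status words it prints are only there to make the behaviour visible.

```shell
#!/bin/sh
# Sketch only: update_if_changed is a hypothetical helper name.

# Write <new_value> to <target>, but only touch <target> when the content
# actually differs, so its modification time stays put otherwise.
# Arguments: <target_file> <new_value>
update_if_changed() {
    target=$1
    new_value=$2
    tmp_file="$target.tmp"
    printf '%s\n' "$new_value" > "$tmp_file"
    if [ -f "$target" ] && cmp -s "$tmp_file" "$target"; then
        # Same content: discard the temporary file, leave target untouched.
        rm -f "$tmp_file"
        echo unchanged
    else
        # New or different content: replace the target.
        mv "$tmp_file" "$target"
        echo updated
    fi
}
```

The idea is that the "xyzhash" goal would run this on every invocation, passing the latest commit hash as the value. Because the file is only rewritten when the hash changes, any goal that lists it as a prerequisite should only be considered out of date after a new upstream commit.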