[wishlist] NeuroDebian use case and inspired features: "local" (to folder) qme, "local" qme as a job for "larger" qme, etc

yarikoptic commented 4 years ago

Description

NB: I know I am far behind on testing out what is already working and providing feedback, but I wanted to write this down before I forget.

For building NeuroDebian packages we have a pretty simple setup: a hierarchy of <package>/<version> directories, and a simple script, https://github.com/neurodebian/neurodebian/blob/master/tools/nd_build4allnd, that loops through the releases and builds the package for each. Each build produces a .build file with its log, plus a summary.build file like this:

datalad_0.13.0-1~nd80+1_amd64.build     FAILED  0:53.24 real, 39.56 user, 17.70 sys, 119536 out
datalad_0.13.0-1~nd90+1_amd64.build     FAILED  0:48.59 real, 36.15 user, 16.44 sys, 135048 out OLD
datalad_0.13.0-1~nd100+1_amd64.build    OK      46:57.42 real, 1741.84 user, 1268.67 sys, 3287816 out
datalad_0.13.0-1~nd110+1_amd64.build    OK      37:25.13 real, 1266.31 user, 1009.24 sys, 3474200 out
datalad_0.13.0-1~nd+1_amd64.build       OK      37:45.63 real, 1285.95 user, 1018.75 sys, 3570312 out
datalad_0.13.0-1~nd14.04+1_amd64.build  FAILED  1:16.53 real, 56.32 user, 26.00 sys, 122384 out
datalad_0.13.0-1~nd16.04+1_amd64.build  FAILED  1:13.09 real, 54.43 user, 24.59 sys, 134224 out
datalad_0.13.0-1~nd18.04+1_amd64.build  FAILED  48:14.25 real, 1797.37 user, 1294.49 sys, 3222536 out
datalad_0.13.0-1~nd20.04+1_amd64.build  FAILED  1:11.46 real, 53.10 user, 24.53 sys, 165680 out
datalad_0.13.0-1~nd90+1_amd64.build     FAILED  0:46.08 real, 34.83 user, 15.18 sys, 135112 out

It shows which builds were OK or FAILED and how long each one took.
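To make the summary.build layout concrete, here is a minimal Python sketch that parses lines like the ones above into records. The regular expression, the `BuildRecord` fields, and treating the trailing OLD marker as a simple flag (it seems to mark a superseded earlier run) are assumptions inferred from the excerpt, not part of nd_build4allnd.

```python
import re
from dataclasses import dataclass

# Assumed layout, inferred from the summary.build excerpt:
# <name>.build  OK|FAILED  <elapsed> real, <u> user, <s> sys, <n> out [OLD]
LINE_RE = re.compile(
    r"^(?P<name>\S+\.build)\s+(?P<status>OK|FAILED)\s+"
    r"(?P<real>[\d:.]+) real, (?P<user>[\d.]+) user, "
    r"(?P<sys>[\d.]+) sys, (?P<out>\d+) out(?P<old>\s+OLD)?\s*$"
)


@dataclass
class BuildRecord:
    name: str
    status: str          # "OK" or "FAILED"
    real_seconds: float  # wall-clock time
    user_seconds: float
    sys_seconds: float
    out: int             # the "out" counter, kept as-is
    old: bool            # True if the line carries the OLD marker


def elapsed_to_seconds(text: str) -> float:
    """Convert an elapsed time such as '46:57.42' or '1:02:03.5' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_summary(lines):
    """Parse summary.build lines, skipping blank or unrecognized ones."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        records.append(BuildRecord(
            name=match["name"],
            status=match["status"],
            real_seconds=elapsed_to_seconds(match["real"]),
            user_seconds=float(match["user"]),
            sys_seconds=float(match["sys"]),
            out=int(match["out"]),
            old=match["old"] is not None,
        ))
    return records
```

Passing an open summary.build file object to `parse_summary` works as well, since it only iterates over lines.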

I typically use screen to "submit" builds, so I can come back later to see what I have tried to build and what has succeeded or failed.

What I would love to be able to do

For a new package build, in a new <package>/<version> directory (probably just at the top of that nd_build4all script), I would run something like qme start --local --register=above, which would:
- start qme for that directory (not necessarily the web UI, but at least the queue manager, if that is separate)
- "register" it with the qme it finds in some "above" directory (either the one running at the user level, or a dedicated one running in the directory that hosts the <package> directories)
Then, instead of nd_build $dfamily $drelease $bpdsc "$@" || :, the script would do qme run nd_build $dfamily $drelease $bpdsc "$@". Relevant: #30 (specific background for shell), closed in favor of the more general #2 (async).
- Ideally, the local qme (or the global one it is registered with) would be configured to use a specific executor (e.g. condor, #23), so the script works the same regardless of the executor.
Then I would switch to whatever else I need to do, and later come back to that server and be able to:
- "globally": run qme ls to see the overall status across all packages I had built: what is still running, what finished all OK, what finished partially OK (some builds succeeded, some failed), and what failed entirely
  - if in the web UI, clicking on a "job" should lead to that "local" qme's dashboard with the individual package builds (assuming its web UI is running; or maybe this web UI instance could just switch "context" to navigate the qme state stored "locally")
  - qme archive (via the CLI or some action in the web UI) the "jobs" (local qmes) that I consider "done" (maybe qme clear --archive?)
- "locally" (per package):
  - use qme ls or the web UI to see which particular backports failed, and access their build logs in the web UI for review. This is very similar to the summary file dump above, except that the qme ls status (ref #41) could also say "RUNNING" or "PENDING" (e.g. if PBS is used for the actual job execution). For execution time, ref #29 (closed without a fix).
  - be able to perform additional "actions" on each of those package build jobs:
    - qme rerun ("Rerun" action in the web UI) to rerun the build (e.g. if I updated the base environment and expect it to succeed now)
    - qme rerun [MORE OPTIONS] or cli rerun --schema debug, where [MORE OPTIONS] would be --hookdir=/home/neurodebian/neurodebian/tools/hooks, or the "debug" schema would have that setting preconfigured, so I do not even need to remember which options I need to debug that build
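The "global" statuses wished for above (still running, all OK, partially OK, failed entirely) boil down to a rollup over each package's build statuses. A minimal sketch, assuming the status strings OK/FAILED from summary.build plus RUNNING/PENDING from the per-package view; `rollup` and `global_view` are hypothetical helpers, not existing qme API.

```python
# Assumed status vocabulary: OK/FAILED as in summary.build, plus the
# RUNNING/PENDING states wished for in the per-package qme ls view.
ACTIVE = {"RUNNING", "PENDING"}


def rollup(statuses):
    """Collapse one package's build statuses into a single group state."""
    statuses = list(statuses)
    if not statuses:
        return "EMPTY"
    if any(s in ACTIVE for s in statuses):
        return "RUNNING"        # at least one build is still queued or running
    if all(s == "OK" for s in statuses):
        return "ALL OK"
    if all(s == "FAILED" for s in statuses):
        return "FAILED"
    return "PARTIALLY OK"       # finished, with a mix of outcomes


def global_view(groups):
    """Map each package/version group to its rolled-up state."""
    return {name: rollup(statuses) for name, statuses in groups.items()}
```

In this sketch a group that still has queued or running builds reports RUNNING even if some of its builds already failed, since its final outcome is not known yet.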

vsoch commented 4 years ago

I'm not clear why you would need to do qme start before running any commands. You would just start with a build:

qme run nd_build $dfamily $drelease $bpdsc "$@"

Then there would be an nd_build parser that parses those variables and can provide actions to run. What we would want to add is a flag to run an action for all runs in some parser subset, plus the archive command. I'm not totally clear on the other flags you exemplified, but they seem kind of complicated. I think a helpful start would be to show the complete build/query cycle you would go through for one specific package, including the commands and their output. I could then create a simple parser for that, and we can test/discuss adding --archive and --rerun as group options. I'm also thinking that, instead of one generic central dashboard that is hard to customize per executor, what we need is an executor-specific one that exposes actions to run on an entire group, archive/clear, etc.
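An nd_build parser could start by splitting the command into its three positional variables. The sketch below is hypothetical: `NdBuildCommand` and `parse_nd_build` are not qme's parser interface, and deriving the package name assumes `$bpdsc` is a Debian source control file named `<package>_<version>.dsc`.

```python
import shlex
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class NdBuildCommand:
    dfamily: str
    drelease: str
    bpdsc: str
    extra: list = field(default_factory=list)  # anything passed via "$@"

    @property
    def package(self) -> str:
        # Assumption: bpdsc is a source package file named <package>_<version>.dsc
        return PurePath(self.bpdsc).name.split("_", 1)[0]


def parse_nd_build(command: str):
    """Return an NdBuildCommand if `command` runs nd_build, else None."""
    args = shlex.split(command)
    if len(args) < 4 or PurePath(args[0]).name != "nd_build":
        return None
    dfamily, drelease, bpdsc, *extra = args[1:]
    return NdBuildCommand(dfamily, drelease, bpdsc, extra)
```

Grouping parsed runs by `package` (and then by `drelease`) would give the per-package subsets that a group-wide action such as rerun or archive could target.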

yarikoptic commented 4 years ago

> I'm not clear why you would need to do qme start before running any commands.

It might not be needed; it was primarily meant to trigger a local qme instance for that folder, and possibly to prepare for async submission of those jobs. But maybe even that is not needed if we introduce some grouping attribute (in my case, a folder path like Debian/<package>/<version>) so those jobs could be grouped in the common dashboard. I kinda like that idea now; I will elaborate on it later.
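One reading of the proposed grouping attribute: derive a key from each job's working directory relative to some root, keep the Debian/<package>/<version> part, and bucket jobs by it. A minimal sketch; `group_key`, `group_jobs`, the default `depth` of 3, and representing jobs as (job_id, workdir) pairs are illustrative assumptions, not existing qme features.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_key(workdir: str, root: str, depth: int = 3) -> str:
    """Hypothetical grouping attribute: the first `depth` path components of
    workdir below root, e.g. Debian/<package>/<version>.
    Raises ValueError if workdir is not inside root."""
    rel = PurePosixPath(workdir).relative_to(root)
    return "/".join(rel.parts[:depth])


def group_jobs(jobs, root: str, depth: int = 3):
    """Bucket (job_id, workdir) pairs by their folder-derived group key."""
    groups = defaultdict(list)
    for job_id, workdir in jobs:
        groups[group_key(workdir, root, depth)].append(job_id)
    return dict(groups)
```

Because the key is truncated to `depth` components, a job started in a subdirectory of <package>/<version> still lands in the same group as the rest of that version's builds.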

vsoch commented 4 years ago

That's an interesting idea! There isn't yet any established concept of a "qme instance", but it might be warranted for some executor to launch one and then actively monitor some scoped thing. Looking forward to hearing your design/implementation for this!

vsoch / qme

[wishlist] NeuroDebian use case and inspired features: "local" (to folder) qme, "local" qme as a job for "larger" qme, etc #42
