[idea] batchspawner sprint

cmd-ntrf commented 5 years ago

With the coming release of JupyterHub 1.0, I would suggest we organize a small code sprint during the next Jupyter for Science User Facilities and High Performance Computing Workshop. It is happening from June 11 to 13.

I am not sure who is participating, but I think @mbmilligan you are going? Anyone else would also be interested in helping solving a few issues and pave the way to release batchspawner 1.0?

rcthomas commented 5 years ago

I was really hoping for this. Creating a list of topics for the hack sessions would be really useful.

cbjhnsn commented 5 years ago

I probably won't make the event but if this is something that happens I have some feedback I would like to give for adjustments.

rkdarst commented 5 years ago

+1. I guess we should start by making wishlist issues about the things we would like to do, and someone could tag them with sprint. Or if they are too small to warrant own issues, add them here.

rcthomas commented 5 years ago

Here are some thoughts I've had. The first set relate more to batchspawner than the others...

JupyterHub 1 supports end-to-end SSL to the notebook. For this to work, .create_certs() and .move_certs() need to be implemented on custom spawners. It might be interesting to look at whether there's a common solution for these for all of us batchspawner users. Here is the merged PR from @tgmachina (who will be with us).
Those of us using batchspawner should probably put some time into documenting how we've done our setups, and the interaction with wrapspawner especially.

Other things less related to batchspawner but still relevant to the sprint context...

At our Lab we have a custom solution that will enable users (currently Lab extension developers) to select the jupyter-labhub they want to run at launch time. It's a little wrapper script that we tell them to put at a magic place with the right command to run, and if that script is executable, it will run that instead of whatever the Hub is configured to do. @cmd-ntrf's customization with the app wrapper for sending back the port info makes me think that kind of thing would be better. The main thing this would get us would be allowing users to manage extensions themselves.
The question of monitoring and audit trails has come up at least at our Lab. Some of this we can handle in our infrastructure, but some investigation of how to do more thorough logging of the notebook and terminal might be a good use of time. An example PR trying to address some of this at least is here..

mbmilligan commented 5 years ago

Thanks all for getting the conversation started on this! I think if we can get enough batchspawner-interested parties together for the sprint session, this would be a great use of the time. I will certainly be there.

My priorities would probably be (subject to change, in no particular order):

Ensure Batchspawner and Wrapspawner are compatible with JupyterHub 1 (without breaking too much backward compat) and each other
Ensure both are compatible with the new SSL features
Update PyPI packaging for both
Update testing for Batchspawner, and implement testing for Wrapspawner

Additionally, some discussions that I'd like to have, that might go faster face-to-face than over Github threads:

What should our backwards-compat cutoffs actually be, based on the needs of actual users?
Are there changes to the Github workflow or testing infrastructure that would accelerate the incorporation of changes?
Is there significant technical debt that needs addressing?

For that matter, I'm happy to facilitate setting up some web meeting for Batchspawner folks ahead of the workshop, too, if there would be interest in having some of this discussion in advance of the meeting.

rkdarst commented 5 years ago

I think we could drop 3.4 support... that would enable supporting progress updates (unless someone more clever than me can fix it). Backwards compatibility is nice, but if it's that or ongoing development, I'd prefer ongoing development.

ProfileSpawner's functionality is also included in KubeSpawner - perhaps they could be unified? Low priority, though.

One high priority would be working Travis tests.

There are so many open pull requests that I don't know what to do... many are related or directly depend on each other, my local commit graph is getting too complicated to be worth looking at (many branches aren't PRs yet because of this). Some issues relate to ones in profile/wrap spawner, sometimes the same thing solved several times. I haven't had time to do much lately, but not having a direction is slowing things down woo. I'll have time to do stuff this summer, if there's a path forward.

We should do something about the jupyterlab command (#141). Preferably, find a way to do this that doesn't require new commands, but I'm not expert enough in this.

Get the "select port on remote host" worked out a bit more. I think part can be sent to jupyterhub. It also related sto the above about custom jupyterhub command - both of these require a new command which makes a manual-maintnance situation. (I could be wrong here, going from memory...)

Creating a jupyter on HPC guide would be great... I really should get around to writing up what I've done. It's pretty similar to what mbmilligan does from what I understand.

I can take part in a remote sprint for the rest of this week. Let me know what I should be working on.

rkdarst commented 5 years ago

Here's my classification of issues. If it's in the same bullet point, it's either duplicate issues or duplicate PRs that are all solved by mostly one solution. Feel free to edit this comment to organize better, I did this quickly.

High priority:

141 (PR, batchspawner-singleuser and jupyterlab), #133 (PR, same thing different implementation)
selecting port remotely: #139 (PR), jupyterhub/wrapspawner#27 (PR, same), jupyterhub/wrapspawner#25(PR, same), #127 (issue), #132 (PR), #130 (docs), #129 (bug report, probably related), #126 (issue), #114 (PR, other implementation), #102 (issue). Most are follow-up to #58 (PR, merged).

Medium:

Progess method and reporting: #121 (issue), #120 (issue, UI warning when not enough resources), #90 (PR), #86 (PR), #81 (issue), jupyterhub/wrapspawner#22(PR), jupyterhub/wrapspawner#22(PR, same). We need someone smarter than me, or no python 3.4 anymore.

Low:

119 (refactor environment handling, discuss first, I may need to update)
82 (security of environment)

Easy (don't have difficult interactions with other things, can be any category above. May need thought though.):

135 and #136 (OAR spawner)
140 (readme typos)
137 (grid engine env vars)

Support:

123
118 (LSF)
117 (can be closed I think)
98 (can be closed)
35 (grid engine paraller environments)

mbmilligan commented 5 years ago

Hello all,

While we have not found much time for hacking here at the JCW, we have had some very productive face-to-face discussions about the issues holding back Batchspawner. To summarize:

Proposal: Having gotten no pushback from the people here, we would like to propose narrowing Batchspawner's backward-compatibility goals. This will allow us to simplify a number of complicated/redundant code paths, and dramatically simplify and speed up the CI testing process. Specifically we suggest:
- Supported Python versions shall be limited to versions supported by the latest released version of Jupyterhub. At this time Jupyterhub 1.0 supports Python 3.5 and newer, so we will immediately stop testing against/avoiding features that would exclude Py3.4.
- Supported Jupyterhub versions shall be limited to the newest released version and one earlier major version. At this time that would be Jupyterhub 1.0 and the 0.9 branch, so we will immediately stop testing against/maintaining support for Jupyterhub 0.8 and earlier.
Priorities: We will focus effort on issuing a new release that
- Resolves the issues around remote port selection, including compat issues with Wrapspawner
- Adopts the wrapper model in #141
- Supports Jupyterhub's new end-to-end SSL functionality (enabled by #141)
- Supports Jupyterhub 1.0 (anecdotally reported to work now)
Testing: Felix (@cmd-ntrf) demonstrated a method for spinning up a transient SLURM cluster with Jupyterhub/Batchspawner. We want to use this approach to implement all-up integration testing for Batchspawner and Batchspawner/Wrapspawner configurations. This will allow us to focus the CI testing on unit and component tests, while also detecting breakage that is currently hard to identify with CI testing alone.

Other results from our discussions so far:

While dropping Py3.4 support will allow us to implement the progress bar feature, nobody here was enthusiastic about using it. The underlying issue is that batch systems do not give enough information to usefully estimate when a job will start. However, there is considerable interest in bubbling informational messages up to the user (e.g. job information, notice that resource constraints are wrong, etc). This works now in the case of fatal errors (exception messages are printed) but informational messages need work.
Remote port selection via API call is a natural feature to add to the Hub. We want to prepare a PR for Jupyterhub proposing this.
More generally, the Jupyterhub extension API should provide a way for spawners to add request handlers to exchange information with spawned tasks. We should open a discussion with the Jupyterhub core developers about this.
For some schedulers (looking at you, schedMD) querying the job queue is not an especially cheap operation. Many active user sessions, each causing the Hub to issue a periodic poll(), cause problems for these systems. We should implement an optional schedule polling helper (probably running as a Hub service) that polls a queue and caches the result, and an easy way to retarget our poll() methods to hit the helper instead of the normal tool.
Relatedly, we should investigate if it is possible for a spawner instance to reduce the polling frequency once a job has started. If so, we should provide that as an option and enable it by default.

Please consider this an open call for comments on these plans and proposals. Thank you!

cbjhnsn commented 5 years ago

For some schedulers (looking at you, schedMD) querying the job queue is not an especially cheap operation. Many active user sessions, each causing the Hub to issue a periodic poll(), cause problems for these systems. We should implement an optional schedule polling helper (probably running as a Hub service) that polls a queue and caches the result, and an easy way to retarget our poll() methods to hit the helper instead of the normal tool. >

This would help us and something that I've thought about trying to add. For two reasons the current poll method is costly for us our scheduler (SGE) only updates every 30 seconds so there is no reason to poll any faster and as mentioned above its quite expensive and excessive polling has been known to cause performance issues for us.

rkdarst commented 5 years ago

Agree on reduction of backwards compatibility - if it's easy for future releases, we can keep 3.5/0.9 for longer in the future.

Progress bar - agreed it may be hard to use, but is important to have that possibility. Likely it will be used for more informational messages.

I'll see what I can do on a merge strategy...

mbmilligan commented 5 years ago

Some further notes:

@yuvipanda is very supportive of adding a port-setting API to core Jupyterhub, and we are requested to submit an issue proposing that. I will write that up shortly. @minrk is reportedly averse to giving spawners a general-purpose mechanism to add extension API request handlers, because chaos will ensue. So if we want to do that, we'll have to keep our hacky api import approach.
We should take a look at Kubespawner, which already has a mechanism for reflecting poll() calls to a cacheing layer.

rkdarst commented 5 years ago

I was wondering, did you talk about maintainers and how we can get ensure maintenance keeps moving, both strategic questions on what's good and dealing with PRs?

rcthomas commented 5 years ago

A little, but we mainly agreed to start a monthly zoom check-in around "batchspawner and friends" to help address those kinds of questions. I'll be in touch with everyone on this about setting that up. It may not need to be a perpetual meeting but we could run it as long as we felt was necessary.

rkdarst commented 5 years ago

Monthly meetings would be good, hopefully they would help things go faster. But does that mean there will be only once a month merge times? If we all commented on issues and PRs a bit more, we could figure out what to do. I am going through old PRs and most can be merged now with a little bit of work, once we have a master to test against. It's always better for stuff to be integrated fast so that we can test interactions, I'd rather not have to keep separate integration branches around for personal testing. I'm happy to do whatever is needed, just let me know.

rcthomas commented 5 years ago

I don't think that implies once a month merge times. Unless it was decided that was a good idea. A monthly call could entail a check-in on any outstanding PRs and issues and that could help move things along.

rkdarst commented 5 years ago

Yeah, I misstated in my comment, I didn't mean to imply that I thought monthly check-ups was opposed to continuous work. It would have been better to say: "is there any plan to try to get stuff reviewed and accepted faster outside of those meetings?". I've gone through most outstanding PRs and the vast majority can be fixend and merged or closed, but we need some of the basic ones handled first. The other options don't seem so good...

mbmilligan commented 5 years ago

So far as we discussed, I will continue to be responsible for merges to the official repository. Between our decisions about reduced requirements for back compatibility, and the regular opportunity to discuss strategy and blocking issues, the hope is that reviewing PRs will be considerably less arduous going forward.

More concretely, once I am back in the office, I will be dedicating significant time this week to using the guidance from the workshop to work through our backlog. After the backlog is dealt with, I hope we can settle into a roughly weekly merge cadence.

rkdarst commented 5 years ago

That would be great - let me know what I can do. For my part I'll try to comment on things more and do initial reviews. I have a large and a mega PR that integrate most of the other important PRs, which at least works for me in my test system. Not everything is perfect, but at this point it's so complex that I'd rather it stabilize, then check for integration problems.

For several of the PRs, I have some relevant integration/forward porting in the comments. For others, the only integration is in my integration branch, but if #143 is accepted then it's possible to work on those individually.

Let me know what else I can do to help.

jupyterhub / batchspawner

[idea] batchspawner sprint #138

141 (PR, batchspawner-singleuser and jupyterlab), #133 (PR, same thing different implementation)

119 (refactor environment handling, discuss first, I may need to update)

82 (security of environment)

135 and #136 (OAR spawner)

140 (readme typos)

137 (grid engine env vars)

123

118 (LSF)

117 (can be closed I think)

98 (can be closed)

35 (grid engine paraller environments)