jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

[idea] batchspawner sprint #138

Closed by cmd-ntrf 5 years ago

cmd-ntrf commented 5 years ago

With the coming release of JupyterHub 1.0, I would suggest we organize a small code sprint during the next Jupyter for Science User Facilities and High Performance Computing Workshop. It is happening from June 11 to 13.

I am not sure who is participating, but I think @mbmilligan you are going? Would anyone else be interested in helping solve a few issues and pave the way to releasing batchspawner 1.0?

rcthomas commented 5 years ago

I was really hoping for this. Creating a list of topics for the hack sessions would be really useful.

cbjhnsn commented 5 years ago

I probably won't make the event but if this is something that happens I have some feedback I would like to give for adjustments.

rkdarst commented 5 years ago

+1. I guess we should start by making wishlist issues about the things we would like to do, and someone could tag them with sprint. Or if they are too small to warrant their own issues, add them here.

rcthomas commented 5 years ago

Here are some thoughts I've had. The first set relate more to batchspawner than the others...

Other things less related to batchspawner but still relevant to the sprint context...

mbmilligan commented 5 years ago

Thanks all for getting the conversation started on this! I think if we can get enough batchspawner-interested parties together for the sprint session, this would be a great use of the time. I will certainly be there.

My priorities would probably be (subject to change, in no particular order):

Additionally, some discussions that I'd like to have, that might go faster face-to-face than over Github threads:

For that matter, I'm happy to facilitate setting up some web meeting for Batchspawner folks ahead of the workshop, too, if there would be interest in having some of this discussion in advance of the meeting.

rkdarst commented 5 years ago

I think we could drop 3.4 support... that would enable supporting progress updates (unless someone more clever than me can fix it). Backwards compatibility is nice, but if it's that or ongoing development, I'd prefer ongoing development.

ProfileSpawner's functionality is also included in KubeSpawner - perhaps they could be unified? Low priority, though.

One high priority would be working Travis tests.

There are so many open pull requests that I don't know what to do. Many are related or directly depend on each other, and my local commit graph is getting too complicated to be worth looking at (many branches aren't PRs yet because of this). Some issues relate to ones in profile/wrap spawner, and sometimes the same thing is solved several times. I haven't had time to do much lately, but not having a direction is slowing things down too. I'll have time to do stuff this summer, if there's a path forward.

We should do something about the jupyterlab command (#141). Preferably, find a way to do this that doesn't require new commands, but I'm not expert enough in this.

Get the "select port on remote host" work sorted out a bit more. I think part of it can be sent upstream to jupyterhub. It is also related to the point above about a custom jupyterhub command: both of these require a new command, which creates a manual-maintenance situation. (I could be wrong here, going from memory...)
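
For readers unfamiliar with the port-selection problem, here is a minimal sketch of one common way to pick a free port on the host where the single-user server will actually run. It uses only the Python standard library; the helper name `pick_free_port` is hypothetical and is not claimed to be what batchspawner or jupyterhub does. How the chosen port gets reported back to the Hub is the harder part under discussion and is not shown.

```python
import socket


def pick_free_port(host=""):
    """Ask the OS for a currently unused TCP port on this host.

    Hypothetical helper for illustration only. Binding to port 0 lets the
    kernel choose a free port; we read it back and release the socket.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # This must run on the compute node, not on the Hub host, because the
    # port has to be free where the notebook server will listen.
    print(pick_free_port("127.0.0.1"))
```

Note that the port is only known to be free at the moment of the call; another process could take it before the server binds, so a real implementation would need to tolerate that race.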

Creating a jupyter on HPC guide would be great... I really should get around to writing up what I've done. It's pretty similar to what mbmilligan does from what I understand.

I can take part in a remote sprint for the rest of this week. Let me know what I should be working on.

rkdarst commented 5 years ago

Here's my classification of issues. If it's in the same bullet point, it's either duplicate issues or duplicate PRs that are all solved by mostly one solution. Feel free to edit this comment to organize better, I did this quickly.

High priority:

Medium:

Low:

Easy (don't have difficult interactions with other things, can be any category above. May need thought though.):

Support:

mbmilligan commented 5 years ago

Hello all,

While we have not found much time for hacking here at the JCW, we have had some very productive face-to-face discussions about the issues holding back Batchspawner. To summarize:

  1. Proposal: Having gotten no pushback from the people here, we would like to propose narrowing Batchspawner's backward-compatibility goals. This will allow us to simplify a number of complicated/redundant code paths, and dramatically simplify and speed up the CI testing process. Specifically we suggest:
    • Supported Python versions shall be limited to versions supported by the latest released version of Jupyterhub. At this time Jupyterhub 1.0 supports Python 3.5 and newer, so we will immediately stop testing against/avoiding features that would exclude Py3.4.
    • Supported Jupyterhub versions shall be limited to the newest released version and one earlier major version. At this time that would be Jupyterhub 1.0 and the 0.9 branch, so we will immediately stop testing against/maintaining support for Jupyterhub 0.8 and earlier.
  2. Priorities: We will focus effort on issuing a new release that
    • Resolves the issues around remote port selection, including compat issues with Wrapspawner
    • Adopts the wrapper model in #141
    • Supports Jupyterhub's new end-to-end SSL functionality (enabled by #141)
    • Supports Jupyterhub 1.0 (anecdotally reported to work now)
  3. Testing: Felix (@cmd-ntrf) demonstrated a method for spinning up a transient SLURM cluster with Jupyterhub/Batchspawner. We want to use this approach to implement all-up integration testing for Batchspawner and Batchspawner/Wrapspawner configurations. This will allow us to focus the CI testing on unit and component tests, while also detecting breakage that is currently hard to identify with CI testing alone.
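
As a rough illustration of the compatibility proposal in item 1, the policy could be expressed as a small check like the one below. This is only a sketch: the constants are a snapshot of the versions named above, and the names `MIN_PYTHON`, `SUPPORTED_HUB_SERIES`, `python_supported`, and `hub_supported` are hypothetical, not part of Batchspawner.

```python
import sys

# Snapshot of the proposed policy at the time of this discussion
# (assumption: hard-coded here purely for illustration).
MIN_PYTHON = (3, 5)                  # Jupyterhub 1.0 supports Python 3.5 and newer
SUPPORTED_HUB_SERIES = {(1, 0), (0, 9)}  # newest release plus the 0.9 branch


def parse_series(version):
    """Return the (major, minor) series of a dotted version string like '0.9.6'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def python_supported(version_info=sys.version_info):
    """True if the given interpreter version meets the proposed minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def hub_supported(version):
    """True if the given Jupyterhub version falls in a supported series."""
    return parse_series(version) in SUPPORTED_HUB_SERIES
```

Under these assumptions, `hub_supported("0.8.1")` and `python_supported((3, 4, 9))` are both false, which matches the intent of dropping Jupyterhub 0.8 and Py3.4.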

Other results from our discussions so far:

Please consider this an open call for comments on these plans and proposals. Thank you!

cbjhnsn commented 5 years ago

> For some schedulers (looking at you, schedMD) querying the job queue is not an especially cheap operation. Many active user sessions, each causing the Hub to issue a periodic poll(), cause problems for these systems. We should implement an optional schedule polling helper (probably running as a Hub service) that polls a queue and caches the result, and an easy way to retarget our poll() methods to hit the helper instead of the normal tool.

This would help us, and it is something that I've thought about trying to add. The current poll method is costly for us for two reasons: our scheduler (SGE) only updates every 30 seconds, so there is no reason to poll any faster, and, as mentioned above, polling is quite expensive and excessive polling has been known to cause performance issues for us.
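
To make the idea concrete, here is a minimal sketch of a caching layer in front of a queue query. It assumes a caller-supplied function that runs the scheduler's queue command once and returns a mapping of job id to state; the class name `CachedQueuePoller`, the `query_queue` callable, and the 30-second default (taken from the SGE update interval mentioned above) are all hypothetical and not an existing Batchspawner API.

```python
import time


class CachedQueuePoller:
    """Serve job states from a cached queue snapshot (illustrative sketch).

    query_queue: hypothetical callable that queries the scheduler once and
    returns {job_id: state} for all jobs of interest.
    ttl: seconds a snapshot stays valid before the scheduler is asked again.
    clock: injectable time source, which makes the cache easy to test.
    """

    def __init__(self, query_queue, ttl=30.0, clock=time.monotonic):
        self._query_queue = query_queue
        self._ttl = ttl
        self._clock = clock
        self._snapshot = {}
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # One scheduler query serves every spawner until the TTL expires.
            self._snapshot = dict(self._query_queue())
            self._fetched_at = now

    def job_state(self, job_id):
        """Return the cached state for job_id, or None if it is not listed."""
        self._refresh_if_stale()
        return self._snapshot.get(job_id)
```

With something like this shared between spawners, the number of scheduler queries depends on the TTL rather than on the number of active user sessions. Retargeting each spawner's poll() at such a helper, and running it as a Hub service, is left open here.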

rkdarst commented 5 years ago

Agree on reduction of backwards compatibility - if it's easy for future releases, we can keep 3.5/0.9 for longer in the future.

Progress bar - agreed it may be hard to use, but is important to have that possibility. Likely it will be used for more informational messages.

I'll see what I can do on a merge strategy...

mbmilligan commented 5 years ago

Some further notes:

rkdarst commented 5 years ago

I was wondering, did you talk about maintainers and how we can ensure maintenance keeps moving, both on strategic questions about what's good and on dealing with PRs?

rcthomas commented 5 years ago

A little, but we mainly agreed to start a monthly zoom check-in around "batchspawner and friends" to help address those kinds of questions. I'll be in touch with everyone on this about setting that up. It may not need to be a perpetual meeting but we could run it as long as we felt was necessary.

rkdarst commented 5 years ago

Monthly meetings would be good, hopefully they would help things go faster. But does that mean there will be only once a month merge times? If we all commented on issues and PRs a bit more, we could figure out what to do. I am going through old PRs and most can be merged now with a little bit of work, once we have a master to test against. It's always better for stuff to be integrated fast so that we can test interactions, I'd rather not have to keep separate integration branches around for personal testing. I'm happy to do whatever is needed, just let me know.

rcthomas commented 5 years ago

I don't think that implies once a month merge times. Unless it was decided that was a good idea. A monthly call could entail a check-in on any outstanding PRs and issues and that could help move things along.

rkdarst commented 5 years ago

Yeah, I misstated in my comment, I didn't mean to imply that I thought monthly check-ups were opposed to continuous work. It would have been better to say: "is there any plan to try to get stuff reviewed and accepted faster outside of those meetings?". I've gone through most outstanding PRs and the vast majority can be fixed and merged or closed, but we need some of the basic ones handled first. The other options don't seem so good...

mbmilligan commented 5 years ago

So far as we discussed, I will continue to be responsible for merges to the official repository. Between our decisions about reduced requirements for back compatibility, and the regular opportunity to discuss strategy and blocking issues, the hope is that reviewing PRs will be considerably less arduous going forward.

More concretely, once I am back in the office, I will be dedicating significant time this week to using the guidance from the workshop to work through our backlog. After the backlog is dealt with, I hope we can settle into a roughly weekly merge cadence.

rkdarst commented 5 years ago

That would be great - let me know what I can do. For my part I'll try to comment on things more and do initial reviews. I have a large and a mega PR that integrate most of the other important PRs, which at least works for me in my test system. Not everything is perfect, but at this point it's so complex that I'd rather it stabilize, then check for integration problems.

For several of the PRs, I have some relevant integration/forward porting in the comments. For others, the only integration is in my integration branch, but if #143 is accepted then it's possible to work on those individually.

Let me know what else I can do to help.