jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Rethinking environment handling #160

Open rkdarst opened 5 years ago

rkdarst commented 5 years ago

I think we need to rethink environment handling:

Currently there is req_keepvars, which sets {{keepvars}} in the batch script to a comma-separated list of variable names, but it does not add those variables to the environment itself. The environment comes from .get_env(), which is inherited from JupyterHub, and JupyterHub supplies only what the single-user server needs.
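For reference, the two mechanisms look roughly like this today (a sketch; req_keepvars and {{keepvars}} are batchspawner's names, the variable values are made up):

# jupyterhub_config.py -- sketch of the current behaviour (example values)
c = get_config()  # provided by JupyterHub's config loader
c.SlurmSpawner.req_keepvars = 'MY_SITE_TOKEN,HTTP_PROXY'
# {{keepvars}} in the batch script template expands to that comma-separated
# list; a Slurm template might consume it as:
#   #SBATCH --export={{keepvars}}
# But req_keepvars adds nothing to the dict returned by Spawner.get_env(),
# which holds only what JupyterHub itself needs (JUPYTERHUB_API_TOKEN etc.).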

Problems:

I am working on a rewrite (started from #119 long ago) to improve things, but it has become clear that we need a new mental model of how the variables even work.

Here is what I propose:

What do you think?

mbmilligan commented 5 years ago

Right. I think we ended up getting these mental models muddled together when adding req_keepvars. What it actually does is very limited: for batch systems that support this, it names environment variables that should be propagated from the submission environment into the execution environment. Without additional code, that's only useful for getting the JHub API key into the server session. But there seem to be various people hacking in similar things to pass through other kinds of access tokens.
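The kind of hack being referred to usually looks something like this (a hypothetical illustration of the pattern, not code from any particular deployment): override get_env() to copy an extra token from the hub's environment into the server's.

import os
from batchspawner import SlurmSpawner

class TokenPassingSpawner(SlurmSpawner):
    def get_env(self):
        # Pass an extra site-specific token (hypothetical name) through to
        # the single-user server alongside what JupyterHub already sets.
        env = super().get_env()
        env['MY_SITE_TOKEN'] = os.environ.get('MY_SITE_TOKEN', '')
        return env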

I think the mental model you outlined makes sense, but can we get some contributors to throw in concrete use cases so we're not just talking abstractly? @cmd-ntrf ? @zonca ? @jupyterhub/designers ?

zonca commented 5 years ago

I think this is a good idea. I always had trouble getting the right environment into Comet; the last time I deployed, I worked around this by explicitly propagating the environment via ssh, see https://gist.github.com/zonca/55f7949983e56088186e99db53548ded#file-spawner-py-L55 (from https://zonca.github.io/2018/11/jupyterhub-supercomputer.html)
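The shape of that workaround is roughly the following (a hypothetical sketch, not the exact code from the gist): build the remote command so that it re-exports the spawner's environment on the far side of the ssh hop.

import shlex

def ssh_with_env(host, env, remote_cmd):
    # env is the dict from Spawner.get_env(); quote the values so they
    # survive the remote shell's parsing of the single ssh command string.
    exports = ' '.join(f'{k}={shlex.quote(v)}' for k, v in env.items())
    return ['ssh', host, f'export {exports}; {remote_cmd}']

# e.g. ssh_with_env('login-node', spawner.get_env(), 'jupyterhub-singleuser')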

rkdarst commented 5 years ago

I implemented this in the resurrected PR #119.

It's a bit annoying that this changes semantics yet again, but I think it's better to get it done before the release.

Next up: we could add all_env, the raw environment dictionary, to the subvars. Then one could do something like the following in the batch script (but we should not recommend this; storing the API token in the batch script is not secure on some batch systems!):

{% for key, value in all_env.items() %}
export {{key}}="{{value}}"
{% endfor %}
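On the Python side, exposing the dict would be a small change, roughly like this (a sketch of the idea, assuming batchspawner's get_req_subvars() hook; hypothetical subclass, not committed code):

from batchspawner import SlurmSpawner

class AllEnvSpawner(SlurmSpawner):
    def get_req_subvars(self):
        # Make the raw environment dict available to the template as all_env
        subvars = super().get_req_subvars()
        subvars['all_env'] = self.get_env()
        return subvars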

... maybe for security this just shouldn't be added, since it will only lead to people doing something like the Jinja loop above, or to similar mistakes such as leaking tokens via /proc/*/cmdline...