The main way to pass information to kernels in Jupyter is to define custom command line arguments or environment variables in the kernel spec file. For example, SPARK_OPTS, SPARK_HOME, etc. can be set in a kernel.json for Apache Toree or IPython.
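For illustration, a minimal kernel.json sketch along these lines (the paths and option values here are placeholders, not taken from this thread):

{
  "display_name": "Apache Toree - Scala",
  "language": "scala",
  "argv": ["/path/to/toree/bin/run.sh", "--profile", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/usr/local/spark",
    "SPARK_OPTS": "--master local[2] --conf spark.executor.memory=1g"
  }
}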
Does this fit your use case?
@parente no that does not help my case. Here I am looking to set per-kernel-instance, user-specified properties like a Kafka port, credentials, etc.
It's also possible to include environment variables in POST /api/kernels, which will be set in the environment of the kernel when it spawns.
https://github.com/jupyter/kernel_gateway/issues/128 https://github.com/jupyter/kernel_gateway/blob/master/kernel_gateway/jupyter_websocket/swagger.yaml#L403
At present, only env vars prefixed with KG_ are whitelisted to avoid malicious requests that set PATH or other key env vars. With the support as implemented, a client can send an arbitrary env var and then execute arbitrary code on the kernel that reads those env vars. The code executed on the kernel could read KG_SPARK_OPTS, KG_KAFKA_ENDPOINT, etc. from the environment and use their values to properly build, for example, a SparkContext.
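As a rough sketch of that kernel-side pattern (assuming pyspark is importable in the kernel; KG_KAFKA_ENDPOINT and the spark.myapp.kafka.endpoint property are placeholder names, not an agreed convention):

import os
from pyspark import SparkConf, SparkContext

# Read a whitelisted env var injected at kernel start and apply it to the Spark config.
conf = SparkConf()
kafka_endpoint = os.environ.get('KG_KAFKA_ENDPOINT')
if kafka_endpoint:
    conf.set('spark.myapp.kafka.endpoint', kafka_endpoint)  # placeholder property name
sc = SparkContext(conf=conf)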
Here I am looking to set per-kernel-instance, user-specified properties like a Kafka port, credentials, etc.
So the parameters will vary per kernel instance, but they're all for a single user, correct? (Kernel gateway is fundamentally still a single-user service which needs to be scaled to support multiple users.)
@parente for instance, the default spark-shell (Scala) and the pyspark shell allow passing in '--conf PROP=VALUE'. These are what I want to set:
bin/pyspark --help
...
--conf PROP=VALUE    Arbitrary Spark configuration property.
Similarly, Toree allows setting SPARK_OPTS, which can contain the '--conf'. For the Toree case, I am not sure if you are saying to hit the 'KernelSpec' endpoint first to set that?
Sorry for the confusion. I'm suggesting ways that kernel gateway supports passing information to kernels at startup time, to see if any of them can be used to solve your case. Adding a Spark-specific config option is what I'm trying to avoid, since KG is agnostic about what libs you're going to use and run in your kernel.
With the existing env support when making a POST /api/kernels request, you can pass environment variables that will be set in the environment of a kernel. You can conceivably send code to the kernel to read these environment variables and do whatever you need to with their values (e.g., programmatically configure a SparkContext).
You can conceivably send code to the kernel to read these environment variables and do whatever you need to with their values (e.g., programmatically configure a SparkContext). <<
There is no option to inject any code before the SparkContext is initialized. Also, why would the end user go through such hoops?
The IPython and R kernels do not initialize a SparkContext on their own. What code is initializing the SparkContext in these cases? Can that code also read env vars to configure the context properly?
https://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession
'If you are working from the sparkR shell, the SparkSession should already be created for you, and you would not need to call sparkR.session.'
As far as I can tell, even if some shell didn't, the user would have to execute code that does that.
I have already told you that the way shells handle this is via '--conf PROP=VALUE', so I am not sure why we are going in circles here.
the way shells handle this is via '--conf PROP=VALUE'
We can certainly figure out how the kernel gateway could pass command line args to the kernels on launch in some dynamic fashion. But I'm not clear on what happens next. The kernels, which are entirely separate projects from the kernel gateway one, aren't aware of those arguments. Do you have a proposal for how the kernels would then use those arguments to initialize the Spark config?
@parente - "It's also possible to include environment variables in POST /api/kernels which will be set in the environment of the kernel when it spawns."
Can you share an example of doing the above... I am using this code: https://github.com/jupyter/kernel_gateway_demos/blob/master/python_client_example/src/client.py#L36
You can optionally pass an env object as part of the POST body. Extending the python_client_example:
# json_encode comes from tornado.escape in the linked client example
response = yield client.fetch(
    '{}/api/kernels'.format(base_url),
    method='POST',
    auth_username='fakeuser',
    auth_password='fakepass',
    body=json_encode({
        'name': options.lang,
        'env': {
            'KG_SOME_KEY': 'string value',
            'KG_MY_ENV_VAR': 'other value'
        }
    })
)
I mentioned above that only env vars that start with KG_* are whitelisted at the moment to prevent PATH, PYTHONPATH, and other sensitive env vars from being overridden. If you want to pass SPARK_OPTS, for example, we probably need to extend the whitelist rule.
@parente Thanks.
I didn't get time to follow through, but I am picking this up again now. So will try it out and get back.
If you want to pass SPARK_OPTS, for example, we probably need to extend the whitelist rule.
Yes please. Would appreciate it if you could.
@parente -- in your comment above from Nov 23, 2016 you are referring to KG_ prefixed environment variables being whitelisted:
... env vars that start with KG_* are whitelisted ...
Did you intend to write this instead?
... env vars that start with KERNEL_* are whitelisted ...
And should your example look like this instead?
response = yield client.fetch(
    '{}/api/kernels'.format(base_url),
    method='POST',
    auth_username='fakeuser',
    auth_password='fakepass',
    body=json_encode({
        'name': options.lang,
        'env': {
            'KERNEL_SOME_KEY': 'string value',
            'KERNEL_MY_ENV_VAR': 'other value'
        }
    })
)
see: kernel_gateway/services/kernels/handlers.py#L56
# Whitelist KERNEL_* args and those allowed by configuration
env = {key: value for key, value in model['env'].items()
       if key.startswith('KERNEL_') or key in self.env_whitelist}
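Since env_whitelist appears to be a configurable trait on the gateway app, extending the whitelist for SPARK_OPTS would presumably be a matter of configuration, along these lines (a sketch; verify the exact trait name against your kernel gateway version):

# jupyter_kernel_gateway_config.py
# Assumption: env_whitelist is exposed as a KernelGatewayApp config trait.
c.KernelGatewayApp.env_whitelist = ['SPARK_OPTS']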
@ckadner Yes indeed. Thanks for the catch.
I am trying to use the nb2kg extension. The way this extension passes info to KG is through headers. Is there a way to create env variables based on these headers?
I just merged a PR on the nb2kg repo that I think does what you need.
Today, the call to create a kernel does not allow setting Spark conf settings. Therefore one cannot initialize user-defined Spark conf properties. Here is an example.
We provide some functionality via a lib that needs to be used both in spark-submit and the interactive gateway. Here is how we have to use it from both:
spark-submit:
sparkSubmit.sh blah blah --conf spark.service.user.xxx.yyy.cred=demo blah blah blah

Kernel Gateway:
System.setProperty("spark.service.user.xxx.yyy.cred", "demo")
The net is that you cannot add props to a SparkContext after it is initialized, i.e., downstream readers will not see them (some SparkSQL props are an exception).
This also means that our internal implementation needs to handle both the system prop and the conf. Luckily, in our case our code only runs on the driver, so System.setProperty() will work. If the code needs to run on the executor, the system props would not get propagated to the executor and the user is blocked.
Therefore I think kernel creation in the kernel gateway API should allow setting user-specific SparkConf args.
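Tying this back to the env approach discussed above: if SPARK_OPTS were added to the whitelist, a client could presumably pass the same '--conf' through the kernel creation request, along these lines (a sketch reusing the earlier client example; the kernel name and property are placeholders):

# Hypothetical POST /api/kernels body once SPARK_OPTS is whitelisted; Toree
# would read SPARK_OPTS from its environment when it launches Spark.
body = json_encode({
    'name': 'apache_toree_scala',
    'env': {
        'SPARK_OPTS': '--conf spark.service.user.xxx.yyy.cred=demo'
    }
})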