jupyter-server / kernel_gateway

Jupyter Kernel Gateway
http://jupyter-kernel-gateway.readthedocs.org/en/latest/

Feature Req: Ability to pass in Spark Conf settings when creating Kernels #200

Closed mariobriggs closed 7 years ago

mariobriggs commented 7 years ago

Today, the call to create a kernel does not allow setting Spark conf properties, so one cannot initialize user-defined Spark conf properties. Here is an example.

We provide some functionality via a lib that needs to be used both in spark-submit and the interactive gateway. Here is how we have to use it from both:

spark-submit:

    sparkSubmit.sh blah blah --conf spark.service.user.xxx.yyy.cred=demo blah blah blah

Kernel Gateway:

    System.setProperty("spark.service.user.xxx.yyy.cred", "demo")

The net effect is that you cannot add properties to a SparkContext after it is initialized, i.e. downstream readers will not see them (some Spark SQL properties are an exception).

This also means that our internal implementation needs to handle both the system property and the conf. Luckily, in our case the code only runs on the driver, so System.setProperty() works. If the code needed to run on the executors, the system properties would not get propagated to them and the user would be blocked.

Therefore I think kernel creation in the kernel gateway API should allow setting user-specific SparkConf args.

parente commented 7 years ago

The main way to pass information to kernels in Jupyter is to define custom command line arguments or environment variables in the kernel spec file. For example, SPARK_OPTS, SPARK_HOME, etc. can be set in a kernel.json for Apache Toree or IPython.
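
For instance, a kernel.json for Toree could bake Spark settings into the kernel's environment like this (a minimal sketch; the paths and option values are illustrative, not defaults):

    {
      "display_name": "Apache Toree - Scala",
      "language": "scala",
      "argv": ["/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
               "--profile", "{connection_file}"],
      "env": {
        "SPARK_HOME": "/opt/spark",
        "SPARK_OPTS": "--master=local[2] --conf spark.executor.memory=1g"
      }
    }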

Does this fit your use case?

mariobriggs commented 7 years ago

@parente no, that does not help my case. Here I am looking to set per-kernel-instance, user-specified properties like a Kafka port, credentials, etc.

parente commented 7 years ago

It's also possible to include environment variables in POST /api/kernels which will be set in the environment of the kernel when it spawns.

https://github.com/jupyter/kernel_gateway/issues/128
https://github.com/jupyter/kernel_gateway/blob/master/kernel_gateway/jupyter_websocket/swagger.yaml#L403

At present, only env vars prefixed with KG_ are whitelisted, to avoid malicious requests that set PATH or other key env vars. With the support as implemented, a client can send an arbitrary env var and then execute code on the kernel that reads those env vars. The code executed on the kernel could read KG_SPARK_OPTS, KG_KAFKA_ENDPOINT, etc. from the environment and use their values to properly build, for example, a SparkContext.
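
As a concrete illustration, here is a minimal sketch of bootstrap code a client could send to a Python kernel to do that, assuming pyspark is importable in the kernel; the env var names, the parsing, and the spark.myapp.* key are illustrative, not a kernel gateway convention:

    # Runs inside the kernel, sent by the client as an ordinary execute
    # request after the kernel starts.
    import os
    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    # Pull "--conf key=value" pairs out of the whitelisted env var.
    for token in os.environ.get('KG_SPARK_OPTS', '').split('--conf')[1:]:
        key, _, value = token.strip().partition('=')
        if key:
            conf.set(key, value)

    # An illustrative app-specific setting passed straight through.
    kafka = os.environ.get('KG_KAFKA_ENDPOINT')
    if kafka:
        conf.set('spark.myapp.kafka.endpoint', kafka)

    sc = SparkContext(conf=conf)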

Here I am looking to set per-kernel-instance, user-specified properties like a Kafka port, credentials, etc.

So the parameters will vary per kernel instance, but they're all for a single user, correct? (Kernel gateway is fundamentally still a single-user service which needs to be scaled to support multiple users.)

mariobriggs commented 7 years ago

@parente for instance, the default spark-shell (Scala) and the pyspark shell allow passing in '--conf PROP=VALUE'. These are what I want to set:

    bin/pyspark --help
    ...
      --conf PROP=VALUE           Arbitrary Spark configuration property.

Similarly, Toree allows setting SPARK_OPTS, which can contain the '--conf'. For the Toree case, I am not sure if you are saying to hit the 'KernelSpec' endpoint first to set that?

parente commented 7 years ago

Sorry for the confusion. I'm suggesting the ways that kernel gateway supports passing information to kernels at startup time, to see if any of them can be used to solve your case. Adding a Spark-specific config option is what I'm trying to avoid, since KG is agnostic about what libs you're going to use and run in your kernel.

With the existing env support when making a POST /api/kernels request, you can pass environment variables that will be set in the environment of a kernel. You can conceivably send code to the kernel to read these environment variables and do whatever you need to with their values (e.g., programmatically configure a SparkContext).

mariobriggs commented 7 years ago

You can conceivably send code to the kernel to read these environment variables and do whatever you need to with their values (e.g., programmatically configure a SparkContext).

There is no option to inject any code before the SparkContext is initialized. Also, why would an end user go through such hoops?

parente commented 7 years ago

There is no option to inject any code before the SparkContext is initialized. Also, why would an end user go through such hoops?

The IPython and R kernels do not initialize a SparkContext on their own. What code is initializing the SparkContext in these cases? Can that code also read env vars to configure the context properly?

mariobriggs commented 7 years ago

https://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession

'If you are working from the sparkR shell, the SparkSession should already be created for you, and you would not need to call sparkR.session.'

As far as I can tell, even if some shell didn't, the user would have to execute code that does that.

I have already told you that the way shells handle this is via '--conf PROP=VALUE', so I am not sure why we are going in circles here.

parente commented 7 years ago

the way shells handle this is via '--conf PROP=VALUE'

We can certainly figure out how the kernel gateway could pass command line args to the kernels on launch in some dynamic fashion. But I'm not clear on what happens next. The kernels, which are entirely separate projects from the kernel gateway one, aren't aware of those arguments. Do you have a proposal for how the kernels would then use those arguments to initialize the Spark config?

mariobriggs commented 7 years ago

@parente - "It's also possible to include environment variables in POST /api/kernels which will be set in the environment of the kernel when it spawns."

Can you share an example of doing the above? I am using this code: https://github.com/jupyter/kernel_gateway_demos/blob/master/python_client_example/src/client.py#L36

parente commented 7 years ago

You can optionally pass an env object as part of the POST body. Extending the python_client_example:

response = yield client.fetch(
    '{}/api/kernels'.format(base_url),
    method='POST',
    auth_username='fakeuser',
    auth_password='fakepass',
    body=json_encode({
        'name' : options.lang,
        'env': {
            'KG_SOME_KEY': 'string value',
            'KG_MY_ENV_VAR': 'other value'
        }
    })
)

I mentioned above that only env vars that start with KG_* are whitelisted at the moment, to prevent overriding PATH, PYTHONPATH, and other sensitive env vars. If you want to pass SPARK_OPTS, for example, we probably need to extend the whitelist rule.


mariobriggs commented 7 years ago

@parente Thanks.

I didn't get time to follow through, but I'm picking this up again now. I will try it out and get back.

If you want to pass SPARK_OPTS, for example, we probably need to extend the whitelist rule.

Yes please. I would appreciate it if you could.

ckadner commented 7 years ago

@parente -- in your comment above from Nov 23, 2016 you are referring to KG_ prefixed environment variables being whitelisted:

... env vars that start with KG_* are whitelisted ...

Did you intend to write this instead?

... env vars that start with KERNEL_* are whitelisted ...

And should your example look like this instead?

response = yield client.fetch(
    '{}/api/kernels'.format(base_url),
    method='POST',
    auth_username='fakeuser',
    auth_password='fakepass',
    body=json_encode({
        'name' : options.lang,
        'env': {
            'KERNEL_SOME_KEY': 'string value',
            'KERNEL_MY_ENV_VAR': 'other value'
        }
    })
)

see: kernel_gateway/services/kernels/handlers.py#L56

            # Whitelist KERNEL_* args and those allowed by configuration
            env = {key: value for key, value in model['env'].items()
                   if key.startswith('KERNEL_') or key in self.env_whitelist}
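
(The `key in self.env_whitelist` clause suggests additional names can be allowed via server configuration. A minimal sketch of whitelisting SPARK_OPTS that way, assuming `env_whitelist` is exposed as a configurable trait on `KernelGatewayApp`, as the handler code implies:)

    # jupyter_kernel_gateway_config.py -- sketch only; assumes env_whitelist
    # is a configurable trait, per the handler code quoted above
    c.KernelGatewayApp.env_whitelist = ['SPARK_OPTS']
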
parente commented 7 years ago

@ckadner Yes indeed. Thanks for the catch.

ricedavida commented 7 years ago

I am trying to use the nb2kg extension. The way this extension passes info to KG is through headers. Is there a way to create env variables based on these headers?

parente commented 7 years ago

I just merged a PR on the nb2kg repo that I think does what you need.