twosigma / beakerx

Beaker Extensions for Jupyter Notebook
http://BeakerX.com
Apache License 2.0

spark magic should support SPARK_OPTS envar #7440

Open scottdraves opened 6 years ago

scottdraves commented 6 years ago

SPARK_OPTS holds options for spark-submit (see https://spark.apache.org/docs/2.3.0/submitting-applications.html); for an example of a kernel that supports it, see Toree: https://github.com/apache/incubator-toree/search?q=spark_opts&unscoped_q=spark_opts

scottdraves commented 6 years ago

So it should not be hard to parse this variable and fill out the Spark config form with its values. But it introduces a complication that makes me wonder if there is a better way.
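
For concreteness, here is a minimal sketch in Python of how that parsing might look. This is not BeakerX code: the function name is invented and the flag mapping is abbreviated; it just assumes SPARK_OPTS holds ordinary spark-submit arguments.

import shlex

# Map a few spark-submit flags to their Spark property equivalents;
# the real table is longer (see the submitting-applications docs above).
FLAG_TO_PROPERTY = {
    "--master": "spark.master",
    "--driver-memory": "spark.driver.memory",
    "--executor-memory": "spark.executor.memory",
    "--executor-cores": "spark.executor.cores",
}

def parse_spark_opts(value):
    """Turn a SPARK_OPTS string into a dict of Spark properties."""
    props = {}
    tokens = shlex.split(value)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--conf" and i + 1 < len(tokens):
            # --conf spark.foo=bar
            key, _, val = tokens[i + 1].partition("=")
            props[key] = val
            i += 2
        elif tok in FLAG_TO_PROPERTY and i + 1 < len(tokens):
            # e.g. --master local[*]
            props[FLAG_TO_PROPERTY[tok]] = tokens[i + 1]
            i += 2
        else:
            # flags this sketch does not map are skipped
            i += 1
    return props

# parse_spark_opts("--master local[*] --conf spark.executor.memory=8g")
# -> {"spark.master": "local[*]", "spark.executor.memory": "8g"}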

Here’s the complication I mean. The values from the Spark UI are also stored in the ~/.jupyter/beakerx.json file so they persist. So if you use SPARK_OPTS to set some values, they appear in the UI, and they get saved in beakerx.json. Then on the next run, they will appear in both places, so any further change to the envar will be ignored because beakerx.json will override it, or the other way around. Either one is confusing.

The only way I can think of working around this is to make values from the envar not editable in the UI, which seems annoying.

So I think having two sources for the configs is a problem.

I could not find documentation for SPARK_OPTS in Spark itself. Is this envar official? Could users instead set up their configs via the beakerx.json file? We only just (today) provided this option. For example, here is what my beakerx.json file now looks like:

{
  "beakerx": {
    "jvm_options": {
      "other": [],
      "properties": []
    },
    "ui_options": {
      "auto_close": false,
      "auto_save": false,
      "improve_fonts": true,
      "show_publication": true,
      "use_data_grid": true,
      "wide_cells": false
    },
    "version": 2.0,
    "spark_options": {
      "spark.executor.memory": "8g",
      "spark.master": "local[*]",
      "spark.driver.maxResultSize": "2g",
      "spark.executor.cores": "11"
    }
  }
}

The file has the advantage over SPARK_OPTS that it is writable. If we really wanted, we could provide a command line to set the file from the envar:

beakerx copy_spark_config

or something. That would make it easy for people who have code using the envar to get access to the values, while avoiding the complications by making it an explicit one-off.
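
As a sketch of what such a one-off might do (hypothetical; neither the command nor the function exists in BeakerX today, and it reuses parse_spark_opts from the earlier sketch), assuming the beakerx.json layout shown above:

import json
import os

def copy_spark_config():
    """Hypothetical one-off: merge SPARK_OPTS into ~/.jupyter/beakerx.json."""
    path = os.path.expanduser("~/.jupyter/beakerx.json")
    with open(path) as f:
        config = json.load(f)
    opts = os.environ.get("SPARK_OPTS", "")
    # Envar values overwrite any matching persisted spark_options entries,
    # once, explicitly; afterwards the file is the single source of truth.
    config["beakerx"].setdefault("spark_options", {}).update(parse_spark_opts(opts))
    with open(path, "w") as f:
        json.dump(config, f, indent=2)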

jarutis commented 6 years ago

I missed this feature when evaluating BeakerX as a notebook for a Spark cluster a few weeks ago. Storing config in beakerx.json is fine for us. I believe the standard way is to put config options in the $SPARK_HOME/conf/spark-defaults.conf file.
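
For reference, the spark_options from the beakerx.json example above would look like this in spark-defaults.conf (whitespace-separated key/value pairs):

spark.master                  local[*]
spark.executor.memory         8g
spark.executor.cores          11
spark.driver.maxResultSize    2g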

scottdraves commented 6 years ago

@jarutis also writes:

for example https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json or https://github.com/apache/zeppelin/blob/master/spark/interpreter/src/main/resources/interpreter-setting.json or https://github.com/spark-notebook/spark-notebook/blob/master/conf/profiles

vishnu2kmohan commented 6 years ago

Then on the next run, they will appear in both places, so any further change to the envar will be ignored because beakerx.json will override it, or the other way around. Either one is confusing.

Hmm, doesn't (sane, 12-factor) convention state that environment variables override (persistent) configuration files? If SPARK_OPTS yields the same config as beakerx.json, it's effectively a no-op, and if it is different, the env var should take precedence?

The only way I can think of working around this is to make values from the envar not editable in the UI, which seems annoying.

I don't think there's a need to prevent editing; we can allow edits if we don't specify --connect (by default, or especially if we notice that the SPARK_OPTS and beakerx.json configs differ, and possibly warn the user).

So I think having two sources for the configs is a problem.

It might, but most users realize that env vars take precedence, and we could explicitly document this behavior.
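
That precedence rule, as a minimal sketch (function name invented; the envar would be parsed as in the earlier sketch):

def effective_spark_options(file_options, envar_options):
    """file_options: the spark_options dict from beakerx.json;
    envar_options: a dict parsed from SPARK_OPTS.
    Returns the merged view; nothing is written back to disk."""
    merged = dict(file_options)
    merged.update(envar_options)  # the envar wins on any conflict
    return merged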

Does anyone but Toree use this envar? I could not find it in Spark itself.

I don't think anyone other than Apache Toree uses this env var. The thing I like about it is that it's merely well-understood, well-defined args to spark-submit (i.e., it's not just pure Spark --conf properties; it supports all arguments, incl. things like --py-files, --driver-cores, --driver-memory, etc., to ensure that the client-mode driver has the correct settings, esp. on JVM startup).
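
For illustration (all values invented), such a SPARK_OPTS might look like:

SPARK_OPTS="--master yarn --deploy-mode client --driver-memory 4g --driver-cores 2 --py-files deps.zip --conf spark.executor.memory=8g"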

So while it seems regrettable that Toree and BeakerX have different means of configuration, since envars are not writable, I don’t think we can switch to SPARK_OPTS.

Do you still feel this way after this discussion?

Could you instead set up your configs via the beakerx.json file? We only just (today) provided this option. For example, here is what my beakerx.json file now looks like:

I definitely would, if beakerx.json provides full fidelity for all arguments and properties that may be passed to spark-submit.

jpallas commented 6 years ago

I think using environment variables to configure things that might reasonably vary between notebooks is asking for trouble. It's one thing to configure "here is where to find the Spark libraries on my system" and another thing entirely to configure "here is how many executor cores I want for every Spark notebook I open".

This cries out for some kind of profile mechanism, which logically belongs at the Jupyter level. Unfortunately, at the moment, Jupyter only offers alternative kernel specifications.
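
One limited version of a profile that Jupyter does support today: a kernel spec's kernel.json accepts an env block, so a per-profile kernel could pin its own SPARK_OPTS. Everything in this example is illustrative (the argv, path, and values are invented, not an actual BeakerX kernel spec):

{
  "display_name": "BeakerX Scala (big cluster)",
  "language": "scala",
  "argv": ["java", "-jar", "/path/to/kernel.jar", "{connection_file}"],
  "env": {
    "SPARK_OPTS": "--master yarn --executor-memory 8g"
  }
}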

vishnu2kmohan commented 6 years ago

@jpallas

that might reasonably vary between notebooks

What might examples of such variability be, other than things like SPARK_HOME (or things that help locate core libraries and dependencies) vs. Spark-specific configuration, which is what ${SPARK_OPTS} would govern?

"here is how many executor cores I want for every Spark notebook I open"

Yep, this is exactly the desired and limited scope of responsibilities envisioned for ${SPARK_OPTS}, since it's only controlling what is passed to spark-submit.

Here's an example set of environment variables that results in the population of ${SPARK_OPTS} upon notebook startup, which allows Apache Toree to spin up any desired Spark configuration, including making complex spark-submit invocations as easy as eval spark-submit ${SPARK_OPTS} <your job-specific spark args>.
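
A sketch of that invocation from Python rather than a shell eval (the helper name and example arguments are invented):

import os
import shlex
import subprocess

def submit(job_args):
    """Mirror `eval spark-submit ${SPARK_OPTS} <job args>` from Python."""
    opts = shlex.split(os.environ.get("SPARK_OPTS", ""))
    subprocess.run(["spark-submit", *opts, *job_args], check=True)

# e.g. submit(["my_job.py", "--input", "data.csv"])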

jpallas commented 6 years ago

I feel like I shouldn't leave this hanging but I'm honestly more than a little confused at this point about what we are actually discussing. Are we talking about having the BeakerX Spark support take configuration information from the SPARK_OPTS environment variable that was set when the Jupyter notebook server was launched? Are we talking about passing the current configuration of the BeakerX Spark widget to an invocation of spark-submit from inside the notebook (through the SPARK_OPTS env var)? Are we talking about some other thing that I haven't grasped? Help!

vishnu2kmohan commented 6 years ago

Are we talking about having the BeakerX Spark support take configuration information from the SPARK_OPTS environment variable that was set when the Jupyter notebook server was launched?

Yep, just this 👆 😄; the other benefits are a convenient side effect.

djannot commented 6 years ago

+1