krishnan-r / sparkmonitor

Monitor Apache Spark from Jupyter Notebook
https://krishnan-r.github.io/sparkmonitor/
Apache License 2.0

Test compatibility with PYSPARK_SUBMIT_ARGS #10

Closed AbdealiLoKo closed 6 years ago

AbdealiLoKo commented 6 years ago

Based on the discussion at https://github.com/krishnan-r/sparkmonitor/issues/6#issuecomment-392330780

The extension does an import pyspark internally. This means that if I, as a Jupyter user, want to do something like the following:

import os

spark_pkgs=('com.amazonaws:aws-java-sdk:1.7.4',
            'org.apache.hadoop:hadoop-aws:2.7.3',
            'joda-time:joda-time:2.9.3',)

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages {spark_pkgs} pyspark-shell'.format(spark_pkgs=",".join(spark_pkgs)))

import findspark
findspark.init()
import pyspark

spark = pyspark.sql.SparkSession.builder \
    .getOrCreate()

I cannot, because the PYSPARK_SUBMIT_ARGS environment variable gets set only after pyspark has already been imported by the sparkmonitor module.

krishnan-r commented 6 years ago

Can you confirm that setting the environment variable is not working?

I think the environment variable is read by Spark only when the SparkContext object is created. The extension only imports pyspark and creates a SparkConf object. If I'm not wrong, you can still add properties to the conf, as well as set environment variables, before starting the context. (Here again, you must pass the conf when creating the SparkContext for the extension to work.)
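For example, something along these lines should work in a notebook cell. This is only a minimal sketch: it assumes the SparkConf object created by the extension is available as conf in the notebook namespace, and it reuses one of the package coordinates from the example above purely for illustration.

import os

# Spark reads PYSPARK_SUBMIT_ARGS only when the SparkContext is launched,
# so setting it here, before creating the context, is still early enough.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell')

import pyspark

# conf is assumed to be the SparkConf object created by the extension;
# passing it when creating the SparkContext keeps the monitor attached.
sc = pyspark.SparkContext(conf=conf)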

AbdealiLoKo commented 6 years ago

You're right. PYSPARK_SUBMIT_ARGS fails to take effect only in the case of the PySpark kernel in Jupyter, and that is because the PySpark kernel initializes the SparkContext internally, so the args have no effect (the SparkContext has already been initialized by the time they are set).

An observation: it does look like sparkmonitor won't work correctly with the PySpark kernel, as the PySpark kernel does not use the conf created by sparkmonitor.

Closing as this is not an issue.