jupyter / docker-stacks

Ready-to-run Docker images containing Jupyter applications
https://jupyter-docker-stacks.readthedocs.io

all-spark-notebook: MLlib doesn't find numpy #109

Closed boechat107 closed 8 years ago

boechat107 commented 8 years ago

I'm using the Python 2 kernel and trying to train a KMeans model, but it fails because numpy cannot be found.

Py4JJavaError: An error occurred while calling o36.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy

The code I used for this test is very simple and it's posted here:

from pyspark import SparkContext, SQLContext
from pyspark.mllib.clustering import KMeans, KMeansModel
import pandas as pd
import numpy as np
import os

os.environ["PYSPARK_PYTHON"] = "python2"
sc = SparkContext("local[*]")
sqlc = SQLContext(sc)

rawDf = sqlc.createDataFrame(pd.read_csv("somefile.csv"))
avgClusters = KMeans.train(rawDf,
                           5,
                           maxIterations=10,
                           runs=10,
                           initializationMode="random")

parente commented 8 years ago

Confirmed the bug on Python 2. It looks like the Python path is not being set properly in the Spark executor.

Side note: confirmed it works fine on Python 3.

boechat107 commented 8 years ago

I have just solved my problem by setting PYSPARK_PYTHON in a different way:

import os
os.environ["PYSPARK_PYTHON"] = os.environ["CONDA_DIR"] + "/envs/python2/bin/python2"

Calling whereis python2 inside the container gave me /usr/bin/python2, which is probably not the intended interpreter.
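
As a sanity check (just a minimal sketch, assuming a SparkContext named sc is already created as in the snippet above), the executors themselves can be asked which interpreter they ended up with:

# Ask each worker which Python binary it is running; the inline
# __import__ keeps the lambda self-contained when it is shipped.
print(sc.parallelize(range(2), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect())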

parente commented 8 years ago

Workaround (that might be the final solution):

from pyspark import SparkContext, SQLContext
from pyspark.mllib.clustering import KMeans, KMeansModel
import pandas as pd
import numpy as np
import os
import sys

os.environ["PYSPARK_PYTHON"] = "python2"
os.environ['PYTHONPATH'] = ':'.join(sys.path)

parente commented 8 years ago

@boechat107 Nice. I like yours better. You hit the root problem: python2 finds the system python first.

boechat107 commented 8 years ago

=)

@parente Thank you anyway!

parente commented 8 years ago

I'd like to keep this open until the READMEs are updated. I'll see if there's a way to "discover" the python bin path for the kernel without having to specify it explicitly. If not, your answer is the answer. :)
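
One way that discovery might work (a sketch, not verified against these images): since notebook code runs in the kernel's own interpreter, sys.executable should already point at the right binary, and it can be handed to the workers before the SparkContext is created:

import os
import sys
from pyspark import SparkContext

# Point the workers at the same interpreter the kernel itself runs,
# so driver and executors share one environment (numpy included).
os.environ["PYSPARK_PYTHON"] = sys.executable
sc = SparkContext("local[*]")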

jakirkham commented 8 years ago

We could activate the environment either by adding an entrypoint that runs activate (probably not so useful here) or by just setting 3 environment variables, unless there is some reason not to do this.

parente commented 8 years ago

Can you activate the environment from within a running notebook server? I don't want the user to have to start Jupyter in one config for Python 3, exit it, then start it in another config for Python 2.

EDIT: I routinely launch both python 2 and 3 kernels.

jakirkham commented 8 years ago

Is there some sort of config file that Jupyter runs when starting up a kernel? That might be a good way to do this then.

parente commented 8 years ago

There used to be one for IPython kernels specifically. Not sure about now. @minrk?

jakirkham commented 8 years ago

So, this is from the IPython 3 documentation, but I think it still holds. We can modify the kernel.json to include environment variables. (https://ipython.org/ipython-doc/3/development/kernels.html#kernel-specs)

As it looks like kernel.json is non-existent for Python 3, we can just write one that has these environment variables in it. The only trick is modifying the path. I suppose we can just hardcode the path since no one should really be changing it.

parente commented 8 years ago

For the Python 3 kernel, we don't have to do anything. It works as-is since it's the default conda environment. For the Python 2 kernel, we'd want to activate the python2 env. But can the kernel spec source the env before running the kernel? Or is there a way with conda to activate an env without using source?

jakirkham commented 8 years ago

Sorry, I think I misread something. You're right.

For activating python2, we only need to set a handful of environment variables, the last of which is $PATH.

On that last point, I think we just hardcode $PATH with everything we want in it. I don't expect that to be an issue. If we can figure out something better for that case in the long run, that would be nice. Activate does some other things, but it is stuff that we can easily go without (e.g. changing $PS1).
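
As a rough sketch of what that could look like (the kernel spec location and conda env path are assumptions based on this image's layout, and the hardcoded PATH tail is only an example):

import json

# Assumed locations: the python2 kernel spec and the conda env's bin dir.
spec_path = "/usr/local/share/jupyter/kernels/python2/kernel.json"
env_bin = "/opt/conda/envs/python2/bin"

with open(spec_path) as f:
    spec = json.load(f)

# Kernel specs accept an optional "env" dict of variables that are set
# for the kernel process; prepend the conda env to a hardcoded PATH.
spec.setdefault("env", {})["PATH"] = env_bin + ":/usr/local/bin:/usr/bin:/bin"

with open(spec_path, "w") as f:
    json.dump(spec, f, indent=2)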

jakirkham commented 8 years ago

FWIW, I got a listing of the kernelspecs. It appears we are using the system python2. Is this correct, or do we want the one from conda?

$ jupyter kernelspec list
Available kernels:
  python3    /opt/conda/lib/python3.4/site-packages/ipykernel/resources
  ir         /opt/conda/share/jupyter/kernels/ir
  scala      /opt/conda/share/jupyter/kernels/scala
  python2    /usr/local/share/jupyter/kernels/python2

parente commented 8 years ago

Could be that the Dockerfile is running the system python2 instead of the one in conda, and that's why it's not getting set up properly.

EDIT: running the system python2 when installing the python2 kernel spec

jakirkham commented 8 years ago

That's what I was wondering. Even if it is not the problem, it should probably be addressed.

parente commented 8 years ago

Hmmm. Looks like it is using the conda python2, but may be missing some parameters?

https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile#L57

jakirkham commented 8 years ago

After looking closer, it appears it is actually using the right python, but has opted to install the kernel in the /usr/local prefix instead.

parente commented 8 years ago

Taking a step back, there's nothing wrong with the environment for the Python 2 kernel here. I can import numpy just fine in the notebook. The problem here is that the Spark worker processes are being told to use the system python2 instead of the conda python2 by the bad documentation in the README. While it's odd that the python2 kernel spec winds up in /usr/local, moving it to /opt/conda/envs/python2 has no effect on the Spark worker process path.

We could add the environment variable PYSPARK_PYTHON to the Python 2 kernel spec itself in the pyspark-notebook and all-spark-notebook images. Then the fact that the path is being set would be hidden from the user. It becomes an implicit part of the environment rather than an explicit part of the setup done in the notebook, which is bad from a user-understanding perspective. On the other hand, hardcoding a path like that in the notebook itself makes it less portable to other environments.

So, explicit config or portable notebook?

I'm leaning toward the latter. Config at the level of a path to a python2 binary feels like something that should "just work" in the notebook environment, not something that should be captured in a notebook for reproducibility.

parente commented 8 years ago

The build has completed, but the push to Docker Hub keeps failing with 500 errors (which is typical). I'll keep retrying until it goes through and will note the tag here.

parente commented 8 years ago

4th make release-all is a charm. Tagged 55d5ca6be183 on Docker Hub.