microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

MMLSpark on Cloudera #311

Open moyanojv opened 6 years ago

moyanojv commented 6 years ago

We are trying to use mmlspark in a Cloudera environment with Hue pyspark notebooks through Livy. All our attempts have failed, and we wonder whether this is possible at all. The only way we have gotten it working is to run pyspark without YARN.

Tested but not working: we modified the Spark 2 Client Advanced Configuration Snippet (Safety Valve) in Cloudera Manager to add --packages Azure:mmlspark:0.12 (spark.jars.packages=Azure:mmlspark:0.12). With this property our Livy session downloads the package and its dependencies, but we see nothing related to mmlspark in the session property spark.submit.pyFiles.
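For reference, this is roughly the equivalent of what Hue does when it opens a notebook session, expressed as a direct Livy REST call (a sketch; the host and port are placeholders for our cluster):

import json
import requests

# Placeholder Livy endpoint for our cluster.
livy_url = "http://livy-host:8998/sessions"

# Ask Livy for a pyspark session with the mmlspark package attached.
payload = {
    "kind": "pyspark",
    "conf": {"spark.jars.packages": "Azure:mmlspark:0.12"},
}
resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # session id and state ("starting", then "idle")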

Here are the Spark properties of the environment of a Livy session created with the approach above: livy-session-7 - Environment.pdf

And here is a screenshot of a working environment of a pyspark2 session using a different approach (pyspark2 --master local --deploy-mode client --packages Azure:mmlspark:0.12): pyspark - Environment.pdf

So here is my question: is it possible to use mmlspark in a Cloudera environment with Hue pyspark notebooks through Livy?

Thanks in advance.

mhamilton723 commented 6 years ago

Hey @moyanojv, thanks for reaching out! MMLSpark should be entirely compatible with YARN, as we do not rely on a particular scheduler. Are you able to install other Spark packages on your system? Do you get a particular error message?

moyanojv commented 6 years ago

Thanks @mhamilton723 for your help.

Right now I'm a little lost. As far as I can see, this package contains python code so I'm not sure how to install it. Do I have to install it as a python package in my environment?

mhamilton723 commented 6 years ago

@moyanojv to add a python+scala library to Spark you use "spark packages": when you create or spin up your Spark session, you pass the --packages flag to attach our maven package. If you are using pyspark, attaching the maven package will automatically load the python bindings into your interpreter. Here are the sections in the readme that describe the process:

https://github.com/Azure/mmlspark#spark-package

https://github.com/Azure/mmlspark#python
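In short, the python route from the readme boils down to something like this (a sketch using the 0.12 coordinates from above):

import pyspark

# Attaching the maven package resolves the jars, which also carry the
# python bindings, so the import below works once the session is up.
spark = (pyspark.sql.SparkSession.builder
         .appName("MyApp")
         .config("spark.jars.packages", "Azure:mmlspark:0.12")
         .getOrCreate())

import mmlspark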

Hope this helps!

moyanojv commented 6 years ago

@mhamilton723 I used this command on my Cloudera cluster:

pyspark2 --master yarn --deploy-mode client --packages Azure:mmlspark:0.12

And the shell comes up:

Python 3.6.1 (default, Sep 22 2017, 14:27:40) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Azure#mmlspark added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found Azure#mmlspark;0.12 in spark-packages
    found io.spray#spray-json_2.11;1.3.2 in central
    found com.microsoft.cntk#cntk;2.4 in central
    found org.openpnp#opencv;3.2.0-1 in central
    found com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 in central
:: resolution report :: resolve 1178ms :: artifacts dl 45ms
    :: modules in use:
    Azure#mmlspark;0.12 from spark-packages in [default]
    com.microsoft.cntk#cntk;2.4 from central in [default]
    com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 from central in [default]
    io.spray#spray-json_2.11;1.3.2 from central in [default]
    org.openpnp#opencv;3.2.0-1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 5 already retrieved (0kB/26ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/Azure_mmlspark-0.12.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/io.spray_spray-json_2.11-1.3.2.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.cntk_cntk-2.4.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/org.openpnp_opencv-3.2.0-1.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> 

As you can see, the package is downloaded, and it seems it is also correctly installed. But when I follow the tutorial:

>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName("MyApp").config("spark.jars.packages", "Azure:mmlspark:0.12").getOrCreate()
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>>

Am I doing something wrong?

Thanks for your help.

mhamilton723 commented 6 years ago

Hmm, the first line looks right, but when you launch pyspark from the command line you don't need to recreate the spark object, as it already exists. Try just

import mmlspark

and see if that works
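If it still fails, a quick diagnostic (my sketch, nothing official) is to check whether the bindings ever reached the driver's python path:

import sys

# --packages should have put an mmlspark entry on sys.path.
print([p for p in sys.path if "mmlspark" in p.lower()])

# Confirm the package coordinates were recorded in the Spark conf.
print(sc.getConf().get("spark.jars.packages", "not set"))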

moyanojv commented 6 years ago

@mhamilton723 here is the result:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>> 

I have attached the spark environment information.

PySparkShell - Environment.pdf

Thanks for your help.

mhamilton723 commented 6 years ago

Thanks for the quick reply! Is it possible to try this out with spark 2.2? That's what our package was built against.
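(In the shell, spark.version will confirm exactly what you are running.)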

moyanojv commented 6 years ago

I'm sorry, but right now that is not possible. We will try to upgrade our Spark version as soon as possible to test your suggestion. If we do change the version, we will certainly add our results here.

@mhamilton723 many thanks for your help.

Regards

mhamilton723 commented 6 years ago

@moyanojv perhaps also try installing the pip package directly, as it seems your spark-submit is not installing the python bits as anticipated:

https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl
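For example, assuming pip points at the same python that drives your pyspark:

pip install https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl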

hanzigs commented 5 years ago

I am new to mmlspark. Can I have some help with this, please?

>>> from mmlspark import TrainClassifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mmlspark.py", line 30, in <module>
    from mmlspark.TrainClassifier import TrainClassifier
ModuleNotFoundError: No module named 'mmlspark.TrainClassifier'; 'mmlspark' is not a package

imatiach-msft commented 5 years ago

@apremgeorge it looks like you are running into a similar issue. Can you try installing the latest pip package for the v0.17 version here:

https://mmlspark.azureedge.net/pip/mmlspark-0.17-py2.py3-none-any.whl

imatiach-msft commented 5 years ago

@apremgeorge also, how did you install the package in cloudera? Did you specify the spark package maven coordinates somewhere? Also, do you know if the scala bindings are working and you are only having trouble with the pyspark python bindings?
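If you want to isolate the two, one rough check of the scala side alone (a sketch, assuming the package's usual scala namespace) is:

spark-shell --packages Azure:mmlspark:0.17
scala> import com.microsoft.ml.spark._

If that import resolves, the jars are fine and the problem is confined to the python bindings.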

hanzigs commented 5 years ago

@imatiach-msft Thank you very much for the reply. It was not a Cloudera install; I installed it for pyspark via pip install. Now I am trying to install it on an HDInsight Spark cluster, using the configuration settings for mmlspark and its dependencies, running from a python IDE. Thanks.

Jinqiao commented 4 years ago

@imatiach-msft Hi~ I ran into the same problem with 0.18.1. Where can I get a 0.18.1 wheel file? Thank you!

njgerner commented 4 years ago

Is there a reference anywhere to what wheel files are available at https://mmlspark.azureedge.net/pip/** ?

rusonding commented 3 years ago

(mmlspark) [root@hadoop51]# spark2-submit --master yarn --conf spark.pyspark.python=/usr/lib/anaconda2/envs/mmlspark/bin/python --num-executors 10 --executor-memory 15G test_mmlspark.py
Traceback (most recent call last):
  File "/root/test/test_mmlspark.py", line 13, in <module>
    from mmlspark.lightgbm import LightGBMClassifier
  File "/usr/lib/anaconda2/envs/mmlspark/lib/python3.6/site-packages/mmlspark/lightgbm/LightGBMClassifier.py", line 11, in <module>
    from mmlspark.lightgbm._LightGBMClassifier import _LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark.lightgbm._LightGBMClassifier'

Installed package versions:

certifi          2016.2.28
future           0.18.2
mmlspark         0.0.11111111
numpy            1.19.2
pip              20.2.3
py4j             0.10.7
PyHive           0.6.1
pyspark          2.4.5
python-dateutil  2.8.1
setuptools       36.4.0
six              1.15.0
wheel            0.29.0