patkovskyi opened this issue 4 years ago
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
I'm trying to run MMLSpark on a Zeppelin instance deployed as an AWS EMR cluster. It's a real Spark cluster with a master node and 10 slaves.
Versions: Scala 2.11.12, Spark 2.4.3, Zeppelin 0.8.1, EMR 5.26.0.
Here's what I tried:
- Setting dependencies via the interpreter settings
- Dynamic dependency loading

Both had the same effect: the EMR cluster started just fine, but the MMLSpark lib was not available in the %spark or %pyspark interpreters. I also tried different versions (e.g. 0.18.0) and different repositories (e.g. bintray/spark-packages and the default Maven repo) with the same result.
Interestingly enough, I found a Russian article about MMLSpark (https://habr.com/ru/company/raiffeisenbank/blog/456668/) which links an example notebook on Zepl (https://www.zepl.com/viewer/notebooks/bm90ZTovL3NlbWVuc2luY2hlbmtvQGdtYWlsLmNvbS9kOTkzZTk2MGUxOWE0ZDRjOGZlMDNiNzM5YTVlODQ3Mi9ub3RlLmpzb24). The first cell of that notebook loads the MMLSpark dependency dynamically, so it was working at some point. But when I import this notebook and run the same code, I get a "Failed to collect dependencies" error 🙄
I'd greatly appreciate any help! It seems that I'm using compatible versions of all related software, so it's surprising to see it fail.
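For reference, the dynamic-loading attempt would look roughly like this in a Zeppelin 0.8 note. This is a sketch, not the exact cell I ran: the coordinates and repository URL are the ones the MMLSpark README lists for rc builds, and the rc resolver may be exactly what the default repositories are missing.

```
%spark.dep
// Dependency loading must run before the Spark interpreter starts in this note.
z.reset()
z.addRepo("mmlspark-rc").url("https://mmlspark.azureedge.net/maven")
z.load("com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1")
```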
I spoke with a person who got the MMLSpark (1.0.0-rc1) dependency working via the UI on a locally installed Zeppelin 0.8.1 (the same way that does not work for me). This makes me think that the problem might be specific to AWS EMR.
Found one very hacky way to do it: download MMLSpark with all its transitive dependencies, put them all in a fat jar, and pass it via SPARK_SUBMIT_OPTIONS (--conf spark.jars). Here's what I did:
- Assembled a fake Maven pom that pulls in MMLSpark and all of its transitive dependencies: https://gist.github.com/patkovskyi/ea54b56afbfc3135c763694bd7ed3b0e (a sketch of the idea follows this list).
- Ran `mvn package` => `mmlspark-1.0.0-rc1.jar` is now in the `target` folder.
- Ran `cd target && zip -d mmlspark-1.0.0-rc1.jar 'META-INF/*.RSA' 'META-INF/*.DSA' 'META-INF/*.SF'` to work around the manifest signature issue.
- Set `"SPARK_SUBMIT_OPTIONS": "\"--conf spark.jars=/tmp/mmlspark-1.0.0-rc1.jar\" \"--conf zeppelin.pyspark.python=/home/hadoop/zepython/bin/python3\""` in the Zeppelin configuration.
- Uploaded the jar to the EMR cluster and placed it in /tmp/ as part of the EMR bootstrap actions.
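The gist above has the actual pom; a minimal sketch of the idea looks like this (the wrapper groupId is made up, and the shade filter shown here would replace the manual `zip -d` step by stripping jar signatures during packaging):

```xml
<!-- Hypothetical wrapper pom: builds target/mmlspark-1.0.0-rc1.jar with all
     transitive dependencies shaded in. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>mmlspark</artifactId>
  <version>1.0.0-rc1</version>

  <dependencies>
    <dependency>
      <groupId>com.microsoft.ml.spark</groupId>
      <artifactId>mmlspark_2.11</artifactId>
      <version>1.0.0-rc1</version>
    </dependency>
  </dependencies>

  <repositories>
    <repository>
      <id>mmlspark-rc</id>
      <url>https://mmlspark.azureedge.net/maven</url>
    </repository>
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <!-- Drop signature files so the shaded jar passes manifest checks. -->
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```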
Interestingly, this does not work for the %pyspark interpreter (import mmlspark still fails), but it works for %spark (Scala).
Can it be done in a simpler way? This won't work well for a team of engineers.
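One hypothesis for the %pyspark failure: spark.jars only puts the jar on the JVM classpath, while `import mmlspark` also needs the Python wrapper package on the PYTHONPATH. Assuming the wrappers are bundled inside the jar (as py4j-style Spark packages usually are), a quick experiment is to add the same jar as a py-file:

```python
%pyspark
# A jar is a zip archive, so sc.addPyFile can place it on the PYTHONPATH
# of the driver and executors. This is an experiment, not a guaranteed fix.
sc.addPyFile("/tmp/mmlspark-1.0.0-rc1.jar")

import mmlspark  # should resolve if the Python wrappers live inside the jar
```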
I created a fat jar with mmlspark and other dependencies, but in pyspark it still cannot import lightgbm; it fails with the error message "mmlspark.lightgbm._LightGBMClassifier does not exist".
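That py4j error usually means the Scala half of LightGBM never made it onto the JVM classpath. For comparison, the MMLSpark README installs the library through Spark's --packages mechanism rather than a hand-built fat jar, with the project's own resolver for rc builds. A sketch (spark-shell is just for illustration; the same coordinates apply to other submit paths):

```
spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
            --repositories https://mmlspark.azureedge.net/maven
```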