microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Can't integrate with Zeppelin — Failed to collect dependencies #790

Open patkovskyi opened 4 years ago

patkovskyi commented 4 years ago

I'm trying to run MMLSpark on a Zeppelin instance deployed on an AWS EMR cluster. It's a real Spark cluster with a master node and 10 slaves.

Versions: Scala 2.11.12, Spark 2.4.3, Zeppelin 0.8.1, EMR 5.26.0.

Here's what I tried:

  1. Setting dependencies via the interpreter settings (three screenshots of the Zeppelin interpreter dependency configuration attached).

  2. Dynamic dependency loading

%spark.dep
z.reset()
z.addRepo("MMLSpark").url("https://mmlspark.azureedge.net/maven")
z.load("com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1")

org.sonatype.aether.resolution.DependencyResolutionException: Failed to collect dependencies for com.microsoft.ml.spark:mmlspark_2.11:jar:1.0.0-rc1 (compile)

I tried different versions (e.g. 0.18.0) and different repositories (e.g. bintray/spark-packages and the default Maven repository) with the same result.
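One way to narrow down where resolution breaks: fetch the artifact's POM directly from the Zeppelin host. A minimal sketch, assuming the repository follows the standard Maven directory layout (the exact URL path is my assumption):

%pyspark
# Reachability check: if the POM can be fetched, the failure lies in the
# resolver configuration rather than in networking/proxy settings.
from urllib.request import urlopen

pom = ("https://mmlspark.azureedge.net/maven/com/microsoft/ml/spark/"
       "mmlspark_2.11/1.0.0-rc1/mmlspark_2.11-1.0.0-rc1.pom")
print(urlopen(pom).read(300).decode("utf-8"))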

Interestingly enough, I found a Russian article about MMLSpark (https://habr.com/ru/company/raiffeisenbank/blog/456668/) that links to an example notebook on Zepl (https://www.zepl.com/viewer/notebooks/bm90ZTovL3NlbWVuc2luY2hlbmtvQGdtYWlsLmNvbS9kOTkzZTk2MGUxOWE0ZDRjOGZlMDNiNzM5YTVlODQ3Mi9ub3RlLmpzb24). The first cell of the notebook reads:

%spark.dep
z.addRepo("bintray.com").url("http://dl.bintray.com/spark-packages/maven/")
z.load("Azure:mmlspark:0.17")

res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@6b777876

So it was working at some point. But when I import this notebook and run the same code, I get the same "Failed to collect dependencies" error 🙄

  3. SPARK_SUBMIT_OPTIONS: I tried both formats (--packages/--repositories and --conf):
{
  "Classification": "zeppelin-env",
  "Configurations": [
    {
      "Classification": "export",
      "Configurations": [],
      "Properties": {
        "SPARK_SUBMIT_OPTIONS": "\"--repositories=https://mmlspark.azureedge.net/maven\" \"--packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1\" \"--conf zeppelin.pyspark.python=/home/hadoop/zepython/bin/python3\"",
        "PYSPARK_DRIVER_PYTHON": "/home/hadoop/zepython/bin/python3",
        "PYSPARK_PYTHON": "/home/hadoop/zepython/bin/python3"
      }
    }
  ]
}

{
  "Classification": "zeppelin-env",
  "Configurations": [
    {
      "Classification": "export",
      "Configurations": [],
      "Properties": {
        "SPARK_SUBMIT_OPTIONS": "\"--conf spark.jars.repositories=https://mmlspark.azureedge.net/maven\" \"--conf spark.jars.packages=com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1\" \"--conf zeppelin.pyspark.python=/home/hadoop/zepython/bin/python3\"",
        "PYSPARK_DRIVER_PYTHON": "/home/hadoop/zepython/bin/python3",
        "PYSPARK_PYTHON": "/home/hadoop/zepython/bin/python3"
      }
    }
  ]
}

Both had the same effect: the EMR cluster started just fine, but the MMLSpark library was not available in the %spark or %pyspark interpreters.
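A quick way to confirm whether SPARK_SUBMIT_OPTIONS was picked up at all is to inspect the runtime Spark configuration from a notebook paragraph. A minimal sketch ("<not set>" is just a fallback default for the lookup):

%pyspark
# If spark-submit honored the options, the coordinates show up here.
print(spark.conf.get("spark.jars.packages", "<not set>"))
print(spark.conf.get("spark.jars.repositories", "<not set>"))
print(spark.conf.get("spark.jars", "<not set>"))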

I'd greatly appreciate any help! It seems that I'm using compatible versions of all related software, so it's surprising to see it fail.

welcome[bot] commented 4 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

patkovskyi commented 4 years ago

Found one very hacky way to do it: download MMLSpark with all its transitive dependencies, put all of them in a fat jar, and pass it via SPARK_SUBMIT_OPTIONS (--conf spark.jars=...); a quick verification sketch follows the steps below.

  1. Assembled a fake Maven pom: https://gist.github.com/patkovskyi/ea54b56afbfc3135c763694bd7ed3b0e
  2. mvn package => mmlspark-1.0.0-rc1.jar is now in the target folder.
  3. cd target && zip -d mmlspark-1.0.0-rc1.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF to work around the manifest signature issue (the repackaged jar would otherwise fail signature verification).
  4. "SPARK_SUBMIT_OPTIONS": "\"--conf spark.jars=/tmp/mmlspark-1.0.0-rc1.jar\" \"--conf zeppelin.pyspark.python=/home/hadoop/zepython/bin/python3\""
  5. Uploaded the jar to the EMR cluster and placed it in /tmp/ as part of the EMR bootstrap actions.
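To check that the bootstrap action actually placed the jar and that the MMLSpark classes survived the repackaging, something like this can be run from a notebook paragraph (path and jar name taken from steps 3-5; just a sketch, not part of the workaround itself):

%pyspark
# List the MMLSpark class entries inside the fat jar on local disk.
import zipfile

with zipfile.ZipFile("/tmp/mmlspark-1.0.0-rc1.jar") as jar:
    hits = [n for n in jar.namelist()
            if n.startswith("com/microsoft/ml/spark")]
    print(len(hits), "entries, e.g.:", hits[:5])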

Interestingly, this does not work for the %pyspark interpreter (import mmlspark still fails), but it works for %spark (Scala).
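One possible explanation for the %pyspark gap: the published mmlspark jars bundle the Python bindings inside the jar, and as far as I can tell spark-submit adds --packages jars, but not plain spark.jars, to the Python path, so Scala sees the classes while import mmlspark finds nothing. If that's the cause, putting the jar on sys.path manually might help; a sketch under that assumption:

%pyspark
# Assumes the Python package sits at the root of the fat jar, so
# zipimport can load it straight from the archive (path from step 4).
import sys
sys.path.insert(0, "/tmp/mmlspark-1.0.0-rc1.jar")
import mmlspark  # should resolve now if the bindings are in the jar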

Can it be done in a simpler way? This won't work well for a team of engineers.

patkovskyi commented 4 years ago

I spoke with a person who successfully added the MMLSpark (1.0.0-rc1) dependency via the UI on a locally installed Zeppelin 0.8.1 (in the same way that does not work for me). This makes me think the problem might be specific to AWS EMR.

Ereebay commented 3 years ago

> Found one very hacky way to do it: download MMLSpark with all its transitive dependencies, put all of them in a fat jar, and pass it via SPARK_SUBMIT_OPTIONS (--conf spark.jars=...). [steps quoted from the comment above]

I created a fat jar with mmlspark and the other dependencies, but in %pyspark it still cannot import lightgbm normally; it fails with the error message "mmlspark.lightgbm._LightGBMClassifier does not exist".
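That error usually comes from py4j reporting that the JVM-side class is missing, rather than the Python wrapper. A diagnostic sketch to check whether the LightGBM Scala class actually made it onto the driver classpath (the fully qualified class name below is inferred from the Python module path and is an assumption):

%pyspark
# py4j lookup of the Scala class behind the Python wrapper.
jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName(
        "com.microsoft.ml.spark.lightgbm.LightGBMClassifier")
    print("LightGBM classes are on the JVM classpath")
except Exception as err:
    print("missing from the JVM classpath:", err)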