microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[BUG] Unable to Get JAR files installed on cluster #1805

Open Mike-Soukup opened 1 year ago

Mike-Soukup commented 1 year ago

SynapseML version

0.10.2

System information

Describe the problem

When trying to run the synapse.ml LightGBMClassifier, I receive the error:

TypeError: 'JavaPackage' object is not callable

Stack Trace:

TypeError                                 Traceback (most recent call last)
<ipython-input-35-4c5d3604aa9a> in <module>
      1 ## Set up Synapse Model...
----> 2 model = LightGBMClassifier()

/opt/spark-3.0.0/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
    108             raise TypeError("Method %s forces keyword arguments." % func.__name__)
    109         self._input_kwargs = kwargs
--> 110         return func(self, **kwargs)
    111     return wrapper
    112 

~/.local/lib/python3.7/site-packages/synapse/ml/lightgbm/LightGBMClassifier.py in __init__(self, java_obj, baggingFraction, baggingFreq, baggingSeed, binSampleCount, boostFromAverage, boostingType, catSmooth, categoricalSlotIndexes, categoricalSlotNames, catl2, chunkSize, dataRandomSeed, defaultListenPort, deterministic, driverListenPort, dropRate, dropSeed, earlyStoppingRound, executionMode, extraSeed, featureFraction, featureFractionByNode, featureFractionSeed, featuresCol, featuresShapCol, fobj, improvementTolerance, initScoreCol, isEnableSparse, isProvideTrainingMetric, isUnbalance, labelCol, lambdaL1, lambdaL2, leafPredictionCol, learningRate, matrixType, maxBin, maxBinByFeature, maxCatThreshold, maxCatToOnehot, maxDeltaStep, maxDepth, maxDrop, metric, microBatchSize, minDataInLeaf, minDataPerBin, minDataPerGroup, minGainToSplit, minSumHessianInLeaf, modelString, monotoneConstraints, monotoneConstraintsMethod, monotonePenalty, negBaggingFraction, numBatches, numIterations, numLeaves, numTasks, numThreads, objective, objectiveSeed, otherRate, parallelism, passThroughArgs, posBaggingFraction, predictDisableShapeCheck, predictionCol, probabilityCol, rawPredictionCol, repartitionByGroupingColumn, seed, skipDrop, slotNames, thresholds, timeout, topK, topRate, uniformDrop, useBarrierExecutionMode, useMissing, useSingleDatasetMode, validationIndicatorCol, verbosity, weightCol, xGBoostDartMode, zeroAsMissing)
    387         super(LightGBMClassifier, self).__init__()
    388         if java_obj is None:
--> 389             self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier", self.uid)
    390         else:
    391             self._java_obj = java_obj

/opt/spark-3.0.0/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     67             java_obj = getattr(java_obj, name)
     68         java_args = [_py2java(sc, arg) for arg in args]
---> 69         return java_obj(*java_args)
     70 
     71     @staticmethod

TypeError: 'JavaPackage' object is not callable

I gathered that this error is caused by not having the correct .jar files installed, so I set up my configuration as follows, per the website documentation:

...
    # Configs for SynapseML:
    "spark.jars.repositories":"https://mmlspark.azureedge.net/maven",
    "spark.jars.packages":"com.microsoft.azure:synapseml_2.12:0.10.2",
}

However, when I try to launch my Spark Cluster, I get the following error:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:/home/notebook/.ivy2/jars/com.microsoft.azure_onnx-protobuf_2.12-0.9.1.jar does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1529)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1493)
    at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:489)
    at org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:489)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:489)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I checked my /home/notebook/.ivy2/jars directory and there is a com.microsoft.azure_onnx-protobuf_2.12-0.9.1-assembly.jar there.

I am not exactly sure where to go from here. I am not familiar with Java packages and dependencies, but my understanding is that the assembly jar should contain all the files I need... I also tried using the 0.9.5-13-d1b51517-SNAPSHOT and got a similar error, just with a different missing file name. Please advise on how I can get these .jar files onto my Spark cluster so I can train my LightGBM model across my Spark executors instead of just on the Spark driver.

Code to reproduce issue

import pyspark

spark = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    # Please use the 0.10.2 version for Spark 3.2 and the 0.9.5-13-d1b51517-SNAPSHOT version for Spark 3.1
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)

Other info / logs

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-7aa1f519ff51> in <module>
      1 import pyspark
----> 2 spark = pyspark.sql.SparkSession.builder.appName("MyApp").config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2").config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven").getOrCreate()

/opt/spark-3.0.0/python/pyspark/sql/session.py in getOrCreate(self)
    184                             sparkConf.set(key, value)
    185                         # This SparkContext may be an existing one.
--> 186                         sc = SparkContext.getOrCreate(sparkConf)
    187                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    188                     # by all sessions.

/opt/spark-3.0.0/python/pyspark/context.py in getOrCreate(cls, conf)
    369         with SparkContext._lock:
    370             if SparkContext._active_spark_context is None:
--> 371                 SparkContext(conf=conf or SparkConf())
    372             return SparkContext._active_spark_context
    373 

/opt/spark-3.0.0/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    129         try:
    130             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 131                           conf, jsc, profiler_cls)
    132         except:
    133             # If an error occurs, clean up in order to allow future SparkContext creation:

/opt/spark-3.0.0/python/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    191 
    192         # Create the Java SparkContext through Py4J
--> 193         self._jsc = jsc or self._initialize_context(self._conf._jconf)
    194         # Reset the SparkConf to the one actually used by the SparkContext in JVM.
    195         self._conf = SparkConf(_jconf=self._jsc.sc().conf())

/opt/spark-3.0.0/python/pyspark/context.py in _initialize_context(self, jconf)
    308         Initialize SparkContext in function to allow subclass specific initialization
    309         """
--> 310         return self._jvm.JavaSparkContext(jconf)
    311 
    312     @classmethod

/opt/spark-3.0.0/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1567         answer = self._gateway_client.send_command(command)
   1568         return_value = get_return_value(
-> 1569             answer, self._gateway_client, None, self._fqn)
   1570 
   1571         for temp_arg in temp_args:

/opt/spark-3.0.0/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:/home/notebook/.ivy2/jars/com.microsoft.azure_onnx-protobuf_2.12-0.9.1.jar does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1529)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1493)
    at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:489)
    at org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:489)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:489)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @Mike-Soukup :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

JessicaXYWang commented 1 year ago

Hi @Mike-Soukup, thanks for raising this issue.

Are you able to install other packages on your system using your current approach?


You are on Spark 3.0, and there may be dependency issues with SynapseML 0.10.2.

[deleted, solution is not correct]

If it still doesn't work, can you try manually renaming the jars in their directory? This discussion might be useful as well: https://github.com/microsoft/SynapseML/issues/1374
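
For reference, a minimal sketch of that rename workaround, assuming the ~/.ivy2/jars path and the "-assembly" suffix seen in the stack trace above; this is not an official fix, just an illustration of the manual step being suggested:

# Sketch of the rename workaround. Assumes the ~/.ivy2/jars path from the
# stack trace and that the downloaded jars only differ from the expected
# names by the "-assembly" suffix. Copies rather than renames so the
# original assembly jars are kept.
import shutil
from pathlib import Path

jar_dir = Path.home() / ".ivy2" / "jars"
for assembly_jar in jar_dir.glob("*-assembly.jar"):
    expected = assembly_jar.with_name(assembly_jar.name.replace("-assembly", ""))
    if not expected.exists():
        shutil.copy2(assembly_jar, expected)
        print(f"created {expected.name} from {assembly_jar.name}")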

Mike-Soukup commented 1 year ago

@JessicaXYWang I ran into the same issues with the 0.9.5-13-d1b51517-SNAPSHOT package.

I can try to manually rename the files. Is there any specific content that should be in them that provides functionality? Or do they just need to be present for whatever reason?

JessicaXYWang commented 1 year ago

@Mike-Soukup It seems the jar is not installed. Can you check whether you are able to install any other packages on your system?

Mike-Soukup commented 1 year ago

@JessicaXYWang I was able to rename the .jar files on my system and got around the initial bug. Now when I try to train my model, I get the following error (screenshot attached). Is SynapseML not compatible with Spark 3.0.0? There seem to be dependency issues with the Java installation... Here are my cluster's jars (screenshot attached).

JessicaXYWang commented 1 year ago

@Mike-Soukup Thank you for the feedback. The compatible versions listed on the website are not correct. Sorry for the confusion. The documentation will be fixed soon with this PR

I can't find a version that's compatible with Spark 3.0.0. @serena-ruan, do you have more information on it?

I would recommend using Spark 3.2 with the latest SynapseML 0.10.2 version, or Spark 3.1 with synapseml_2.12:0.9.5-13-d1b51517-SNAPSHOT.

Mike-Soukup commented 1 year ago

@JessicaXYWang Ok. That's unfortunate. But thank you for bringing clarity and guidance to this issue.

flavajava commented 1 year ago

@JessicaXYWang I am running into this same issue attempting to use SynapseML 0.10.2 on Spark 3.3.1, so it seems your suggestion to use a more recent Spark version does not fix this issue.

@Mike-Soukup did you attempt to upgrade your Spark version per the suggestion, and if so, did it work?

JessicaXYWang commented 1 year ago

Hi @flavajava, thanks for raising this question.

We are aware of the package management issue on Spark 3.3. @KeerthiYandaOS, can you share more information on this?

KeerthiYandaOS commented 1 year ago

@flavajava The 0.10.2 version is for Spark 3.2 clusters. Can you please use the 0.10.1-69-84f5b579-SNAPSHOT version for Spark 3.3.1? Support for Spark 3.3.x is still in progress, so the website is not updated yet.
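
For readers wiring this up themselves, a hedged sketch of how that snapshot coordinate could be referenced when building a session, assuming the snapshot is served from the same SynapseML Maven repository used earlier in this thread (the app name is illustrative):

# Sketch only: pairs the Spark 3.3 snapshot build named above with the
# SynapseML Maven repository from the original report. Snapshot builds
# are not on Maven Central, so the repository config is required.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("SynapseML-on-Spark33")
    .config("spark.jars.packages",
            "com.microsoft.azure:synapseml_2.12:0.10.1-69-84f5b579-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)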

flavajava commented 1 year ago

@KeerthiYandaOS I attempted to install the snapshot you indicated on a cluster running Spark 3.3.1, and it failed for the same reason that 0.10.2 failed (on Spark 3.3.1).

I have downgraded my Spark to 3.2.1 and was able to get 0.10.2 to install on it.

I will continue to track this issue to see when 0.10.2 is able to work on Spark 3.3.1.

In the meantime, I think it would be helpful if you made it unambiguously clear on the Installation page of your documentation site that SynapseML will NOT work on Spark 3.3.x until it actually does work. This would have saved me a few days of chasing and would potentially save others time as well.

Thanks so much for your responsiveness to my post.

KeerthiYandaOS commented 1 year ago

@flavajava Can you please share when and on which platform you are seeing the error for Spark 3.3?

flavajava commented 1 year ago

This was attempting to install SynapseML 0.10.2 on the Databricks ML Runtime 12.1 (which includes Python 3.9.5, Spark 3.3.1)

mhamilton723 commented 1 year ago

I believe this is the same issue as #1817

KeerthiYandaOS commented 1 year ago

@flavajava We are seeing this issue because of the Maven resolution change introduced starting from DBR 11.0. Can you please set the spark.databricks.libraries.enableMavenResolution false Spark configuration on your cluster and see if that helps?
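
For anyone unsure where that setting goes: it is a cluster-level Spark configuration, typically entered in the cluster's Spark config (under Advanced options) before the cluster starts; a sketch of the line as it would appear there:

spark.databricks.libraries.enableMavenResolution false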

flavajava commented 1 year ago

@KeerthiYandaOS Not sure if something was changed by y'all or if Databricks updated the runtime, but I just tried installing SynapseML 0.10.2 on the Databricks ML Runtime 12.1 (which includes Python 3.9.5, Spark 3.3.1), the same runtime I was using a week or two ago, and the installation succeeded even without setting spark.databricks.libraries.enableMavenResolution false. Not sure what changed where, but I'll take it! Thank you for helping me work through this issue.

KeerthiYandaOS commented 1 year ago

@flavajava SynapseML 0.10.2 is for Spark 3.2, and the 0.10.1-69-84f5b579-SNAPSHOT version is for Spark 3.3.1 (please make sure you are using the appropriate SynapseML version). Not sure how it worked two weeks ago, but we haven't changed or deployed any changes to these versions. For DBR 11.0 and above (with Spark 3.3), you need the spark.databricks.libraries.enableMavenResolution false property to resolve the dependency issue.

flavajava commented 1 year ago

Well, what I'm seeing is that "For DBR 11.0 and above (with Spark 3.3), you need the spark.databricks.libraries.enableMavenResolution false property to resolve the dependency issue" is no longer actually the case, because I haven't set spark.databricks.libraries.enableMavenResolution to false and the installation of 0.10.2 is now working (on DBR 12.1 ML, which has Spark 3.3.1).

KeerthiYandaOS commented 1 year ago

Thank you @flavajava and @Mike-Soukup. As @flavajava mentioned, DBR 11.0 doesn't require spark.databricks.libraries.enableMavenResolution false anymore; I could install the jar without that property as well. We also have a new SynapseML version for Spark 3.3 that removes the need for the Maven resolution property: com.microsoft.azure:synapseml_2.12:0.11.0-32-6085190e-SNAPSHOT. Either way, we should be good with Spark 3.3 on DBR 11.
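
As a rough illustration (not an official instruction), that coordinate can be attached on a Databricks cluster as a Maven library with the SynapseML repository set as the custom repo; in a cluster or job library spec that would look roughly like:

{
  "libraries": [
    {
      "maven": {
        "coordinates": "com.microsoft.azure:synapseml_2.12:0.11.0-32-6085190e-SNAPSHOT",
        "repo": "https://mmlspark.azureedge.net/maven"
      }
    }
  ]
}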

Closing this issue as the solution is posted. Please feel free to open it if you are still facing errors. Thank you.

HoagieFestDS commented 1 year ago

@KeerthiYandaOS - Do you know if there is a SNAPSHOT that works with DBR 12.2 ML? I haven't been able to load any of those JARs successfully.

antonquintela commented 10 months ago

I have the same problem @HoagieFestDS, and I haven't been able to solve it. @KeerthiYandaOS, any ideas?

Thank you

HoagieFestDS commented 10 months ago

@antonquintela: we have com.microsoft.azure:synapseml_2.12:0.11.2 loaded on DBR 10.4, which worked. It's a deprecated runtime unfortunately, but it still works

dndharini commented 8 months ago

Library installation attempted on the driver node of cluster 0214-173402-dkcgo9cm and failed. Please refer to the following error message to fix the library or contact Databricks support. Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Library resolution failed because com.linkedin.isolation-forest:isolation-forest_3.2.0_2.12 download failed.