microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Failed to Load DataConversion #1378

Open jrdzha opened 2 years ago

jrdzha commented 2 years ago

Describe the bug

com.microsoft.azure.synapse.ml.featurize.DataConversion doesn't implement read(), so it can't be loaded. Saving works fine, but loading fails both when DataConversion().load() is called directly and when DataConversion is part of an MLlib Pipeline/PipelineModel.

To Reproduce

Minimal reproduction:

import pyspark
from synapse.ml.featurize import DataConversion

spark = (
    pyspark.sql.SparkSession.builder.master("local[*]")
    .appName("App")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.1,com.microsoft.azure:synapseml_2.12:0.9.5",
    )
    .getOrCreate()
)

path = "data_conversion.stage"
stage = DataConversion(cols=["input"], convertTo="string")
stage.save(path)  # saving succeeds
DataConversion().load(path)  # loading raises the NoSuchMethodException below
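
As noted above, the same failure shows up when DataConversion sits inside an MLlib Pipeline. A minimal sketch of that path, reusing spark and DataConversion from the snippet above and a hypothetical save location pipeline_model.stage:

from pyspark.ml import Pipeline, PipelineModel

# Hypothetical Pipeline round-trip: fit a trivial pipeline containing the
# stage, save it, and try to load it back. The load step hits the same
# NoSuchMethodException through PipelineModelReader (see stacktrace below).
df = spark.createDataFrame([(1,), (2,)], ["input"])
pipeline = Pipeline(stages=[DataConversion(cols=["input"], convertTo="string")])
model = pipeline.fit(df)
model.save("pipeline_model.stage")
PipelineModel.load("pipeline_model.stage")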

Expected behavior

The saved DataConversion stage should load back without error, both on its own and when it is part of a PipelineModel.

Stacktrace

22/02/03 18:47:00 ERROR Instrumentation: java.lang.NoSuchMethodException: com.microsoft.azure.synapse.ml.featurize.DataConversion.read()
        at java.lang.Class.getMethod(Class.java:1786)
        at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631)
        at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
        at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
        at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
        at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
        at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
        at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
        at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
        at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
        at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
        at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
        at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:748)
mhamilton723 commented 2 years ago

I think because load is a static method, it should be DataConversion.load(path). Let me know if this fixes it for you, and feel free to re-open if not.
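
For reference, the suggested call against the save path from the repro above would look roughly like this (path name reused from that snippet):

from synapse.ml.featurize import DataConversion

# load called on the class itself rather than on a fresh instance
stage = DataConversion.load("data_conversion.stage")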

jrdzha commented 2 years ago

@mhamilton723 I just tried it with DataConversion.load(path) and it's still giving me this error: com.microsoft.azure.synapse.ml.featurize.DataConversion.read does not exist in the JVM. Loading DataConversion also doesn't work when it's used in a Pipeline, so it seems like an implementation issue in DataConversion itself?
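
For what it's worth, the missing method can be probed from the PySpark side through py4j, which is roughly the path the wrapper takes when loading. A rough diagnostic sketch, reusing the spark session from the repro above:

from py4j.protocol import Py4JError

# Ask py4j for the static read member directly. If the Scala class exposed
# read(), this attribute lookup would resolve; instead it raises the same
# "does not exist in the JVM" error quoted above.
jvm = spark.sparkContext._jvm
try:
    jvm.com.microsoft.azure.synapse.ml.featurize.DataConversion.read
except Py4JError as e:
    print(e)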

mhamilton723 commented 2 years ago

We test this as part of the build but will take a look. Thanks for bringing this back up!

jrdzha commented 2 years ago

@mhamilton723 Really appreciate it! Please let me know if there's any way I can help.

jrdzha commented 2 years ago

@mhamilton723 Sorry to keep pushing, but this is really blocking us. Any updates, or any way I can help? Thanks!