salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Write and read Spark stages to/from MLeap instead of Spark classes #475

leahmcguire closed 4 years ago

leahmcguire commented 4 years ago

Related issues: Currently, Spark's save method is used to serialize and deserialize Spark-wrapped stages. This PR changes the underlying serialization to write to and read from MLeap bundles.

Describe the proposed solution: Writes stages to MLeap and reads them from MLeap, falling back to reading from Spark save if the MLeap read fails.
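The read-with-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `readFromMleap`, `readFromSparkSave`, and `readStage` are hypothetical stand-ins that return strings instead of real stages, and the `.mleap` suffix check only simulates a failed MLeap read.

```scala
import scala.util.{Failure, Success, Try}

object FallbackReadSketch {
  // Hypothetical stand-in for an MLeap bundle reader (not a TransmogrifAI or MLeap API).
  // It fails for anything that does not look like a bundle, to simulate a legacy model.
  def readFromMleap(path: String): Try[String] =
    if (path.endsWith(".mleap")) Success(s"mleap-stage:$path")
    else Failure(new IllegalArgumentException(s"not an MLeap bundle: $path"))

  // Hypothetical stand-in for the older Spark save/load reader.
  def readFromSparkSave(path: String): Try[String] =
    Success(s"spark-stage:$path")

  // Try MLeap first; only if that fails, fall back to the Spark reader.
  def readStage(path: String): Try[String] =
    readFromMleap(path).orElse(readFromSparkSave(path))
}
```

With this shape, models saved before the change still load through the second reader, while newly written models take the MLeap path.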

Describe alternatives you've considered: N/A

Additional context: Next steps will be PRs to read the stages directly with the MLeap context rather than the Spark context for local scoring (and possibly for all scoring, to better optimize the DAG).

codecov[bot] commented 4 years ago

Codecov Report

Merging #475 into master will decrease coverage by 0.30%. The diff coverage is 70.41%.

@@            Coverage Diff             @@
##           master     #475      +/-   ##
==========================================
- Coverage   87.04%   86.74%   -0.31%     
==========================================
  Files         346      346              
  Lines       11782    11848      +66     
  Branches      385      374      -11     
==========================================
+ Hits        10256    10277      +21     
- Misses       1526     1571      +45     

Impacted Files: Coverage Δ
...impl/classification/OpDecisionTreeClassifier.scala 63.63% <ø> (-7.80%) :arrow_down:
...p/stages/impl/classification/OpGBTClassifier.scala 46.66% <ø> (-8.89%) :arrow_down:
...ges/impl/classification/OpLogisticRegression.scala 56.00% <ø> (-4.72%) :arrow_down:
...ssification/OpMultilayerPerceptronClassifier.scala 60.00% <ø> (-9.24%) :arrow_down:
...e/op/stages/impl/classification/OpNaiveBayes.scala 71.42% <ø> (-8.58%) :arrow_down:
...impl/classification/OpRandomForestClassifier.scala 66.66% <ø> (-5.56%) :arrow_down:
...ages/impl/regression/OpDecisionTreeRegressor.scala 50.00% <ø> (ø)
...rce/op/stages/impl/regression/OpGBTRegressor.scala 53.33% <ø> (ø)
...op/stages/impl/regression/OpLinearRegression.scala 76.00% <ø> (ø)
...ages/impl/regression/OpRandomForestRegressor.scala 50.00% <ø> (ø)
... and 23 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a350711...38960e9.

leahmcguire commented 4 years ago

@TuanNguyen27 the test that you put in, which should have failed on the local XGBoost, is (correctly) failing in this PR.

tovbinm commented 4 years ago

🥳 🥳 🥳

koertkuipers commented 4 years ago

This seems to have broken some of our in-house unit tests. In some cases I think it was because we wrote to relative paths; those were easily fixed by making the paths absolute. In other situations the paths were already absolute, and I am unsure why it broke at this point...

The stack traces all involve the MLeap BundleFile on reading and writing, always with the same NPE in UnixPath.normalizeAndCheck. For example:

[info]   Cause: java.lang.NullPointerException:
[info]   at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
[info]   at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
[info]   at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
[info]   at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:59)
[info]   at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:40)
[info]   at com.salesforce.op.stages.SparkStageParam.$anonfun$jsonDecodeMleap$1(SparkStageParam.scala:164)
[info]   at resource.DefaultManagedResource.open(AbstractManagedResource.scala:110)
[info]   at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:87)
[info]   at resource.DeferredExtractableManagedResource.either(AbstractManagedResource.scala:29)
[info]   at resource.DeferredExtractableManagedResource.opt(AbstractManagedResource.scala:31)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecodeMleap(SparkStageParam.scala:173)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecode(SparkStageParam.scala:123)
[info]   at com.salesforce.op.stages.SparkStageParam.jsonDecode(SparkStageParam.scala:55)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.$anonfun$setParams$1(ReadWrite.scala:564)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.setParams(ReadWrite.scala:561)
[info]   at org.apache.spark.ml.util.DefaultParamsReader$Metadata.getAndSetParams(ReadWrite.scala:549)
[info]   at org.apache.spark.ml.SparkDefaultParamsReadWrite$.getAndSetParams(SparkDefaultParamsReadWrite.scala:126)
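One plausible reading of the trace above, offered as an assumption rather than a confirmed diagnosis: if the bundle location reaches BundleFile as a URI built from a relative path, `java.net.URI.getPath` returns null for such an opaque URI, and handing null to the file system's `getPath` would produce this kind of NullPointerException. The sketch below uses only JDK classes; `toAbsoluteUri` is a hypothetical helper, not part of TransmogrifAI or MLeap.

```scala
import java.net.URI
import java.nio.file.Paths

object RelativePathSketch {
  // Hypothetical helper: resolve a possibly relative path against the working
  // directory and return a hierarchical file URI with a non-null path.
  def toAbsoluteUri(path: String): URI =
    Paths.get(path).toAbsolutePath.normalize().toUri

  def main(args: Array[String]): Unit = {
    // "file:" followed by a relative path is an opaque URI, so getPath is null.
    val opaque = new URI("file:models/stage")
    println(s"opaque path: ${opaque.getPath}")

    // After resolving to an absolute path, the URI carries a usable path.
    val absolute = toAbsoluteUri("models/stage")
    println(s"absolute path: ${absolute.getPath}")
  }
}
```

This matches the workaround reported above of making paths absolute, but it does not explain the failures seen with paths that were already absolute.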

koertkuipers commented 4 years ago

Is protobuf 3 going to be an issue on Spark/Hadoop?

tovbinm commented 4 years ago

@koertkuipers Can you please open an issue to track this? Can you also share which transformer/estimator you are using in your workflow?

koertkuipers commented 4 years ago

https://github.com/salesforce/TransmogrifAI/issues/514

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire l***@s***.com Leah McGuire l***@s***.com. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.