salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Support Scala 2.12 #332

Open koertkuipers opened 5 years ago

koertkuipers commented 5 years ago

Spark 2.4 has Scala 2.11 and 2.12 builds. Starting with Spark 3.0, Scala 2.12 will become the default and Scala 2.11 will be dropped.

This depends on https://github.com/salesforce/TransmogrifAI/issues/184

This will require:

koertkuipers commented 5 years ago

Starting from #184, the first issue i ran into is gradle-scalastyle-plugin not being found for scala 2.12. resolved by adding this maven repo:

        maven {
            name = "ngbinh"
            url = "https://dl.bintray.com/ngbinh/maven"
        }
koertkuipers commented 5 years ago

To stop the scala compiler from exiting with strange messages like assertion failed: ClassBType.info not yet assigned: Lorg/apache/spark/rdd/RDD, i removed the -optimize flag

tovbinm commented 5 years ago

Just a heads up - with #274 merged and compiling under Scala 2.12, the majority of the tests will fail, e.g. UnaryLamdaTransformerTest, due to stage serialization issues. We need to replace all lambda functions with concrete classes as explained in #274

tovbinm commented 5 years ago

Thanks for making the effort to upgrade and support both Scala 2.11/12 versions @koertkuipers !!

koertkuipers commented 5 years ago

after publishing my own in-house versions of xgboost and mleap, it seems everything compiles and tests fine using scala 2.12. seemed a little too easy... wonder if i am not running tests correctly.

this is before https://github.com/salesforce/TransmogrifAI/pull/274. i get a bunch of merge conflicts with that one that i need to take a look at. since i didn't make any significant changes, the merge conflicts are likely with the spark-2.4 branch.

koertkuipers commented 5 years ago

xgboost not having the scala version in its package name doesn't help, e.g. xgboost4j-spark-0.90.jar vs xgboost4j-spark_2.11-0.90.jar

koertkuipers commented 5 years ago

see also https://github.com/dmlc/xgboost/pull/4574

koertkuipers commented 5 years ago

i built xgboost 0.90 for scala 2.11/2.12 here and published it to our bintray

i built mleap for scala 2.11/2.12 here and also published it to our bintray

my branch of transmogrifai for scala 2.12 is feat-scala212. it's currently set to build only for scala 2.12, since i don't yet know how to do both with gradle in one pass, but it also builds fine for scala 2.11 if you simply change scalaVersion and scalaVersionRevision

tovbinm commented 5 years ago

Nice! Please pull the latest from https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 since I just updated it.

I would like to try adding a cross build for 2.11/12

koertkuipers commented 5 years ago

i expected to see serialization issues after merging in https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 due to https://github.com/salesforce/TransmogrifAI/pull/274 but i don't :confused:

i see you put a lot of functions in companion objects like:

object BinaryTransformerTest {
  def fn: (Real, RealNN) => Real = (i1, i2) => new Real(for {v1 <- i1.value; v2 <- i2.value} yield v1 / (v2 * v2))
}

do you expect this approach to work for scala 2.12?

tovbinm commented 5 years ago

I think it works only because models are serialized and loaded within the same code. If you train a model, save it then recompile the codebase and try to load the model - it would fail.

wsuchy commented 5 years ago

@koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names each time you compile your code. So a model trained on your laptop will not load in your production environment due to missing references. We are currently working on a fix for that.
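The instability described above is easy to observe outside TransmogrifAI. A minimal sketch (the object and class names here are made up for illustration): on Scala 2.12 a lambda's runtime class gets a JVM-synthesized name, while a hand-written class keeps a name fixed by the source:

```scala
// Minimal sketch of why lambda-backed stages break on reload under 2.12.
object LambdaNaming {
  // A plain lambda: on 2.12 its runtime class gets a synthetic name such
  // as "LambdaNaming$$$Lambda$1/..." that is not stable across
  // compilations, so it cannot be re-instantiated by name later.
  val asLambda: Int => String = x => x.toString

  // A concrete class: its name "LambdaNaming$AsClass" is fixed by the
  // source code, so Class.forName can still find it after a recompile.
  class AsClass extends Function1[Int, String] with Serializable {
    def apply(x: Int): String = x.toString
  }

  def main(args: Array[String]): Unit = {
    println(asLambda.getClass.getName)      // synthetic, compiler-dependent
    println(new AsClass().getClass.getName) // stable: LambdaNaming$AsClass
  }
}
```

Persisting the synthetic name into a model and later calling Class.forName on it only works if nothing was recompiled in between.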

koertkuipers commented 5 years ago

@tovbinm @wsuchy is there any way you can think of to test this? would forking for tests do it?

tovbinm commented 5 years ago

The test should be constructed as follows:

  1. Train several models (binclass, multi class and regression) with all feature types (example).
  2. Save these models as a part of core/test/resources like this one.
  3. Add a new test that would load these models and score them.

This way we would have a set of models that contain stages saved from previous JVM runs, which would allow us to test that the models load & produce scores correctly.
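Since the crux is that deserialization must happen in a different JVM run (and possibly a different build) than serialization, the round-trip such a test exercises can be sketched with plain Java serialization as a stand-in for the actual stage persistence; PlusOne is an illustrative stage function, not a real one:

```scala
import java.io._

// Sketch of the save-then-load round-trip the proposed test would cover,
// using plain Java serialization in place of the real stage persistence.
object RoundTrip {
  // A concrete, stably named function class survives the round-trip.
  class PlusOne extends Function1[Int, Int] with Serializable {
    def apply(x: Int): Int = x + 1
  }

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("stage", ".bin")
    val out = new ObjectOutputStream(new FileOutputStream(file))
    out.writeObject(new PlusOne)
    out.close()

    // In the real test this read happens in a later JVM run, against a
    // recompiled codebase; a concrete class still resolves by name,
    // whereas a 2.12 lambda's synthetic class would not exist anymore.
    val in = new ObjectInputStream(new FileInputStream(file))
    val fn = in.readObject().asInstanceOf[Int => Int]
    in.close()
    println(fn(41)) // prints 42
  }
}
```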

tovbinm commented 5 years ago

I propose to follow the migration guide for 2.12 in #274 and replace all the lambda functions introduced with concrete classes.

koertkuipers commented 5 years ago

after publishing my own in-house versions of xgboost and mleap it seems everything compiles and tests fine using scala 2.12. seemed a little too easy... wonder if i am not running tests correctly.

this is before #274 i get a bunch of merge conflicts with that one that i need to take a look at. since i didnt make any significant changes the merge conflicts are likely with spark-2.4 branch.

i spoke too soon about everything testing fine. i was using compileTestScala, but now i tried test, which runs more tests, and some of them fail. so i've got some more work to do.

koertkuipers commented 5 years ago

all the tests except for one pass before #274. the only failure before #274 is:

  [-] RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label (1.33 secs)

com.salesforce.op.stages.impl.insights.RecordInsightsLOCOTest > RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label FAILED
    org.scalatest.exceptions.TestFailedException: 0.953500197613813 was not less than 0.8 The ratio of feature strengths between important and other features should be similar to the ratio of feature importances from Spark's RandomForest

after #274 i have lots of failures. but this is good news, as they were expected and we know how to fix them. the failures look like this:

    java.lang.RuntimeException: Failed to write out stage 'uid_1234'

        Caused by:
        java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.

            Caused by:
            java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.

                Caused by:
                java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838
koertkuipers commented 5 years ago

i tried to convert some lambdas to concrete classes as explained in #274 migration notes.

for example in PassengerFeaturesTest.scala this:

def genderFn: Passenger => MultiPickList = p => Set(p.getGender).toMultiPickList

i replaced by:

  class GenderFn extends Function1[Passenger, MultiPickList] {
    def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
  }

this did not work. i now get:

        java.io.NotSerializableException: com.salesforce.op.test.PassengerFeaturesTestLambdas$GenderFn

however it works if i add Serializable:

  class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
    def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
  }

i always thought scala functions extend serializable. but it's a little more complex than that:

scala> class X extends Function1[String, String]{ def apply(s: String) = s + "!" }
defined class X

scala> val x = new X
x: X = <function1>

scala> x.asInstanceOf[Serializable]
java.lang.ClassCastException: X cannot be cast to scala.Serializable
  ... 32 elided

scala> val y = { (s: String) => s + "!" }
y: String => String = <function1>

scala> y.asInstanceOf[Serializable]
res6: Serializable = <function1>
koertkuipers commented 5 years ago

@koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names for each time you compile your code. So model trained on your laptop will not load in your production environment due to missing references. We are currently working on the fix for that.

@wsuchy @tovbinm is what i am seeing the same issue? or is this something else?

i am seeing:

    java.lang.RuntimeException: Failed to write out stage 'uid_1234'

        Caused by:
        java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.

            Caused by:
            java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.

                Caused by:
                java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838
wsuchy commented 5 years ago

@koertkuipers the way I see this working is to explicitly wrap lambdas inside a specific object (one lambda per object), e.g.:

trait TMOGLambda[A, B] extends Serializable {
  val f: A => B
}

object Lambda1 extends TMOGLambda[Int, String] {
  override val f = (x: Int) => x.toString
}

And then have the serializer check for the presence of that object and hardcode it to always point to f, or whatever we name the property holding our lambda.

tovbinm commented 5 years ago

I think it's cleaner to have classes extend the FunctionX base classes, i.e.

class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
  def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
}
wsuchy commented 5 years ago

@tovbinm the problem will be supporting it from macros, as at the time we evaluate them there is no type information.

koertkuipers commented 5 years ago

@tovbinm @wsuchy take a look at https://github.com/tresata-opensource/TransmogrifAI/commit/95095ed870116ebcfb9759a3ab04d7679989f5c5 i simply replaced lambdas with concrete classes until all the tests passed. it's not pretty, but it works.

i am unsure if i have to add more tests for workflow-independent model loading, e.g. loading a model from json files.

wsuchy commented 5 years ago

Found this out today:

In our hello world example, the code like this:

val sepalLength = FeatureBuilder.Real[Iris].extract(_.sepalLength.toReal).asPredictor

results in json containing:

 "extractFn" : {
        "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5",
        "value" : {
          "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5"
        }
      },

Even though one can instantiate the workflow back, the names are still volatile and depend on the order of occurrence. I think we should throw an exception instead.
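One way to fail fast, sketched here: inspect the function's class name at save time and reject compiler-synthesized names. The helper name is made up; the markers checked are the anonfun pattern and the LambdaMetafactory pattern seen in the stack traces above:

```scala
// Hypothetical save-time guard: refuse to persist a function whose class
// name was generated by the compiler and is therefore not stable.
object ExtractFnGuard {
  def requireStableName(fn: AnyRef): Unit = {
    val name = fn.getClass.getName
    // "$anonfun$" marks anonymous function classes emitted by scalac;
    // "$$Lambda$" appears in classes spun up by LambdaMetafactory at
    // runtime on 2.12 (cf. com.salesforce.op.stages.Lambdas$$$Lambda$...).
    if (name.contains("$anonfun$") || name.contains("$$Lambda$")) {
      throw new IllegalArgumentException(
        s"'$name' is a compiler-generated function class and will not " +
          "resolve after a recompile; use a concrete FunctionN class instead")
    }
  }
}
```

For example, requireStableName((x: Int) => x + 1) would throw on 2.12, while a concrete class like GenderFn above would pass.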

koertkuipers commented 5 years ago

ok we now have all tests passing in scala 2.12 and are fully caught up with the master branch, see: https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212

tovbinm commented 5 years ago

That’s outstanding!! It would be great to have it merged into our codebase together with a cross build changes for Scala 2.11/2.12.

tovbinm commented 5 years ago

@koertkuipers are there any plans on your end to raise a PR with Scala 2.12 changes? ;)

koertkuipers commented 5 years ago

i don't know gradle well, so making it cross-compile for 2.11 and 2.12 is probably not something i will get to.


tovbinm commented 5 years ago

I can help to add it. Let’s start from what you already have.

Helw150 commented 5 years ago

edit: Ah I just saw that you are reverting back to Spark 2.3 so I'll assume this is blocked until the overall repo is ready for 2.4!

@tovbinm I'm not particularly well versed in Gradle, but a cross-compiled TransmogrifAI is something I'm interested in having, so I'm happy to spend some time this weekend taking a stab at it if you don't have free cycles.

Have you had a chance to try from @koertkuipers branch or should I start?

tovbinm commented 5 years ago

We are planning to keep Spark 2.3 with Scala 2.11 in our master branch. Additionally, we would maintain the Spark 2.4 with Scala 2.11/12 branch separately, until the team is ready to move to Spark 2.4.

I haven't had a chance to review @koertkuipers' branch. Where is it?

koertkuipers commented 5 years ago

We are planning to keep Spark 2.3 with Scala 2.11 in our master branch. Additionally, we would maintain the Spark 2.4 with Scala 2.11/12 branch separately, until the team is ready to move to Spark 2.4.

I havent had a chance to review @koertkuipers branch. Where is it?

it's here: https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212. the changes are minimal

tovbinm commented 5 years ago

@koertkuipers it seems that you had to recompile xgboost with scala 2.12. would they not accept a PR as well, so that we could use the official release?

koertkuipers commented 5 years ago

see: https://github.com/dmlc/xgboost/pull/4574 - it has been merged, but there is no release with it yet

wsuchy commented 4 years ago

@tovbinm @koertkuipers it looks like we've finally got unblocked: https://search.maven.org/artifact/ml.dmlc/xgboost-jvm_2.12

tovbinm commented 4 years ago

Sweet! For a start, we need to get the cross-version build going, since I assume you folks are not ready to move fully to 2.12 yet.

nicodv commented 4 years ago

Baby steps: I've done some work on cross-version building in a branch based on @koertkuipers' fork: https://github.com/salesforce/TransmogrifAI/tree/ndv/scala212

I'm using https://github.com/ADTRAN/gradle-scala-multiversion-plugin, which seems to work well.

EDIT: this has since been resolved. Original comment: Looks like we're still blocked by MLeap (and the XGBoost component of that) for 2.12: https://github.com/combust/mleap/issues/697

nicodv commented 4 years ago

An observation: the only version of XGBoost that is based on 2.12 (1.2.0) also requires Spark 3 (https://github.com/dmlc/xgboost/releases). So if we want to keep using XGBoost, there may be no way to separate the Scala 2.12 upgrade from the Spark 3 upgrade.