koertkuipers opened 5 years ago
Starting from #184, the first issue i ran into is gradle-scalastyle-plugin not being found for scala 2.12. resolved by adding this maven repo:
```groovy
maven {
    name = "ngbinh"
    url = "https://dl.bintray.com/ngbinh/maven"
}
```
To stop the scala compiler from exiting with strange messages like `assertion failed: ClassBType.info not yet assigned: Lorg/apache/spark/rdd/RDD`, i removed the `-optimize` flag.
Just a heads up - with #274 merged and compiling with Scala 2.12, the majority of the tests will fail, e.g. UnaryLambdaTransformerTest, due to stage serialization issues. We need to replace all lambda functions with concrete classes as explained in #274.
Thanks for making the effort to upgrade and support both Scala 2.11/12 versions @koertkuipers !!
after publishing my own in-house versions of xgboost and mleap it seems everything compiles and tests fine using scala 2.12. seemed a little too easy... wonder if i am not running tests correctly.
this is before https://github.com/salesforce/TransmogrifAI/pull/274. i get a bunch of merge conflicts with that one that i need to take a look at. since i didn't make any significant changes, the merge conflicts are likely with the spark-2.4 branch.
xgboost not having the scala version in the artifact name doesn't help, e.g. `xgboost4j-spark-0.90.jar` vs `xgboost4j-spark_2.11-0.90.jar`.
i built xgboost 0.90 for scala 2.11/2.12 here and published it to our bintray.
i built mleap for scala 2.11/2.12 here and published it to our bintray as well.
my branch of transmogrifai for scala 2.12 is feat-scala212. it's currently set to build only for scala 2.12 since i don't yet know how to do both with gradle in one pass, but it also builds fine for scala 2.11 if you simply change `scalaVersion` and `scalaVersionRevision`.
Nice! Please pull the latest from https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 since I just updated it.
I would like to try adding a cross build for 2.11/12
i expected to see serialization issues after merging in https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 due to https://github.com/salesforce/TransmogrifAI/pull/274, but i don't :confused:
i see you put a lot of functions in companion objects like:
```scala
object BinaryTransformerTest {
  def fn: (Real, RealNN) => Real = (i1, i2) => new Real(for {v1 <- i1.value; v2 <- i2.value} yield v1 / (v2 * v2))
}
```
do you expect this approach to work for scala 2.12?
I think it works only because the models are serialized and loaded within the same compiled code. If you train a model, save it, then recompile the codebase and try to load the model - it would fail.
@koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names each time you compile your code. So a model trained on your laptop will not load in your production environment due to missing class references. We are currently working on a fix for that.
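To illustrate (a minimal standalone sketch, not project code): in 2.12 a lambda is synthesized at runtime through LambdaMetafactory, so its class has no stable name, while a hand-written class keeps a fully-qualified name that survives recompilation.
```scala
object LambdaNameDemo extends App {
  val lambda: Int => String = x => x.toString
  // prints something like LambdaNameDemo$$$Lambda$15/1404928347 -
  // a synthetic class that cannot be looked up by name after recompiling
  println(lambda.getClass.getName)

  class ToStringFn extends Function1[Int, String] with Serializable {
    def apply(x: Int): String = x.toString
  }
  // prints LambdaNameDemo$ToStringFn - a stable name that
  // Class.forName can resolve in another JVM
  println(new ToStringFn().getClass.getName)
}
```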
@tovbinm @wsuchy is there any way you can think of to test this? would forking for tests do it?
The test should be constructed as follows: keep pretrained models under core/test/resources, like this one. This way we would have a set of models that contain stages saved from previous JVM runs, which would allow us to test that the models load & produce scores correctly.
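A hedged sketch of what such a test could look like (the resource layout and the exact loading/scoring calls are assumptions for illustration, not confirmed API details):
```scala
import com.salesforce.op.OpWorkflow

// Sketch: a workflow built with the same feature/stage definitions that
// produced the saved model (construction elided here).
val workflow: OpWorkflow = ???

// Model files checked into core/test/resources by a previous JVM run
// (path name is hypothetical).
val modelPath = getClass.getResource("/pretrainedModels/simpleModel").getPath

// Loading must resolve every stage's transformFn by class name; with 2.12
// lambdas this is exactly where ClassNotFoundException would surface.
val model = workflow.loadModel(modelPath)

// The reloaded model should still produce scores.
val scores = model.score() // assumes an implicit SparkSession is in scope
assert(scores.count() > 0)
```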
I propose to follow the migration guide for 2.12 in #274 and replace all the lambda functions introduced with concrete classes.
> after publishing my own in-house versions of xgboost and mleap it seems everything compiles and tests fine using scala 2.12. seemed a little too easy... wonder if i am not running tests correctly. this is before #274.
i spoke too soon about everything testing fine. i was using `compileTestScala`, but now i tried `test`, which runs more tests, and some fail. so i got some more work to do.
all the tests except for one pass before #274. the only failure before #274 is:
```
[-] RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label (1.33 secs)

com.salesforce.op.stages.impl.insights.RecordInsightsLOCOTest > RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label FAILED
    org.scalatest.exceptions.TestFailedException: 0.953500197613813 was not less than 0.8 The ratio of feature strengths between important and other features should be similar to the ratio of feature importances from Spark's RandomForest
```
after #274 i have lots of failures. but this is good news, as this was expected and we know how to fix them. the failures look like this:
```
java.lang.RuntimeException: Failed to write out stage 'uid_1234'
Caused by:
java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.
Caused by:
java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.
Caused by:
java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838
```
i tried to convert some lambdas to concrete classes as explained in the #274 migration notes. for example in PassengerFeaturesTest.scala this:
```scala
def genderFn: Passenger => MultiPickList = p => Set(p.getGender).toMultiPickList
```
i replaced by:
```scala
class GenderFn extends Function1[Passenger, MultiPickList] {
  def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
}
```
this did not work. i now get:
```
java.io.NotSerializableException: com.salesforce.op.test.PassengerFeaturesTestLambdas$GenderFn
```
however it works if i add Serializable:
```scala
class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
  def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
}
```
i always thought scala functions extend serializable. but it's a little more complex than that:
```
scala> class X extends Function1[String, String]{ def apply(s: String) = s + "!" }
defined class X

scala> val x = new X
x: X = <function1>

scala> x.asInstanceOf[Serializable]
java.lang.ClassCastException: X cannot be cast to scala.Serializable
  ... 32 elided

scala> val y = { (s: String) => s + "!" }
y: String => String = <function1>

scala> y.asInstanceOf[Serializable]
res6: Serializable = <function1>
```
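The cast only tests for the marker trait; a quick round-trip sketch (plain java.io, class names here are illustrative) shows the practical effect: the compiler marks the classes it synthesizes for function literals as Serializable, but a hand-written Function1 subclass gets no such treatment, so the mixin must be added explicitly.
```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializableRoundTrip extends App {
  // Attempt Java serialization and report whether the object is serializable.
  def tryWrite(fn: AnyRef): Unit =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(fn)
      println(s"ok: ${fn.getClass.getName}")
    } catch {
      case e: NotSerializableException => println(s"failed: ${e.getMessage}")
    }

  class X extends (String => String) { def apply(s: String): String = s + "!" }
  class XFixed extends (String => String) with Serializable {
    def apply(s: String): String = s + "!"
  }

  tryWrite((s: String) => s + "!") // ok: literals are compiled as Serializable
  tryWrite(new X)                  // failed: hand-written subclass is not
  tryWrite(new XFixed)             // ok: the explicit mixin fixes it
}
```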
> @koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names each time you compile your code. So a model trained on your laptop will not load in your production environment due to missing class references.
@wsuchy @tovbinm is what i am seeing the same issue? or is this something else?
i am seeing:
```
java.lang.RuntimeException: Failed to write out stage 'uid_1234'
Caused by:
java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.
Caused by:
java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.
Caused by:
java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838
```
@koertkuipers the way I see this working is to explicitly wrap lambdas inside a specific object (one lambda per object), e.g.:
```scala
// type parameters added here so the trait compiles; the original sketch
// used wildcard types, which are not valid in this position
trait TMOGLambda[-A, +B] extends Serializable {
  val f: A => B
}

object Lambda1 extends TMOGLambda[Int, String] {
  override val f = (x: Int) => x.toString
}
```
And then have the serializer check for the presence of such an object and hardcode it to always load `f`, or whatever we name the property holding our lambda.
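A hedged sketch of what that serializer support could look like (names and the reflection approach are illustrative, not the actual implementation): persist only the object's class name, then resolve the singleton and its `f` member back on load.
```scala
// Illustrative only: resolve a Scala object by name and pull out its lambda.
// Scala objects compile to a class with a trailing '$' and a static MODULE$ field.
def loadLambda[A, B](className: String): A => B = {
  val clazz = Class.forName(className + "$")
  val module = clazz.getField("MODULE$").get(null)
  module.asInstanceOf[TMOGLambda[A, B]].f
}

// usage sketch (hypothetical object name):
// val fn = loadLambda[Int, String]("com.example.Lambda1")
// fn(42) // "42"
```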
I think it's cleaner to have classes extend the FunctionX base classes, i.e.
```scala
class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
  def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
}
```
@tovbinm the problem will be supporting it from macros, as at the time we evaluate them there is no type information.
@tovbinm @wsuchy take a look at https://github.com/tresata-opensource/TransmogrifAI/commit/95095ed870116ebcfb9759a3ab04d7679989f5c5 - i simply replaced lambdas with concrete classes until all the tests passed. it's not pretty but it works.
i am unsure if i have to add more tests for workflow-independent model loading, e.g. loading a model from json files.
Found this out today: in our hello world example, code like this:
```scala
val sepalLength = FeatureBuilder.Real[Iris].extract(_.sepalLength.toReal).asPredictor
```
results in json containing:
```
"extractFn" : {
  "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5",
  "value" : {
    "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5"
  }
},
```
Even though one can still instantiate the workflow back, the names are volatile and depend on the order of occurrence. I think we should throw an exception instead.
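A hypothetical sketch of such a guard (method name and message are illustrative, not project code): reject persisting any function whose class was generated by the compiler, since those names will not resolve after a recompile.
```scala
// Illustrative guard: refuse to persist anonymous/synthetic function classes.
def assertStableClassName(fn: AnyRef): Unit = {
  val name = fn.getClass.getName
  val isSynthetic =
    fn.getClass.isAnonymousClass || name.contains("$anonfun$") || name.contains("$$Lambda$")
  if (isSynthetic) throw new IllegalArgumentException(
    s"'$name' is a compiler-generated class and will not reload across builds; " +
      "extract the function into a concrete class or object instead")
}
```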
ok we now have all tests passing in scala 2.12 and are fully caught up with the master branch, see: https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212
That’s outstanding!! It would be great to have it merged into our codebase together with cross-build changes for Scala 2.11/2.12.
@koertkuipers are there any plans on your end to raise a PR with Scala 2.12 changes? ;)
i don't know gradle well, so making it cross-compile for 2.11 and 2.12 is probably not something i will get to.
I can help to add it. Let’s start from what you already have.
edit: Ah I just saw that you are reverting back to Spark 2.3 so I'll assume this is blocked until the overall repo is ready for 2.4!
@tovbinm I'm not particularly well versed in Gradle, but a cross-compiled TransmogrifAI is something I'm interested in having, so I'm happy to spend some time this weekend taking a stab at it if you don't have free cycles.
Have you had a chance to try @koertkuipers branch, or should I start?
We are planning to keep Spark 2.3 with Scala 2.11 in our master branch. Additionally, we would maintain the Spark 2.4 with Scala 2.11/2.12 branch separately, until the team is ready to move to Spark 2.4.
I haven't had a chance to review @koertkuipers branch. Where is it?
> I haven't had a chance to review @koertkuipers branch. Where is it?
it's here: https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212 - the changes are minimal
@koertkuipers it seems that you had to recompile xgboost with scala 2.12. would they not accept a PR as well, so that we could use an official release?
see: https://github.com/dmlc/xgboost/pull/4574 - it has been merged, but there is no release yet
@tovbinm @koertkuipers it looks like we've finally got unblocked: https://search.maven.org/artifact/ml.dmlc/xgboost-jvm_2.12
Sweet! First, we need to get the cross-version build going, since I assume you folks are not ready for 2.12 yet.
Baby steps: I've done some work on cross-version building in a branch based on @koertkuipers fork: https://github.com/salesforce/TransmogrifAI/tree/ndv/scala212
I'm using https://github.com/ADTRAN/gradle-scala-multiversion-plugin, which seems to work well.
EDIT: this has been resolved ----> Looks like we're still blocked by MLeap (and the XGBoost component of that) for 2.12: https://github.com/combust/mleap/issues/697
An observation: the only version of XGBoost built for Scala 2.12 (1.2.0) also requires Spark 3 (https://github.com/dmlc/xgboost/releases). So if we want to keep using XGBoost, there may be no way to separate the Scala 2.12 upgrade from the Spark 3 upgrade.
Spark 2.4 has both Scala 2.11 and 2.12 builds. Starting with Spark 3.0, Scala 2.12 will become the default and Scala 2.11 will be dropped.
This depends on https://github.com/salesforce/TransmogrifAI/issues/184
This will require: