salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

java.lang.AbstractMethodError when initializing Splitter #469

Closed (dzlab closed 4 years ago)

dzlab commented 4 years ago

Describe the bug When training on a binary-classification example, I hit a java.lang.AbstractMethodError when the library code tries to initialize a Splitter.

To Reproduce This is the code I'm using; it is based primarily on the Titanic binary-classification training example.

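// Imports needed by the snippet (package names as used in the TransmogrifAI Titanic example)
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.DoubleType

object TransmogrifAI {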

  def trainBinaryClassClassifier(df: DataFrame, target: String)(implicit spark: SparkSession): OpWorkflowModel = {
    // Extract response and predictor features
    val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](df, response = target)

    // Automated feature engineering
    val featureVector = predictors.transmogrify()

    // Automated feature validation and selection
    val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

    // Automated model selection
    val pred = BinaryClassificationModelSelector
      .withTrainValidationSplit(modelTypesToUse = Seq(OpLogisticRegression))
      .setInput(survived, checkedFeatures)
      .getOutput()

    // Setting up a TransmogrifAI workflow and training the model
    val model = new OpWorkflow().setInputDataset(df).setResultFeatures(pred).train()
    model
  }

  def predictBinaryClassClassifier(df: DataFrame, model: OpWorkflowModel)(implicit spark: SparkSession): DataFrame = {
    model.setInputDataset(df).score()
  }

  def main(args: Array[String]): Unit = {

    import org.apache.spark.sql.functions.col
    implicit val spark = Spark.sparkSession
    import spark.implicits._

    val reader = spark.read.format("csv")
      .options(Map("inferSchema"->"true","delimiter"->",", "header"->"true"))
    // Read Titanic data as a DataFrame
    val trainDF = reader.load("/tmp/titanic-train.csv").
      withColumn("survived", col("survived").cast(DoubleType))

    val predictDF = reader.load("/tmp/titanic-predict.csv")

    val model = trainBinaryClassClassifier(trainDF, "survived")
    println("Model summary:\n" + model.summaryPretty())
    val output = predictBinaryClassClassifier(predictDF, model)
    println("Prediction:\n" + output.collect().map(_.mkString(",")).mkString("\n"))
  }
}

Expected behavior The model should train successfully without crashing.

Logs or screenshots Output logs from running the main() function in the example above:

Exception in thread "main" java.lang.AbstractMethodError
    at org.apache.spark.ml.param.Params$class.$init$(params.scala:868)
    at com.salesforce.op.stages.impl.tuning.Splitter.<init>(Splitter.scala:47)
    at com.salesforce.op.stages.impl.tuning.DataSplitter.<init>(DataSplitter.scala:62)
    at com.salesforce.op.stages.impl.tuning.DataSplitter$.apply(DataSplitter.scala:51)
    at com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector$.withTrainValidationSplit$default$1(BinaryClassificationModelSelector.scala:211)
    at dzlab.automl.TransmogrifAI$.trainBinaryClassClassifier(TransmogrifAI.scala:40)
    at dzlab.automl.TransmogrifAI$.main(TransmogrifAI.scala:73)
    at dzlab.automl.TransmogrifAI.main(TransmogrifAI.scala)

Additional context Dependency versions used in the project (sbt):

val sparkVersion = "2.4.0"

  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "org.apache.spark" %% "spark-hive-thriftserver" % sparkVersion

  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.6.1"

tovbinm commented 4 years ago

This problem occurs when the Spark version used in your project does not match the Spark version TransmogrifAI was built against. Please use Spark 2.3.x with TransmogrifAI 0.6.1.
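As a rough illustration of that advice, the dependency list from the report could be pinned to the Spark 2.3 line along these lines. This is only a sketch: the patch version 2.3.2 and the Scala version 2.11.12 are assumptions chosen for the example, not values given in this thread.

```scala
// build.sbt fragment (sketch). Assumed values: Scala 2.11.12 and Spark 2.3.2;
// the thread only says "Spark 2.3.x with TransmogrifAI 0.6.1".
scalaVersion := "2.11.12"

val sparkVersion = "2.3.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "org.apache.spark" %% "spark-hive-thriftserver" % sparkVersion,
  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.6.1"
)
```

Keeping every Spark module on the same `sparkVersion` value avoids mixing binary-incompatible Spark jars on the classpath.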

If you would like to use Spark 2.4.x, you need to pull the master branch, then compile and publish version 0.7.0-SNAPSHOT locally, i.e. run ./gradlew publishToMavenLocal
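Once the snapshot has been published to the local Maven repository, an sbt project still has to be told to look there. A minimal sketch, assuming the artifact keeps the same group and module name as the released 0.6.1 version:

```scala
// build.sbt fragment (sketch): pick up artifacts written by `./gradlew publishToMavenLocal`
resolvers += Resolver.mavenLocal

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  // Assumption: the locally published snapshot uses the same coordinates as the 0.6.1 release
  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0-SNAPSHOT"
)
```

`Resolver.mavenLocal` points sbt at the `~/.m2/repository` directory, which is where a Maven-local publish normally ends up.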

dzlab commented 4 years ago

Thanks for the hint, I will probably build it locally. What's the plan for supporting Spark 2.4? Will it be available soon?

tovbinm commented 4 years ago

Starting with the next release, which will hopefully happen soon.

tovbinm commented 4 years ago

TransmogrifAI 0.7.0 now supports Spark 2.4.
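Because the mismatch in this issue only surfaced at runtime as a bare AbstractMethodError, a project could add a small fail-fast check at startup so that a wrong Spark version produces a readable message instead. The helper below is hypothetical (`SparkVersionGuard` is not part of TransmogrifAI or Spark) and is only a sketch of the idea:

```scala
// Hypothetical helper, not part of TransmogrifAI or Spark.
object SparkVersionGuard {

  /** True when `actual` (e.g. "2.4.0") belongs to the `expectedMajorMinor` line (e.g. "2.4"). */
  def isCompatible(actual: String, expectedMajorMinor: String): Boolean =
    actual == expectedMajorMinor || actual.startsWith(expectedMajorMinor + ".")

  /** Throws IllegalArgumentException with a descriptive message on a mismatch. */
  def requireCompatible(actual: String, expectedMajorMinor: String): Unit =
    require(
      isCompatible(actual, expectedMajorMinor),
      s"Spark $actual is on the classpath, but this build expects Spark $expectedMajorMinor.x"
    )
}

// Usage inside a Spark application, where `spark` is the active SparkSession:
//   SparkVersionGuard.requireCompatible(spark.version, "2.4")
```

The check compares only the major.minor prefix, since the guidance in this thread is given per minor line (2.3.x for 0.6.1, 2.4 for 0.7.0).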