salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

How to use feature selection with no model training and optimization? #541

Closed krzischp closed 3 years ago

krzischp commented 3 years ago

Describe the bug
Is there a way to get a feature selection result without training the classifiers (BinaryClassificationModelSelector)?

To Reproduce

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())

Expected behavior
According to the TransmogrifAI documentation, the feature selection is univariate rather than embedded: the least useful features are identified by their correlation with the target variable, their low variance, etc. So no model should be necessary.

However, in order to get the selected features, it seems that I have to pass checkedFeatures to a BinaryClassificationModelSelector, then pass that to an OpWorkflow, train it, and only then get the feature ranking. Is there an alternative that lets me skip the model training and optimization step and still get my feature importances?

Indeed, if I only want to use the feature selection functionality to present a feature selection explanation to the user, the model training step is unnecessary and very slow.

nicodv commented 3 years ago

You can use the .computeDataUpTo method on OpWorkflow (instead of .train) for this.

krzischp commented 3 years ago

Thanks for the quick reply! So I wrote:

val df = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(pred)
  .computeDataUpTo(checkedFeatures)

But since this returns a DataFrame instead of a model (as in the previous code), I cannot access the feature importances or the statistics that I would otherwise get using

model.modelInsights(pred)

or

val metadata = fittedWorkflow.getOriginStageOf(checkedFeatures).getMetadata()
val summaryData = SanityCheckerSummary.fromMetadata(metadata.getSummaryMetadata())

Is there a way to get that information (selected features, correlation statistics, etc.) with no model training?

nicodv commented 3 years ago

I'm not sure this is possible. Going through ModelInsights is by far the easiest solution.

To reduce training time, you can override and shrink the grid of models and hyperparameters that get trained, like this:

// A single logistic regression with a single regParam value, instead of the default model grid
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.1)).build())
BinaryClassificationModelSelector.withCrossValidation(modelsAndParameters = models)
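For example, this reduced selector could be dropped into the original pipeline in place of the default one (a sketch based on the snippets above):

// Sketch: plug the reduced model grid into the original workflow
val pred = BinaryClassificationModelSelector
  .withCrossValidation(modelsAndParameters = models)
  .setInput(survived, checkedFeatures)
  .getOutput()

val model = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(pred)
  .train()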
leahmcguire commented 3 years ago

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Setting up a TransmogrifAI workflow and training it only up to the checked features
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(checkedFeatures).train()

println("Model summary:\n" + model.modelInsights(checkedFeatures))

This should do it. Basically, the feature you pass in as a result feature is the last one computed, and you can get insights up to whatever level of the DAG gets computed.
leahmcguire commented 3 years ago

Basically, the workflow will compute the DAG necessary to produce the feature passed in as a result feature. If you change the result feature to be the output of the sanity checker, it will only do those computations.
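For instance, a sketch combining that with the SanityCheckerSummary extraction from the original post (no model selector is involved, so nothing beyond the sanity checker gets fitted):

// Train only up to the sanity checker output...
val fittedWorkflow = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(checkedFeatures)
  .train()

// ...then read the checker's summary (correlations, dropped features, etc.) back out
val metadata = fittedWorkflow.getOriginStageOf(checkedFeatures).getMetadata()
val summaryData = SanityCheckerSummary.fromMetadata(metadata.getSummaryMetadata())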

leahmcguire commented 3 years ago

ModelInsights will work without a model run (the model part will just be empty). If you then want to add in a model using the already-fitted sanity checker, you can do it like this:

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/OpWorkflowTest.scala#L336
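A rough sketch of the pattern that test illustrates, assuming the OpWorkflow.withModelStages helper it uses (check the linked test for the exact API):

// Fit only the sanity checker first
val checkerModel = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(checkedFeatures)
  .train()

// Later, reuse the already-fitted stages when training the full model
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
val fullModel = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(pred)
  .withModelStages(checkerModel)  // assumption: reuses fitted stages instead of refitting them
  .train()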

krzischp commented 3 years ago

Thank you. Unfortunately, I've run into another issue.

I hit the same problem as in https://github.com/salesforce/TransmogrifAI/issues/540 (I'm running on Cloudera with Spark 2.4.0-cdh6.3.4), so I tried to override the jackson-module-scala dependency with a version that implements EitherModule (2.7.3):

My build.sbt file

name := "test-transmogrif"

version := "1.0"

scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.3"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

My running script:

val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.1)).build())
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(modelsAndParameters = models).setInput(target, checkedFeatures).getOutput()
val workflow = new OpWorkflow().setInputDataset(dataReader).setResultFeatures(prediction)
val model = workflow.train()

My project/plugins.sbt file

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

But I got the following error when I ran spark-submit on the assembled jar (built with sbt assembly), at the workflow.train() step:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Class com.fasterxml.jackson.module.scala.OpDefaultScalaModule$ does not implement the requested interface com.fasterxml.jackson.module.scala.modifiers.SeqTypeModifierModule
        at com.fasterxml.jackson.module.scala.modifiers.SeqTypeModifierModule$class.$init$(SeqTypeModifierModule.scala:10)
        at com.fasterxml.jackson.module.scala.OpDefaultScalaModule.<init>(OpDefaultScalaModule.scala:28)
        at com.fasterxml.jackson.module.scala.OpDefaultScalaModule$.<init>(OpDefaultScalaModule.scala:58)
        at com.fasterxml.jackson.module.scala.OpDefaultScalaModule$.<clinit>(OpDefaultScalaModule.scala)
        at com.salesforce.op.utils.json.JsonUtils$.configureMapper(JsonUtils.scala:159)
        at com.salesforce.op.utils.json.JsonUtils$.com$salesforce$op$utils$json$JsonUtils$$jsonMapper(JsonUtils.scala:133)
        at com.salesforce.op.utils.json.JsonUtils$.toJsonString(JsonUtils.scala:97)
        at com.salesforce.op.utils.json.JsonLike$class.toJson(JsonUtils.scala:179)
        at com.salesforce.op.evaluators.BinaryClassificationMetrics.toJson(OpBinaryClassificationEvaluator.scala:179)
        at com.salesforce.op.utils.json.JsonLike$class.toString(JsonUtils.scala:186)
        at com.salesforce.op.evaluators.BinaryClassificationMetrics.toString(OpBinaryClassificationEvaluator.scala:179)
        at com.salesforce.op.evaluators.OpBinaryClassificationEvaluator.evaluateAll(OpBinaryClassificationEvaluator.scala:120)
        at com.salesforce.op.evaluators.OpBinaryClassificationEvaluator.evaluateAll(OpBinaryClassificationEvaluator.scala:56)
        at com.salesforce.op.stages.impl.selector.HasEval$$anonfun$1.apply(ModelSelectorNames.scala:94)
        at com.salesforce.op.stages.impl.selector.HasEval$$anonfun$1.apply(ModelSelectorNames.scala:91)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:296)
        at com.salesforce.op.stages.impl.selector.HasEval$class.evaluate(ModelSelectorNames.scala:91)
        at com.salesforce.op.stages.impl.selector.ModelSelector.evaluate(ModelSelector.scala:71)
        at com.salesforce.op.stages.impl.selector.ModelSelector.fit(ModelSelector.scala:166)
        at com.salesforce.op.stages.impl.selector.ModelSelector.fit(ModelSelector.scala:71)
        at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:264)
        at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:263)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
        at com.salesforce.op.utils.stages.FitStagesUtil$.com$salesforce$op$utils$stages$FitStagesUtil$$fitAndTransformLayer(FitStagesUtil.scala:263)
...
leahmcguire commented 3 years ago

TransmogrifAI is built on Spark 2.4.5. The best way to deal with this is to explicitly exclude the dependency you don't want to pull in.
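For example, a hedged sketch of what that could look like in the build.sbt above (the exact module to exclude depends on which jackson-module-scala version the CDH Spark distribution already provides):

// Sketch: keep only one jackson-module-scala on the classpath by excluding the
// transitive copy; adjust the organization/name to match your dependency tree.
libraryDependencies ++= Seq(
  ("com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0")
    .excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.module"))
)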