Closed: krzischp closed this issue 3 years ago
You can use the .computeDataUpTo method on OpWorkflow (instead of .train) for this.
Thanks for replying quickly! So I wrote
val df = new OpWorkflow()
.setInputDataset(passengersData)
.setResultFeatures(pred)
.computeDataUpTo(checkedFeatures)
But since I get a DataFrame instead of a model (as in the previous code), I cannot access the feature importances or the statistics that I would normally access via
model.modelInsights(pred)
or
val metadata = fittedWorkflow.getOriginStageOf(checkedFeatures).getMetadata()
val summaryData = SanityCheckerSummary.fromMetadata(metadata.getSummaryMetadata())
Is there a way to get that information (selected features, correlation statistics, etc.) without any model training?
I'm not sure this is possible. Going through ModelInsights
is by far the easiest solution.
To reduce training time, you can override and shrink the grid of models and hyperparameters that get trained, like so:
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.1)).build())
BinaryClassificationModelSelector.withCrossValidation(modelsAndParameters = models)
// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(checkedFeatures).train()
println("Model summary:\n" + model.modelInsights(checkedFeatures))
That should do it: the feature you pass in as a result feature is the final one computed, and you can get insights up to whatever level of the DAG that computation reaches.
In other words, the workflow computes only the part of the DAG necessary to produce the features passed in via setResultFeatures. If you change the result feature to be the output of the sanity checker, it will run only those computations.
ModelInsights will work without a model run (the model part will just be empty). If you then want to add a model using the already-fit sanity checker, you can do it like this:
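The code that followed "like this:" appears to be missing from the thread. A plausible sketch, reusing the variable names from the snippet above (survived, checkedFeatures, passengersData, model) and OpWorkflow's withModelStages method, which copies already-fitted stages into a new workflow so only the new stages need training:

// Hypothetical sketch (not from the original thread): adding a model on top of
// the already-fit sanity checker.
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.1)).build())
val pred = BinaryClassificationModelSelector
  .withCrossValidation(modelsAndParameters = models)
  .setInput(survived, checkedFeatures)
  .getOutput()
// withModelStages reuses the stages fitted in the earlier workflow (e.g. the
// sanity checker), so training here fits only the model selector stage.
val fullModel = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(pred)
  .withModelStages(model)
  .train()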
Thank you. Unfortunately, I've run into another issue.
I hit the same problem as in https://github.com/salesforce/TransmogrifAI/issues/540 (I'm running on Cloudera with Spark 2.4.0-cdh6.3.4), so I tried to override the jackson-module-scala dependency with a version that implements EitherModule (2.7.3):
My build.sbt file
name := "test-transmogrif"
version := "1.0"
scalaVersion := "2.11.12"
resolvers += Resolver.jcenterRepo
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.3"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
My running script:
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.1)).build())
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(modelsAndParameters = models).setInput(target, checkedFeatures).getOutput()
val workflow = new OpWorkflow().setInputDataset(dataReader).setResultFeatures(prediction)
val model = workflow.train()
My project/plugins.sbt file
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
But I got this error when I ran spark-submit on the assembled jar (built with sbt assembly), at the workflow.train() step:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Class com.fasterxml.jackson.module.scala.OpDefaultScalaModule$ does not implement the requested interface com.fasterxml.jackson.module.scala.modifiers.SeqTypeModifierModule
at com.fasterxml.jackson.module.scala.modifiers.SeqTypeModifierModule$class.$init$(SeqTypeModifierModule.scala:10)
at com.fasterxml.jackson.module.scala.OpDefaultScalaModule.<init>(OpDefaultScalaModule.scala:28)
at com.fasterxml.jackson.module.scala.OpDefaultScalaModule$.<init>(OpDefaultScalaModule.scala:58)
at com.fasterxml.jackson.module.scala.OpDefaultScalaModule$.<clinit>(OpDefaultScalaModule.scala)
at com.salesforce.op.utils.json.JsonUtils$.configureMapper(JsonUtils.scala:159)
at com.salesforce.op.utils.json.JsonUtils$.com$salesforce$op$utils$json$JsonUtils$$jsonMapper(JsonUtils.scala:133)
at com.salesforce.op.utils.json.JsonUtils$.toJsonString(JsonUtils.scala:97)
at com.salesforce.op.utils.json.JsonLike$class.toJson(JsonUtils.scala:179)
at com.salesforce.op.evaluators.BinaryClassificationMetrics.toJson(OpBinaryClassificationEvaluator.scala:179)
at com.salesforce.op.utils.json.JsonLike$class.toString(JsonUtils.scala:186)
at com.salesforce.op.evaluators.BinaryClassificationMetrics.toString(OpBinaryClassificationEvaluator.scala:179)
at com.salesforce.op.evaluators.OpBinaryClassificationEvaluator.evaluateAll(OpBinaryClassificationEvaluator.scala:120)
at com.salesforce.op.evaluators.OpBinaryClassificationEvaluator.evaluateAll(OpBinaryClassificationEvaluator.scala:56)
at com.salesforce.op.stages.impl.selector.HasEval$$anonfun$1.apply(ModelSelectorNames.scala:94)
at com.salesforce.op.stages.impl.selector.HasEval$$anonfun$1.apply(ModelSelectorNames.scala:91)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at com.salesforce.op.stages.impl.selector.HasEval$class.evaluate(ModelSelectorNames.scala:91)
at com.salesforce.op.stages.impl.selector.ModelSelector.evaluate(ModelSelector.scala:71)
at com.salesforce.op.stages.impl.selector.ModelSelector.fit(ModelSelector.scala:166)
at com.salesforce.op.stages.impl.selector.ModelSelector.fit(ModelSelector.scala:71)
at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:264)
at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:263)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.salesforce.op.utils.stages.FitStagesUtil$.com$salesforce$op$utils$stages$FitStagesUtil$$fitAndTransformLayer(FitStagesUtil.scala:263)
...
TransmogrifAI is built against Spark 2.4.5. The best way to deal with this is to explicitly exclude the dependency you don't want pulled in.
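As an illustration, a hedged build.sbt sketch of such an exclusion, assuming the conflict comes from the jackson-module-scala that transmogrifai-core pulls in transitively (the exact coordinates are an assumption; verify the real offender with a dependency report such as sbt dependencyTree):

// Hypothetical sketch: exclude the transitive jackson-module-scala so only the
// explicitly pinned version remains on the classpath. exclude(org, name) on a
// ModuleID drops that transitive artifact from this dependency only.
libraryDependencies ++= Seq(
  ("com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0")
    .exclude("com.fasterxml.jackson.module", "jackson-module-scala_2.11"),
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.3"
)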
Describe the bug
Is there a way to get a feature selection result without training the classifiers (BinaryClassificationModelSelector)?
To Reproduce
Expected behavior
According to the TransmogrifAI documentation, the feature selection is univariate, not embedded: the least performant features are identified by their correlation with the target variable, by their low variance, etc. So no model should be necessary.
However, to get the selected-features result, it seems that I have to pass checkedFeatures to a BinaryClassificationModelSelector, then pass that to a Workflow and train it to get the feature ranking. Is there an alternative that lets me skip the model training and optimization step and still get my feature importances?
Indeed, if I only want to use this feature selection functionality to present a feature selection explanation to the user, the model training step is unnecessary and really slow.