salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String at com.salesforce.op.features.types.FeatureTypeSparkConverter$$anonfun$2.apply(FeatureTypeSparkConverter.scala:146) #520

Open hjfrank1991 opened 3 years ago

hjfrank1991 commented 3 years ago

When I use the iris.csv data:

1,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa

I create a StructType like this:

    val schema = StructType(
      Array(
        StructField("id", IntegerType, nullable = false),
        StructField("sepalLength", DoubleType, nullable = false).withComment("feature"),
        StructField("sepalWidth", DoubleType, nullable = false).withComment("feature"),
        StructField("petalLength", DoubleType, nullable = false).withComment("feature"),
        StructField("petalWidth", DoubleType, nullable = false).withComment("feature"),
        StructField("irisClass", StringType, nullable = false).withComment("label")
      )
    )

Next I get the label column and the feature columns:

val dataFrame = ...
val name = "irisClass"
val (irisClass, predictors)  = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)

The id column is neither the label nor a feature, but with the call above it is also treated as a feature column, which I don't want. So I select only the columns whose comment is label or feature and drop the others:

val frame = dataFrame.drop("id")
val (irisClass, predictors)  = FeatureBuilder.fromDataFrame[Text](frame, response = name)

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val index = survived.indexed("__unknown", StringIndexerHandleInvalid.Keep)

val checkedFeatures = index.sanityCheck(featureVector, removeBadFeatures = true)

val pred = MultiClassificationModelSelector
  //.withCrossValidation()
  .withTrainValidationSplit()
  .setInput(index, checkedFeatures)
  .setOutputFeatureName("pred")
  .getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model: OpWorkflowModel = new OpWorkflow()
  .setInputDataset(frame)
  .setResultFeatures(pred)
  .train()

// save
model.save(path = "/model/automl", overwrite = true)

// load
val loadmodel = OpWorkflowModel.load("/model/automl")

// getAllFeatures
val features = loadmodel.getRawFeatures().map(_.name)

// use model to predict new data 
// Changing the order of columns
val frame3 = frame.select(features.head, features.tail: _*)
val dataFrame1 = loadmodel.setInputDataset(frame3)
  .score()
dataFrame1.show(false)

But I get this error:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
    at com.salesforce.op.features.types.FeatureTypeSparkConverter$$anonfun$2.apply(FeatureTypeSparkConverter.scala:146)

hjfrank1991 commented 3 years ago

If I change it to this:

val dataFrame1 = loadmodel.setInputDataset(frame)
.score()
dataFrame1.show(false)

Then it works. So when I use the model to predict on new data, can't I change the order of the columns?

tovbinm commented 3 years ago

In your example you do not seem to be using the frame you created. Try this:

// Drop id column
val frame = dataFrame.drop("id")

// Extract response and predictor Features
val (irisClass, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = "irisClass")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val index = irisClass.indexed("__unknown", StringIndexerHandleInvalid.Keep)
val checkedFeatures = index.sanityCheck(featureVector, removeBadFeatures = true)

val pred = MultiClassificationModelSelector
  .withTrainValidationSplit()
  .setInput(index, checkedFeatures)
  .setOutputFeatureName("pred")
  .getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model: OpWorkflowModel = new OpWorkflow()
  .setInputDataset(frame)
  .setResultFeatures(pred)
  .train()

val scored = model.setInputDataset(frame).score()

scored.show(false)

hjfrank1991 commented 3 years ago

Sorry, that was a typo in my post. The version I have in IDEA is correct:

// Extract response and predictor Features 
val (survived, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = name)

Your example is right. But when I reorder the columns of this frame (calling the result frame_new) and then use the model to predict, I get the error:

val scored = model.setInputDataset(frame_new).score()

So when we predict, does the data have to keep the same column order?

hjfrank1991 commented 3 years ago

Also, can we use this like a Spark ML pipeline? For example:

val (irisClass, predictors1) = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)
val strindex = new OpStringIndexer()
  .setInput(irisClass)
  .setOutputFeatureName("index")

val strModel = strindex.fit(dataFrame)
val mm = strModel.getSparkMlStage() match {
  case Some ( x ) => x
}

val opdt = new OpDecisionTreeClassifier()
  .setInput(strindex.getOutput(), featureVector1)
  .setOutputFeatureName("dtPred")

val labels = mm.labels

val inde = new OpIndexToString()
  .setInput(strindex.getOutput())
  .setLabels(labels)
  .setOutputFeatureName("pred")

val pipelineModel = new Pipeline("getAlgorithmType")
  .setStages(Array(strindex, opdt, inde))
  .fit(dataFrame)

Do you have an example like that?

tovbinm commented 3 years ago

We never tried reordering the columns. In general, this should not be an issue since we refer to the columns by their names. Why would you need to do it?

TransmogrifAI stages can be used in Spark ML pipelines as long as you maintain the naming conventions on the columns.

hjfrank1991 commented 3 years ago

After training the model, we reuse it to predict on a new batch of data. The column names in that batch are the same, but the column order is different. If the model cannot handle a different column order, that reduces its generality.

tovbinm commented 3 years ago

OK, I just went through the code. Each Feature that was constructed from a Dataframe Row has an index property which is used to locate the feature column in each row.
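To make the effect of a stored index concrete, here is a minimal toy sketch in plain Scala. It is not TransmogrifAI code: `ToyFeature`, `PositionalLookupDemo`, and `readLabel` are hypothetical names invented for illustration, and a `Seq[Any]` stands in for a Spark `Row`. It only shows how a position remembered at training time can point at a Double once the columns are reordered.

```scala
// Toy model only: these types are hypothetical and are not TransmogrifAI classes.
final case class ToyFeature(name: String, index: Int)

object PositionalLookupDemo {
  // Column order seen at training time (after dropping "id").
  val trainColumns: Seq[String] =
    Seq("sepalLength", "sepalWidth", "petalLength", "petalWidth", "irisClass")

  // Each feature remembers the position its column had when it was built.
  val features: Seq[ToyFeature] =
    trainColumns.zipWithIndex.map { case (name, i) => ToyFeature(name, i) }

  val irisClass: ToyFeature = features.last

  // Reads the label by stored position and casts it to String.
  def readLabel(row: Seq[Any]): String =
    row(irisClass.index).asInstanceOf[String]

  def main(args: Array[String]): Unit = {
    // Same column order as in training: the stored index points at the label.
    val sameOrder: Seq[Any] = Seq(5.1, 3.5, 1.4, 0.2, "Iris-setosa")
    println(readLabel(sameOrder))

    // Label moved to the front: the stored index now points at a Double,
    // so the cast fails with a ClassCastException.
    val reordered: Seq[Any] = Seq("Iris-setosa", 5.1, 3.5, 1.4, 0.2)
    println(scala.util.Try(readLabel(reordered)).isFailure)
  }
}
```

In this toy version the column names never change, yet the lookup still breaks, which matches the symptom reported above.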

One option I see to overcome this is to recreate the features prior to scoring using the new dataset, then use them as input for the model.
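A different and simpler workaround, not the feature recreation described above, might be to put the new batch back into the training column order before scoring. The sketch below shows that alignment in plain Scala on a single row; `AlignColumnsDemo` and `alignByName` are hypothetical names, and the assumption is that the training column order (for example `frame.columns`) was saved alongside the model.

```scala
object AlignColumnsDemo {
  /** Reorders `row` (laid out as `actual`) into the column order given by `expected`. */
  def alignByName(expected: Seq[String], actual: Seq[String], row: Seq[Any]): Seq[Any] = {
    require(actual.length == row.length, "row width must match its column list")
    // Map each column name to its position in the incoming batch.
    val position: Map[String, Int] = actual.zipWithIndex.toMap
    val missing = expected.filterNot(position.contains)
    require(missing.isEmpty, s"missing columns: ${missing.mkString(", ")}")
    // Pick the values in the order the model saw during training.
    expected.map(name => row(position(name)))
  }

  def main(args: Array[String]): Unit = {
    val trainColumns = Seq("sepalLength", "sepalWidth", "petalLength", "petalWidth", "irisClass")
    val batchColumns = Seq("irisClass", "petalWidth", "petalLength", "sepalWidth", "sepalLength")
    val batchRow: Seq[Any] = Seq("Iris-setosa", 0.2, 1.4, 3.5, 5.1)
    println(alignByName(trainColumns, batchColumns, batchRow))
  }
}
```

On a Spark DataFrame the same alignment can be written as `batch.select(trainColumns.map(org.apache.spark.sql.functions.col): _*)`. Whether this is enough to avoid the exception in TransmogrifAI has not been verified in this thread.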

hjfrank1991 commented 3 years ago

I don't quite understand. Do you mean I should create the features from the new dataset and then use the original model to predict?

hjfrank1991 commented 3 years ago

When I use val features = loadmodel.getRawFeatures().map(_.name), the order is also changed.