Open hjfrank1991 opened 3 years ago
If I change it to this:
val dataFrame1 = loadmodel.setInputDataset(frame)
  .score()
dataFrame1.show(false)
it works. So when I use the model to predict on data, I can't change the order of the columns?
In your example you do not seem to be using the frame you created. Try this:
// Drop id column
val frame = dataFrame.drop("id")
// Extract response and predictor Features
val (irisClass, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = "irisClass")
// Automated feature engineering
val featureVector = predictors.transmogrify()
// Automated feature validation and selection
val index = irisClass.indexed("__unknown", StringIndexerHandleInvalid.Keep)
val checkedFeatures = index.sanityCheck(featureVector, removeBadFeatures = true)
val pred = MultiClassificationModelSelector
  .withTrainValidationSplit()
  .setInput(index, checkedFeatures)
  .setOutputFeatureName("pred")
  .getOutput()
// Setting up a TransmogrifAI workflow and training the model
val model: OpWorkflowModel = new OpWorkflow()
  .setInputDataset(frame)
  .setResultFeatures(pred)
  .train()
val scored = model.setInputDataset(frame).score()
scored.show(false)
Sorry, I made a typo above. This is what I have in IDEA, and it is correct:
// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = name)
Your example is right, but when I take this frame, change the order of its columns (calling the result frame_new), and then use the model to predict, I get a bug:
val scored = model.setInputDataset(frame_new).score()
So the data we predict on has to keep the same column order?
Also, can we use this like a Spark ML pipeline? For example:
val (irisClass, predictors1) = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)
val strindex = new OpStringIndexer()
  .setInput(irisClass)
  .setOutputFeatureName("index")
val strModel = strindex.fit(dataFrame)
val mm = strModel.getSparkMlStage() match {
  case Some(x) => x
}
val opdt = new OpDecisionTreeClassifier()
  .setInput(strindex.getOutput(), featureVector1)
  .setOutputFeatureName("dtPred")
val labels = mm.labels
val inde = new OpIndexToString()
  .setInput(strindex.getOutput())
  .setLabels(labels)
  .setOutputFeatureName("pred")
val pipelineModel = new Pipeline("getAlgorithmType")
  .setStages(Array(strindex, opdt, inde))
  .fit(dataFrame)
Do you have an example like that?
We never tried reordering the columns. In general, this should not be an issue since we refer to the columns by their names. Why would you need to do it?
TransmogrifAI stages can be used in Spark ML pipelines as long as you maintain the naming conventions on the columns.
After we train the model, we use it again to predict on a batch of data whose column names are the same but whose column order is different. If the model cannot handle a change in the column order of the data it reads, that reduces its generality.
OK, I just went through the code. Each Feature that was constructed from a DataFrame Row has an index property, which is used to locate the feature column in each row.
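To illustrate why a stored position breaks when the columns are reordered, here is a toy sketch in plain Scala. It is not TransmogrifAI's actual code: the `byIndex` and `byName` helpers and the `sepalLength` and `sepalWidth` column names are made up for the illustration.

```scala
// Hypothetical illustration only; this is not TransmogrifAI's implementation.
object PositionalLookupDemo {
  // A "row" is just an ordered sequence of values, similar to a Spark Row.
  type Row = Seq[Any]

  // Extractor that remembers the column position captured at training time.
  def byIndex(index: Int): Row => Any = row => row(index)

  // Extractor that resolves the position from the schema of the data it is given.
  def byName(name: String, schema: Seq[String]): Row => Any = {
    val index = schema.indexOf(name)
    require(index >= 0, s"column '$name' not found in schema")
    row => row(index)
  }

  // Training data layout (column names other than irisClass are assumptions).
  val trainSchema: Seq[String] = Seq("sepalLength", "sepalWidth", "irisClass")
  val trainRow: Row = Seq(5.1, 3.5, "Iris-setosa")

  // Scoring data: same column names, different order.
  val scoreSchema: Seq[String] = Seq("irisClass", "sepalWidth", "sepalLength")
  val scoreRow: Row = Seq("Iris-setosa", 3.5, 5.1)

  // Position of sepalLength as seen in the training schema.
  val sepalLengthByIndex: Row => Any = byIndex(trainSchema.indexOf("sepalLength"))

  // Position of sepalLength resolved against the scoring schema.
  val sepalLengthByName: Row => Any = byName("sepalLength", scoreSchema)
}
```

With the reordered row, the index-based extractor returns the value that now sits in the old position (the class label), while the name-based extractor still returns the sepal length.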
One option I see to overcome this is to recreate the features prior to scoring using the new dataset, then use them as input for the model.
I don't quite understand. Do you mean using the new dataset to create the features and then using the original model to predict?
When I use val features = loadmodel.getRawFeatures().map(_.name), the order has also changed.
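A different workaround, which uses only plain Spark and is not the approach described above, might be to put the scoring data back into the training column order before scoring. The sketch below is a minimal version of that idea; `ColumnOrder` and `alignTo` are hypothetical names, and whether this fully avoids the problem in TransmogrifAI has not been verified in this thread.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper; not part of TransmogrifAI or Spark.
object ColumnOrder {
  /** Returns `df` with its columns selected in the order of `trainingColumns`.
    * Fails fast if a training column is missing from `df`.
    * Columns of `df` that are not listed in `trainingColumns` are dropped.
    * Note: `col` treats dots in a name as nested-field access.
    */
  def alignTo(trainingColumns: Seq[String], df: DataFrame): DataFrame = {
    val present = df.columns.toSet
    val missing = trainingColumns.filterNot(present)
    require(missing.isEmpty,
      s"Columns missing from scoring data: ${missing.mkString(", ")}")
    df.select(trainingColumns.map(col): _*)
  }
}
```

Assuming `frame` is the DataFrame the model was trained on, the scoring call from earlier in the thread would then look like `model.setInputDataset(ColumnOrder.alignTo(frame.columns.toSeq, frame_new)).score()`.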
When I used the iris.csv data:
I created a StructType like this:
Next I get the label column and the feature columns:
id is neither the label nor a feature, but used this way id is also treated as a feature column, which I don't want. So I select only the columns whose comment is label or feature, and then I drop the other columns,
but I get a bug: