Open S-C-H opened 5 years ago
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
@S-C-H sorry about the trouble you are having. Based on the logs this looks like a bug in ComputeModelStatistics, although I'm not exactly sure what the issue is: it looks like the categorical levels-to-index map encountered an unexpected categorical value. I will try to repro the issue locally based on the notebook and the snippet you sent above. Is the dataset the same as in the notebook, or are you using a private dataset?
```
at com.microsoft.ml.spark.train.ComputeModelStatistics$$anonfun$getPredictionAndLabels$1.apply(ComputeModelStatistics.scala:286)
at com.microsoft.ml.spark.train.ComputeModelStatistics$$anonfun$getPredictionAndLabels$1.apply(ComputeModelStatistics.scala:285)
```
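To make the suspected failure mode concrete, here is a pure-Python sketch (this is not mmlspark's actual Scala implementation, just an illustration): a levels-to-index map is built from the categorical values seen at fit time, and a later lookup fails on a value that was never seen.

```python
# Pure-Python sketch of the suspected failure mode. NOT mmlspark's
# actual code -- just an illustration of how a fitted levels-to-index
# map can fail on an unseen categorical value.

def fit_levels(values):
    """Build a value -> index map from the distinct values seen at fit time."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def transform(levels, values):
    """Look each value up in the fitted map; unseen values raise."""
    indexed = []
    for v in values:
        if v not in levels:
            raise KeyError(f"key not found: {v}")
        indexed.append(levels[v])
    return indexed

levels = fit_levels([20, 21, 22, 23])   # 24 is never seen at fit time
print(transform(levels, [20, 23]))      # [0, 3] -- values were seen
try:
    transform(levels, [20, 24])         # fails like the reported error
except KeyError as e:
    print(e.args[0])                    # key not found: 24
```

The point is that the error depends only on whether the offending value was present when the map was fitted, not on whether the overall label sets "look" the same.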
@imatiach-msft
Thanks for looking into this! It is a private dataset with a rather large number of features in a PySpark feature vector (VectorAssembler). I did make sure to shift categorical variables to the beginning of the vector.
@imatiach-msft
I reconfigured my code to use TrainClassifier. (As an aside, what is the best way of doing an n-gram range with TextFeaturizer on char grams?)
I attempted to build a single model and use ComputeModelStatistics.
For prediction I am getting:
Odd that it is an Int and a Double... does this matter? (I did try changing it, but I don't think that's possible after using ValueIndexer.)
```python
metrics = ComputeModelStatistics(evaluationMetric="classification").transform(prediction)
metrics.select('accuracy').show()
```
I run this and get key errors:

```
key not found: 24
```
Despite the fact that the sets of labels are identical:
And I use ValueIndexer:
```python
value_indexer = ValueIndexer(inputCol="CLASS", outputCol="label").fit(traindata)
data = value_indexer.transform(traindata)
test = value_indexer.transform(testdata)
```
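Since the ValueIndexer above is fitted on `traindata` only, any label value that later reaches the fitted map without an entry (whether from the test split or from the model's output) would trigger exactly this kind of "key not found" error. One possible workaround, assuming that is the cause, would be to fit the indexer on the full dataset (something like `ValueIndexer(...).fit(traindata.union(testdata))` — an assumption on my part, not a documented fix). The sketch below simulates the difference in plain Python (no Spark) rather than with the actual mmlspark API:

```python
# Plain-Python simulation (no Spark) of fitting a label index on the
# train split only vs. on the full data. The mmlspark equivalent would
# be roughly ValueIndexer(...).fit(traindata.union(testdata)) -- an
# assumption, not a documented fix.

def fit_levels(values):
    """Build a label -> index map from the distinct values seen."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

train_labels = [20, 21, 22, 23]
test_labels = [20, 21, 22, 23, 24]   # 24 appears only outside the train split

# Fitted on train only: 24 has no index, so a later lookup would fail.
train_only = fit_levels(train_labels)
print(24 in train_only)                          # False

# Fitted on the full data: every label has an index.
full = fit_levels(train_labels + test_labels)
print(all(lbl in full for lbl in test_labels))   # True
```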
log4j.txt
Following this example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/HyperParameterTuning%20-%20Fighting%20Breast%20Cancer.ipynb
Increasing numFolds etc. simply increases the time before the same exception is thrown.
The data is in a PySpark-style format: a huge feature vector plus a ValueIndexer-indexed label, hence I do not call TrainClassifier on the models. I did attempt to use TrainClassifier (which can train a single model), but it still throws the same error.