microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.04k stars 830 forks

TuneHyperparameters - Exception thrown in awaitResult #667

Open S-C-H opened 5 years ago

S-C-H commented 5 years ago

log4j.txt

Following this example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/HyperParameterTuning%20-%20Fighting%20Breast%20Cancer.ipynb

from mmlspark.automl import TuneHyperparameters
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
logReg = LogisticRegression()
randForest = RandomForestClassifier()
smlmodels = [logReg, randForest]

from mmlspark.automl import *

paramBuilder = \
  HyperparamBuilder() \
    .addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3)) \
    .addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5, 10]))

searchSpace = paramBuilder.build()
# The search space is a list of params to tuples of estimator and hyperparam
print(searchSpace)
randomSpace = RandomSpace(searchSpace)

bestModel = TuneHyperparameters(
              evaluationMetric="accuracy", models=smlmodels, numFolds=1,
              numRuns=len(smlmodels) * 1, parallelism=1,
              paramSpace=randomSpace.space(), seed=0).fit(data.select("features", "label"))
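For intuition, a random search space like the one built above samples each hyperparameter independently: continuous parameters from their range, discrete parameters from their set. A minimal pure-Python sketch of that sampling idea (the dict and lambda structure here are illustrative, not mmlspark's actual RandomSpace internals):

```python
import random

random.seed(0)

# Illustrative search space mirroring the builder above:
# a continuous range for regParam, a discrete set for numTrees.
space = {
    "regParam": lambda: random.uniform(0.1, 0.3),
    "numTrees": lambda: random.choice([5, 10]),
}

# One random draw from the space, i.e. one candidate configuration.
sample = {name: draw() for name, draw in space.items()}
print(sample)
```

Each of the `numRuns` trials corresponds to one such draw, evaluated with cross-validation over `numFolds` folds.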

Setting numFolds etc. higher simply increases the time until the exception is thrown.

The data is in a PySpark-style format: a huge feature vector plus a ValueIndexer-encoded label. Hence I do not call TrainClassifier on the models. I did attempt to use TrainClassifier (which can train a single model), but it still throws the same error.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 36.0 failed 4 times, most recent failure: Lost task 22.3 in stage 36.0 (TID 843, 10.179.68.7, executor 2): java.util.NoSuchElementException: key not found: 85

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

imatiach-msft commented 5 years ago

@S-C-H sorry about the trouble you are having. Based on the logs this looks like a bug in ComputeModelStatistics, although I'm not exactly sure what the issue is; it looks like the categorical levels-to-index map encountered an unexpected categorical value. I will try to repro the issue locally based on the notebook and the snippet you sent above - is the dataset the same as in the notebook, or are you using a private dataset?

    at com.microsoft.ml.spark.train.ComputeModelStatistics$$anonfun$getPredictionAndLabels$1.apply(ComputeModelStatistics.scala:286)
    at com.microsoft.ml.spark.train.ComputeModelStatistics$$anonfun$getPredictionAndLabels$1.apply(ComputeModelStatistics.scala:285)
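For context on what an error of this shape usually means: a levels-to-index map is built from the label values seen at fit time, and the lookup fails when scoring encounters a value that map never saw. A hedged pure-Python sketch of that failure mode (the map name and values are illustrative, not SynapseML's actual internals):

```python
# Illustrative only: a levels-to-index map built from values seen at fit time.
levels_to_index = {0.0: 0, 1.0: 1}  # hypothetical levels observed during training

def index_of(label):
    """Look up a label's index; raises KeyError for unseen values,
    analogous to Scala's NoSuchElementException: key not found."""
    return levels_to_index[label]

print(index_of(1.0))  # a known level resolves fine

try:
    index_of(85.0)  # a value the map never saw
except KeyError as e:
    print(f"key not found: {e}")  # mirrors the "key not found: 85" in the trace
```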
S-C-H commented 5 years ago

@imatiach-msft

Thanks for looking into this! It is a private dataset with a rather large number of features in a PySpark feature vector (VectorAssembler). I did make sure to shift categorical variables to the beginning of the vector.

S-C-H commented 5 years ago

@imatiach-msft

I reconfigured my code to use TrainClassifier. (As an aside, what is the best way of doing an ngram range -> text featurization on char grams?)
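On the char-gram aside: the underlying transformation is just sliding windows of several lengths over the characters of a string. A minimal pure-Python sketch of an n-gram range on char grams (this helper is illustrative, not an mmlspark API; in Spark one would express it as a transformer or UDF):

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Extract all character n-grams of lengths n_min..n_max.
    Illustrative helper only, not part of mmlspark."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams("spark", 2, 3))
# ['sp', 'pa', 'ar', 'rk', 'spa', 'par', 'ark']
```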

I attempted to build a single model and use ComputeModelStatistics.

For prediction I am getting:

[screenshot not included]

Odd that one is an Int and the other a Double... Does this have an impact? (I did try changing the types, but I don't think that's possible after using ValueIndexer.)

metrics = ComputeModelStatistics(evaluationMetric="classification").transform(prediction)
metrics.select('accuracy').show()

I run this and get key errors:

key not found: 24

Despite the fact that the sets of labels are identical: [screenshot not included]

And I use ValueIndexer:

value_indexer = ValueIndexer(inputCol="CLASS", outputCol="label").fit(traindata)
data = value_indexer.transform(traindata)
test = value_indexer.transform(testdata)
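One quick sanity check here is that every label value in the test set was actually seen when the indexer was fit, since any unseen value is exactly the kind of thing that produces a failed map lookup. A hedged sketch of the check with plain Python sets (the label values are made up; in PySpark the distinct values would come from something like `traindata.select("CLASS").distinct().collect()`):

```python
# Illustrative label sets; in practice these would be collected from the
# train and test Spark DataFrames.
train_labels = {"A", "B", "C"}
test_labels = {"A", "B", "C", "D"}

unseen = test_labels - train_labels
if unseen:
    # Any such value would fail the levels-to-index lookup at scoring time.
    print(f"labels in test but not train: {sorted(unseen)}")
```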