rbarbudo closed this issue 3 years ago
@rbarbudo, thanks for trying ML-Plan and for reaching out to us. Sorry for the delay, we're still half on holidays.
First, the fact that you get the same results after a timeout of 1 min, even with different seeds, is not surprising. ML-Plan does not initialize with a random search but has a list of learners it will try (in their default configuration) prior to further exploration. Typically, this list cannot be fully processed within 1 min, so you will always get similar results regardless of the seed (the only difference then will be the splits for the train/validation folds). For a 10 min run I would expect deviating results though; can you enable logs and check the candidates ML-Plan is evaluating?
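The seed behaviour described above can be illustrated with a toy model. This is a sketch only and does not use any ML-Plan code: the learner names, their order, and the "one evaluation costs one budget unit" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Toy model only: this is NOT ML-Plan code. It mimics a search that first walks a
// fixed list of default-configured learners and only afterwards explores randomly.
final class WarmStartSketch {

    // Hypothetical learner names in a hypothetical fixed order.
    static final List<String> DEFAULTS = Arrays.asList(
            "DummyClassifier", "GaussianNB", "DecisionTreeClassifier",
            "LogisticRegression", "RandomForestClassifier");

    // Returns the candidates evaluated when each evaluation costs one budget unit.
    static List<String> evaluated(final long seed, final int budget) {
        Random random = new Random(seed);
        List<String> tried = new ArrayList<>();
        for (int i = 0; i < budget; i++) {
            if (i < DEFAULTS.size()) {
                tried.add(DEFAULTS.get(i)); // phase 1: the seed plays no role here
            } else {
                tried.add("random-config-" + random.nextInt(1_000_000)); // phase 2: seed-dependent
            }
        }
        return tried;
    }

    public static void main(final String[] args) {
        System.out.println(evaluated(1, 2)); // short budget, seed 1
        System.out.println(evaluated(2, 2)); // short budget, seed 2
        System.out.println(evaluated(1, 7)); // longer budget, seed 1
        System.out.println(evaluated(2, 7)); // longer budget, seed 2
    }
}
```

In this model, a budget too small to exhaust the fixed list yields the same candidates for every seed; the seed only starts to matter once the list has been processed.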
The phenomenon you describe about the returned classifier could potentially be a bug. To be honest, we have focused much more on WEKA than on sklearn to date, and this could be an issue with the wrapper. Please give us a few days to reproduce the issue and check what's going on.
Status update: We are still working on it. It is definitely a bug, but its source is not easy to locate. We currently believe that it has nothing to do with ML-Plan itself but with the way predictions are communicated between the Python side and the Java side.
If you have no particular need to use ML-Plan with sklearn, we recommend using the WEKA version in the meantime, which is much better tested and more stable.
We hope to have this fixed in the next 14 days though.
Thank you very much for your time. I'll use the WEKA version in the meantime.
Hi @rbarbudo,
sorry for the ridiculous delay in this process. We had to reopen some deep-lying modules to repair this problem, which was not at all associated with ML-Plan itself. After several tests I can confirm that the issue is resolved in the newest release, 0.2.5. I adjusted your code example slightly and now get meaningful results. Please note that a few interfaces have changed in the new version.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.api4.java.ai.ml.classification.singlelabel.evaluation.ISingleLabelClassification;
import org.api4.java.ai.ml.core.dataset.supervised.ILabeledDataset;
import org.api4.java.ai.ml.core.dataset.supervised.ILabeledInstance;
import org.api4.java.ai.ml.core.evaluation.IPrediction;
import org.api4.java.ai.ml.core.evaluation.execution.ILearnerRunReport;
import org.api4.java.algorithm.Timeout;
import ai.libs.jaicore.logging.LoggerUtil;
import ai.libs.jaicore.ml.classification.loss.dataset.EClassificationPerformanceMeasure;
import ai.libs.jaicore.ml.core.dataset.serialization.OpenMLDatasetReader;
import ai.libs.jaicore.ml.core.evaluation.evaluator.SupervisedLearnerExecutor;
import ai.libs.jaicore.ml.core.filter.SplitterUtil;
import ai.libs.jaicore.ml.scikitwrapper.IScikitLearnWrapper;
import ai.libs.mlplan.core.MLPlan;
import ai.libs.mlplan.sklearn.builder.MLPlanScikitLearnBuilder;
public class LaunchMLPlan {

    public static void main(final String[] args) throws Exception {
        String dataset = args[0]; // in fact ignored in this example
        int seed = Integer.parseInt(args[1]);
        int budget = Integer.parseInt(args[2]);

        ILabeledDataset<ILabeledInstance> d = new OpenMLDatasetReader().deserializeDataset(40975);
        List<ILabeledDataset<ILabeledInstance>> split = SplitterUtil.getLabelStratifiedTrainTestSplit(d, seed, .7);
        ILabeledDataset<ILabeledInstance> dTrain = split.get(0);
        ILabeledDataset<ILabeledInstance> dTest = split.get(1);

        // get the list of labels
        String labels = dTrain.getLabelAttribute().getStringDescriptionOfDomain();
        labels = labels.replace("[", "").replace("]", "").replace(" ", "");
        String[] labelList = labels.split(",");

        long start = System.currentTimeMillis();
        MLPlanScikitLearnBuilder builder = MLPlanScikitLearnBuilder.forClassification();
        // set the number of cores
        builder.withNumCpus(1);
        // set the seed
        builder.withSeed(seed);
        // set the global timeout of ML-Plan
        builder.withTimeOut(new Timeout(budget, TimeUnit.SECONDS));
        // set the timeout of a single solution candidate (integer division: budget/10)
        builder.withNodeEvaluationTimeOut(new Timeout(budget/10, TimeUnit.SECONDS));
        builder.withCandidateEvaluationTimeOut(new Timeout(budget/10, TimeUnit.SECONDS));
        System.out.println(builder.getSearchSpaceConfigFile());
        System.out.println(builder.getAlgorithmConfig());
        // reserve no portion of the training data for the selection phase
        builder.withPortionOfDataReservedForSelection(.0);

        // start the optimization process
        MLPlan<IScikitLearnWrapper> mlplan = builder.withDataset(dTrain).build();
        mlplan.setLoggerName(LoggerUtil.LOGGER_NAME_TESTEDALGORITHM);
        IScikitLearnWrapper classifier = mlplan.call();
        long end = System.currentTimeMillis();
        float sec = (end - start) / 1000F;
        BufferedWriter bw = new BufferedWriter(new FileWriter("runtime" + File.separator + dataset + "_" + seed + ".txt"));
        bw.write(sec + " seconds\n");
        bw.close();

        // show the resulting model and its performance
        SupervisedLearnerExecutor executor = new SupervisedLearnerExecutor();
        ILearnerRunReport report = executor.execute(classifier, dTest);
        System.out.println("Chosen model is: " + mlplan.getSelectedClassifier());
        System.out.println("Error Rate of the solution produced by ML-Plan: " +
                EClassificationPerformanceMeasure.ERRORRATE.loss(report.getPredictionDiffList().getCastedView(Integer.class, ISingleLabelClassification.class)));

        // use the resulting model for prediction and store it in a file
        bw = new BufferedWriter(new FileWriter("predictions" + File.separator + dataset + "_" + seed + ".csv"));
        bw.write("y_pred\n");
        for (IPrediction prediction : classifier.predict(dTest).getPredictions()) {
            bw.write(labelList[(int) prediction.getPrediction()] + "\n");
        }
        bw.close();
    }
}
The output is:
Chosen model is: LogisticRegression(C=1.0,dual=False,penalty="l2")
Error Rate of the solution produced by ML-Plan: 0.07722007722007722
I hope this resolves your doubts. If another problem comes up, please open a new issue, or we can re-open this one.
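The label handling in the example above (turning the domain description into a list and mapping predicted indices back to labels) can be tried in isolation. The following is a minimal sketch that assumes the description has the bracketed, comma-separated form implied by the example, such as "[unacc, acc, good, vgood]"; the class and method names are hypothetical and not part of ML-Plan or AILibs. Unlike the example, it trims whitespace only around each label rather than removing every space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of ML-Plan: parses a label domain description
// and maps predicted class indices back to label names.
final class LabelDomainSketch {

    // Parses a description such as "[unacc, acc]" into ["unacc", "acc"].
    // Assumption: the labels themselves contain no commas or brackets.
    static List<String> parse(final String description) {
        String inner = description.trim();
        if (inner.startsWith("[") && inner.endsWith("]")) {
            inner = inner.substring(1, inner.length() - 1);
        }
        List<String> labels = new ArrayList<>();
        if (inner.trim().isEmpty()) {
            return labels; // empty domain
        }
        for (String part : inner.split(",")) {
            labels.add(part.trim());
        }
        return labels;
    }

    // Maps a predicted class index to its label and fails loudly on invalid indices,
    // which makes index/label mismatches visible instead of silently wrong.
    static String labelOf(final List<String> labels, final int index) {
        if (index < 0 || index >= labels.size()) {
            throw new IllegalArgumentException("Predicted index " + index + " is outside the label domain " + labels);
        }
        return labels.get(index);
    }

    public static void main(final String[] args) {
        List<String> labels = parse("[unacc, acc, good, vgood]");
        System.out.println(labelOf(labels, 2)); // prints good
    }
}
```

Checking the index range explicitly is a design choice here: an out-of-range prediction then raises a descriptive error rather than an ArrayIndexOutOfBoundsException deep inside the output loop.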
Whenever I run ML-Plan, it always returns a DummyClassifier or a GaussianNB when I call the getSelectedClassifier() method. I have tried two different budgets (1 and 10 min) and different seeds too. However, I always get the same results.
I have followed the installation instructions, although I had to modify the pom.xml file:
I work with Eclipse and I have created a simple project with only a class:
It is worth noting that I have copied to my project directory the .json files configuring the search space:
builder.withSearchSpaceConfigFile(new File("./automl/searchmodels/sklearn/sklearn-classification.json"));
I have also tried to modify the list of applicable classifiers to only select the DecisionTreeClassifier, which makes the getSelectedClassifier() method return such a classifier. However, the results it achieves are very bad (i.e. an error rate close to 0.8). I have created a simple Python script to train a DecisionTree with the configuration returned by ML-Plan (DecisionTreeClassifier(criterion="gini",max_depth=6,min_samples_split=11,min_samples_leaf=11)) over the same data partition and it returns much better results.
Am I misunderstanding something about the use of ML-Plan? Thank you in advance.
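A comparison like the one described above comes down to computing the error rate of two prediction lists against the same ground truth. The following is a minimal, library-free sketch of that computation; the class name, method name, and sample labels are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: error rate = fraction of instances whose predicted label
// differs from the true label.
final class ErrorRateSketch {

    static double errorRate(final List<String> expected, final List<String> predicted) {
        if (expected.size() != predicted.size()) {
            throw new IllegalArgumentException("Both lists must have the same length.");
        }
        if (expected.isEmpty()) {
            throw new IllegalArgumentException("Cannot compute an error rate on empty lists.");
        }
        int mistakes = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (!expected.get(i).equals(predicted.get(i))) {
                mistakes++;
            }
        }
        return (double) mistakes / expected.size();
    }

    public static void main(final String[] args) {
        // Made-up labels for illustration; not taken from the attached data.
        List<String> yTrue = Arrays.asList("unacc", "acc", "unacc", "good");
        List<String> yPred = Arrays.asList("unacc", "unacc", "unacc", "good");
        System.out.println(errorRate(yTrue, yPred)); // prints 0.25
    }
}
```

Applying the same function to the predictions written by the Java program and to those of the Python script, both against the true test labels, gives directly comparable numbers.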
Train and test files: car.zip