nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
BSD 3-Clause "New" or "Revised" License

C code generator with multi output RFC - illegal code generated and general failure to handle multi dimension output #56

Open mg169706 opened 5 years ago

mg169706 commented 5 years ago

I'm creating a Random Forest Classifier with 248 inputs and 108 outputs. Depending on the Boolean state of each input, each of the 108 outputs will be on or off (they represent valves). The values of these discrete output states are what the system has learned. I'm having two issues with this:

  1. The code generator only seems to create trees for one output, and I don't know which one. For each output I'd expect a separate set of trees, because the inputs remain the same, but the decision tree for each valve's state will be different.
  2. The code for the single output generates invalid C. See below for example code fragment.

    `int predict_0(float features[]) { int classes[[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]];

    if (features[181] <= 0.5) { ... } }`
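For anyone trying to reproduce this, here is a minimal sketch of what a multi-output forest looks like on the scikit-learn side. The data is made up (8 Boolean inputs and 3 outputs instead of 248 and 108), and the link to the generator is an assumption: the bracketed list in `classes[[2, 2, ...]]` looks like the per-output `n_classes_` list pasted into a slot that expects a single integer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real data: 8 Boolean inputs, 3 Boolean outputs.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 8)).astype(float)
Y = np.column_stack([
    X[:, 0].astype(int),              # output 0 follows input 0
    (X[:, 1] * X[:, 2]).astype(int),  # output 1 is an AND of two inputs
    np.zeros(200, dtype=int),         # output 2 never changes, so it has one class
])

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, Y)

# With a 2-D target, n_classes_ holds one entry per output, not one integer.
n_classes = [int(n) for n in clf.n_classes_]
print(clf.n_outputs_)            # number of outputs (columns of Y)
print(n_classes)                 # class count for each output
print(clf.predict(X[:1]).shape)  # one row, one column per output
```

An output that is constant in the training data reports a single class, which would explain the two `1` entries among the `2`s in the fragment above.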

MatInGit commented 3 years ago

Any updates on this issue? I have the same problem.

MatInGit commented 3 years ago

OK, I found a workaround of sorts. You can use sklearn.multioutput.MultiOutputClassifier to fit one classifier per output, then export each entry of the fitted .estimators_ list as a separate classifier. It does mean you have to modify the generated C code a little, since you now have multiple separate classifiers.
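A rough sketch of that workaround, on made-up data (8 inputs and 3 outputs, not the real 248 and 108). Only scikit-learn calls are shown. The actual sklearn-porter export call is left out, so the loop body is a placeholder for wherever you hand each estimator to the exporter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 8 Boolean inputs, 3 Boolean outputs.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 8)).astype(float)
Y = np.column_stack([X[:, 0], X[:, 1] * X[:, 2], 1 - X[:, 3]]).astype(int)

# Fit one independent single-output forest per column of Y.
multi = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=5, random_state=0)
)
multi.fit(X, Y)

# estimators_ is a list with one fitted RandomForestClassifier per output.
per_output = multi.estimators_
for i, est in enumerate(per_output):
    # Each entry is an ordinary single-output classifier, which is the
    # case the C generator is expected to handle. Export est here and
    # give the generated predict function a name that includes i.
    print(i, est.n_outputs_, est.classes_)

# Stacking the per-output predictions reproduces the multi-output result.
stacked = np.column_stack([est.predict(X) for est in per_output])
print(np.array_equal(stacked, multi.predict(X)))
```

Because the export happens inside a loop, the Python side is the same for 3 outputs or 108. What remains manual is merging the generated C files and renaming their functions so they do not collide.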

HannanKan commented 2 years ago

> OK, I found a workaround of sorts. You can use sklearn.multioutput.MultiOutputClassifier to fit one classifier per output, then export each entry of the fitted .estimators_ list as a separate classifier. It does mean you have to modify the generated C code a little, since you now have multiple separate classifiers.

I'm afraid that is not a good idea. With 108 outputs, it will take a lot of manual work.