microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License
3.32k stars 274 forks

Should SKLearn operators be assumed to produce a single output? #656

Open stillmatic opened 1 year ago

stillmatic commented 1 year ago

See https://github.com/microsoft/hummingbird/blob/main/hummingbird/ml/_parse.py#L256

Consider models which implement both predict and predict_proba functions. These return both labels and probabilities as outputs. The current logic assumes a single output, which means we cannot name the outputs in the Hummingbird conversion step (i.e., via the output_names argument to extra_config) and instead have to perform some ONNX graph surgery afterwards.
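The post-hoc "graph surgery" amounts to renaming the graph outputs and patching every node output that produces them. A minimal sketch of that logic, using hand-rolled stand-in dataclasses instead of the real onnx protobuf objects (the field names mirror GraphProto/NodeProto/ValueInfoProto, but this is illustrative, not Hummingbird's API):

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-ins for onnx's NodeProto / ValueInfoProto / GraphProto.
@dataclass
class Node:
    op_type: str
    output: List[str]

@dataclass
class ValueInfo:
    name: str

@dataclass
class Graph:
    node: List[Node]
    output: List[ValueInfo]

def rename_graph_outputs(graph: Graph, new_names: List[str]) -> Graph:
    """Rename the graph outputs and fix up the node outputs that feed them."""
    assert len(new_names) == len(graph.output), "one new name per graph output"
    mapping = {out.name: new for out, new in zip(graph.output, new_names)}
    for out in graph.output:
        out.name = mapping[out.name]
    for node in graph.node:
        # Any node that produced an old graph-output name must be renamed too,
        # otherwise the graph output dangles.
        node.output[:] = [mapping.get(name, name) for name in node.output]
    return graph
```

The same rename-and-patch pass would apply to a real onnx ModelProto's `model.graph`, e.g. mapping the auto-generated "variable"/"variable1" names to "label"/"probabilities".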

stillmatic commented 1 year ago

I took an extremely silly hard-coded approach here: https://github.com/stillmatic/hummingbird/commit/0a2fc96bdb1112607ae61719555424ce1ba80255

interesaaat commented 1 year ago

Hey, let me take a stab at this. It requires several changes to make it generic, but I agree that it could be the source of the error behind #676.

interesaaat commented 1 year ago

Of course unless you want to try to fix this yourself!

stillmatic commented 1 year ago

yeah, the generic case is quite tricky; hard-coding it is much easier ;)

I tried a cleaner implementation by adding XGBClassifier to the supported sklearn operator map, but that probably requires upstream changes to onnxmltools. For right now I'm happy with https://github.com/stillmatic/hummingbird/commit/0a2fc96bdb1112607ae61719555424ce1ba80255 for my own use, though it's probably too hacky to upstream.

After debugging more, I don't think it's the problem behind #676; that problem appears to be specifically in the GEMM -> ONNX conversion.

interesaaat commented 1 year ago

@stillmatic can you check if this repo will solve your problem?

stillmatic commented 1 year ago

Interesting, this definitely looks like it is properly generating the two variables. On one of my internal models, I'm seeing this output:

In [10]: hmclf.model.graph.output
Out[10]:
[name: "variable1"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "variable2"
type {
  tensor_type {
    elem_type: 11
    shape {
      dim {
        dim_param: "sym"
      }
      dim {
        dim_param: "Concatvariable2_dim_1"
      }
    }
  }
}
]

Though I can't generate a reproducible example, it doesn't seem to affect prediction; ONNX is happy with it. Let me dig a bit more over the next few days!
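For anyone reading the dump above: the elem_type codes follow ONNX's TensorProto.DataType enum, so 7 is INT64 (the label output) and 11 is DOUBLE (the probability output), and the dim_param entries like "sym" are symbolic (dynamic) dimensions, typically the batch size. A quick hand-transcribed lookup table (a subset of the enum, not pulled from the onnx package) for decoding such dumps:

```python
# Subset of ONNX's TensorProto.DataType enum, transcribed by hand.
ONNX_ELEM_TYPE = {
    1: "FLOAT",    # float32
    6: "INT32",
    7: "INT64",    # e.g. predicted class labels
    8: "STRING",
    9: "BOOL",
    11: "DOUBLE",  # float64, e.g. class probabilities
}

def describe_output(name: str, elem_type: int) -> str:
    """Human-readable one-liner for a graph output's dtype."""
    return f"{name}: {ONNX_ELEM_TYPE.get(elem_type, 'UNKNOWN')}"
```

So the dump above decodes to variable1 being int64 labels and variable2 being a double probability matrix of shape (batch, n_classes).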

stillmatic commented 1 year ago

@interesaaat -- sorry for the delayed response. I have tested your branch with internal models and everything seems to be working, so I think it could be upstreamed. Thanks for the help!