siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License

Return predictions for raw_score=True #7

Closed jtilly closed 3 years ago

jtilly commented 3 years ago

We often fit GBMs with base margins (called init_score in LightGBM). These base margins also have to be supplied at prediction time, but LightGBM's predict method offers no convenient way to do that. So in practice we call predict with raw_score=True, add the applicable base margin, and then apply the inverse link function.
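A minimal sketch of that workflow, assuming a binary objective whose inverse link is the sigmoid. The helper name `predict_with_base_margin` is made up for illustration, and the raw scores are passed in as plain numbers so the snippet runs without LightGBM installed.

```python
import math


def sigmoid(x):
    """Inverse link for a binary (log-loss) objective, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_with_base_margin(raw_scores, base_margins, inverse_link=sigmoid):
    """Add a per-row base margin to raw scores, then apply the inverse link.

    Hypothetical helper, not part of LightGBM or lleaves.
    """
    if len(raw_scores) != len(base_margins):
        raise ValueError("raw_scores and base_margins must have the same length")
    return [inverse_link(r + b) for r, b in zip(raw_scores, base_margins)]
```

With a fitted LightGBM Booster, `raw_scores` would be the output of `booster.predict(X, raw_score=True)`.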

Currently, when we compile a model with lleaves, the link function gets hard-wired into it, so we first need to undo that to recover the raw score, then add the base margin, and then apply the inverse link function again.
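The round trip described above might look like the following sketch, again assuming a binary objective (sigmoid as inverse link, logit to undo it). `add_margin_to_probability` is a hypothetical name used only here.

```python
import math


def sigmoid(x):
    """Inverse link: raw score -> probability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    """Undo the sigmoid: probability -> raw score. Fails for p == 0 or p == 1."""
    return math.log(p / (1.0 - p))


def add_margin_to_probability(prob, base_margin):
    """Recover the raw score from a probability, shift it, and re-apply the sigmoid."""
    return sigmoid(logit(prob) + base_margin)
```

Besides the extra work, this detour breaks down once a prediction saturates to exactly 0 or 1 in floating point, which is one more reason to prefer getting raw scores directly.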

Two options:

  1. Always compile the model without the link function, add a raw_score argument to predict, and apply the inverse link function in Python. I don't think there's a massive performance penalty for that.
  2. Add an option to compile models without the link function and leave it to the user to deal with it.

siboehm commented 3 years ago

A hacky fix for now is to edit the model.txt, replacing objective=<your objective function> in the top block with objective=regression. Then lleaves won't add any link function and you'll get back raw scores.
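The edit can also be scripted. The sketch below assumes the objective is stored on a single line starting with `objective=` and patches only the first such line; `force_regression_objective` is a hypothetical helper name.

```python
def force_regression_objective(model_text):
    """Return the model text with its first 'objective=' line set to regression."""
    lines = model_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith("objective="):
            # Keep whatever line ending the original line used.
            newline = line[len(line.rstrip("\r\n")):]
            lines[i] = "objective=regression" + newline
            return "".join(lines)
    raise ValueError("no 'objective=' line found in model text")
```

A typical use would be to read model.txt, pass its contents through this function, and write the result to a separate file so the original model stays untouched.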

Considering the options:

  1. Implementing raw_score would be nice since it gets lleaves closer to the LightGBM interface. I probably won't implement this in Python since there is a severe performance hit for small batches, but it could be a flag of the LLVM function. This would add an extra branch, but that branch should be easy to predict, so I expect no perf hit (to be tested).
  2. Just adding a compile flag would be the easiest solution, and I guess it's fine to burden the user with making this decision upfront.
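To make the API trade-off concrete, here is a toy Python sketch of the two shapes. It is not lleaves code: `predict` and `compile_predictor` are hypothetical names, the "model" is just a function returning a raw score, and the sigmoid stands in for whatever link the objective implies.

```python
import math


def sigmoid(x):
    """Stand-in inverse link for the illustration."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(raw_model, x, raw_score=False):
    """Option 1: a runtime flag, costing one extra branch on every call."""
    score = raw_model(x)
    return score if raw_score else sigmoid(score)


def compile_predictor(raw_model, raw_score=False):
    """Option 2: decide once up front; the returned function takes no flag."""
    if raw_score:
        return raw_model
    return lambda x: sigmoid(raw_model(x))
```

The runtime flag adds a parameter to every call, while the compile-time flag keeps the call signature unchanged but forces the choice before any prediction is made.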

This is mainly an API question, not an implementation one. I'll think about it for a few days and implement something.

siboehm commented 3 years ago

For now I'll add raw_score as a compilation parameter. I'll probably make it a runtime parameter at some point, but I didn't want to break the binary interface.