siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License

Improve Python API for model serialization #25

Open TomScheffers opened 2 years ago

TomScheffers commented 2 years ago

I am loading a model with thousands of trees, which takes approx. 10 minutes. Therefore I want to compile the model once and then serialize it to a file. Both pickle and dill give the following error: "ValueError: ctypes objects containing pointers cannot be pickled". Is there a way to save/load the compiled model to/from disk? Thanks :)

siboehm commented 2 years ago

There's a cache=<filepath> parameter for lleaves.Model.compile(). Does that do what you're looking for? See docs for more info. I looked into supporting pickling a while ago, but the cache parameter seemed like the cleaner solution.
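
For reference, a minimal sketch of the cache workflow described above. The helper name `load_compiled_model` and the file names in the usage note are hypothetical. Only `lleaves.Model(model_file=...)` and `compile(cache=...)` come from lleaves itself, and the caching behaviour noted in the comments is as I understand it from the docs, so treat it as an assumption.

```python
def load_compiled_model(model_txt, cache_path):
    """Return a compiled lleaves.Model, reusing the cache file if it exists.

    Hypothetical helper for illustration, not part of lleaves.
    """
    # Imported inside the function so the helper can be defined
    # even in environments where lleaves is not installed.
    import lleaves

    # The LightGBM model.txt is still required to construct the model.
    model = lleaves.Model(model_file=str(model_txt))

    # Assumed behaviour (per the lleaves docs): if cache_path does not exist,
    # the model is compiled and the binary is written there; if it does exist,
    # the binary is loaded and compilation is skipped.
    model.compile(cache=str(cache_path))
    return model
```

Calling `load_compiled_model("model.txt", "model.elf")` twice should pay the compilation cost only on the first call. As far as I know the cache is not checked for staleness, so the cache file has to be deleted by hand whenever model.txt changes.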

TomScheffers commented 2 years ago

Thanks for your quick response. That should do the job! A nice addition would be to have a @classmethod (lleaves.Model.from_cache) that initializes a model directly from cache, as now you still have to initialize with the model_txt.

Love your work on this package. FYI: I get a ~10x speedup compared to the lightgbm.predict method, using a lot of categorical variables.

siboehm commented 2 years ago

Yeah, you're right, a classmethod would be nicer. Currently, what's stored in the cache is an ELF file (on Linux) containing the compiled function. Recreating a lleaves.Model from the ELF file alone would require storing extra information, e.g. the pandas_categoricals (which is a list of lists of strings), as a static variable in the ELF file, which sounds like a PITA.

I might look into this again at some point. I assume there'll either be some way to enable pickling, or I'll serialize the pandas categorical list somehow, or I'll have a "light" version of pickling, where the model can be pickled but the pickle will not include the compiled function, requiring you to store 2 files (the pickled model and the ELF cache file).
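
A rough sketch of what the "light" pickling idea could look like, using only the standard library. Everything here is hypothetical: `LightPicklableModel`, `attach()`, and the stand-in function are illustrative names, not lleaves code, and a real version would load the function from the ELF cache file instead of wrapping a Python lambda.

```python
import ctypes
import pickle

# Signature of the stand-in "compiled" function: double f(double).
PREDICT_FUNC = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)


class LightPicklableModel:
    """Hypothetical wrapper: metadata is pickled, the native function is not."""

    def __init__(self, pandas_categorical, cache_path):
        self.pandas_categorical = pandas_categorical  # list of lists of strings
        self.cache_path = cache_path  # second file: the compiled binary
        self._func = None  # ctypes function pointer, set by attach()

    def attach(self):
        # Stand-in for loading the compiled function from self.cache_path.
        self._func = PREDICT_FUNC(lambda x: 2.0 * x)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_func"] = None  # drop the unpicklable ctypes pointer
        return state


model = LightPicklableModel([["a", "b"], ["x"]], "model.elf")
model.attach()

try:
    pickle.dumps(model._func)  # pickling the raw function pointer fails
except ValueError as err:
    print(err)

restored = pickle.loads(pickle.dumps(model))  # metadata survives the round trip
restored.attach()  # re-link the function from the second file
```

The pickle carries only the Python-side state, so the cache file has to be shipped alongside it and re-attached after loading.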

> Love your work on this package. FYI: I get a ~10x speedup compared to the lightgbm.predict method, using a lot of categorical variables.

I'm glad to hear lleaves is working for you! :)