interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Add a new interpretable algorithm, Automatic Piecewise Linear Regression #547

Closed mathias-von-ottenbreit closed 4 months ago

mathias-von-ottenbreit commented 5 months ago

Perhaps you could add a light wrapper for the glassbox algorithm below to InterpretML? I believe that adding it would give InterpretML users more freedom of choice among glassbox algorithms.

The algorithm, Automatic Piecewise Linear Regression (APLR), produces glassbox models and matches EBM on predictiveness (slightly better than EBM on some datasets, slightly worse on others). It can be used for regression or classification.

Relative advantages of APLR:

Relative advantages of EBM:

Best regards, Mathias Ottenbreit

paulbkoch commented 5 months ago

Very interesting @mathias-von-ottenbreit. I could see how using Piecewise Linear Regression might improve on piecewise constant EBMs for certain datasets. Looks like you've been working on this for a while! Even though the paper is recent, I see the first release of APLR was in May 2022.

Are there any visualizations available?

mathias-von-ottenbreit commented 5 months ago

Hi @paulbkoch. Yes, I have spent many hours working on the algorithm, improving it over time and adding new functionality. There are a couple of visualizations in this presentation on slide 6. Also, running this example script will create and save plots for a model trained on another dataset (if you run it, it is a good idea to also look at the estimated_feature_importance dataframe to find the most interesting plots).

paulbkoch commented 5 months ago

I think this would be a great capability to add. Would you like to submit a PR with the wrapper?

mathias-von-ottenbreit commented 5 months ago

Great! I'm happy to contribute. I could do this by submitting a PR with the wrapper, as you suggested, or by helping you create the wrapper. If you would prefer that I submit a PR, I will need some initial help, as I am not familiar with your codebase.

paulbkoch commented 5 months ago

Thanks @mathias-von-ottenbreit -- I think this would be great to add, but when I have time these days I mostly focus on improving EBMs. If you'd like to submit a PR, I can work with you on that. I would start by looking at our other glassbox models:

https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/glassbox/_linear.py
https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/glassbox/_decisiontree.py

I'm also happy to leave this here as an open issue, but I want to be clear it might be a while.

paulbkoch commented 5 months ago

Another more involved, but possibly more interesting, option would be to merge the APLR and EBM algorithms so that users could choose either algorithm for individual features within the same model. We have a feature_types parameter that allows users to select between "continuous" and "nominal". Perhaps "aplr" could be a third option, which would allow users to designate individual features as piecewise linear.
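Purely as a hypothetical sketch of how that might look to a user (the "aplr" value is speculative; EBMs currently accept only types such as "continuous" and "nominal"):

from interpret.glassbox import ExplainableBoostingRegressor

# Speculative usage: "aplr" as a per-feature type is the proposal above,
# not a currently supported value of feature_types.
ebm = ExplainableBoostingRegressor(
    feature_types=["continuous", "nominal", "aplr"]
)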

mathias-von-ottenbreit commented 5 months ago

That's ok @paulbkoch, I will submit a PR with a wrapper.

I took a quick look at the files you mentioned. It seems it should be possible to define two classes, APLRRegressor and APLRClassifier, that inherit from the aplr package but gain two additional methods each, explain_global and explain_local, roughly as in the sketch below. Do you have any documentation, guidelines, or requirements for the explain methods? While I should be able to figure things out from the code, it would be easier with an explanation of exactly how those methods should work. Thanks in advance.
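To make that concrete, here is a rough sketch of what I have in mind, assuming the aplr package exposes APLRRegressor and that the explain methods follow your glassbox convention (names and signatures are illustrative, not final):

import aplr

class APLRRegressor(aplr.APLRRegressor):
    # Inherit fit/predict from the aplr package and add interpret's API.

    def explain_global(self, name=None):
        # Should return a serializable global explanation object.
        raise NotImplementedError

    def explain_local(self, X, y=None, name=None):
        # Should return per-sample explanations for the rows of X.
        raise NotImplementedError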

As for the more involved option you mentioned: it certainly sounds interesting and should be doable. Perhaps we should complete the wrapper first and return to that option at a later stage, as it will probably require more work?

paulbkoch commented 5 months ago

That makes sense @mathias-von-ottenbreit. There's a bit of documentation (not really enough) on the API at https://interpret.ml/docs/framework.html. The part that might interest you is near the bottom under "Interpret API".

A simple description of the Interpret API is that explain_local and explain_global are designed to return Python objects that can be serialized, and possibly reloaded on another machine. They both return objects that derive from FeatureValueExplanation, so you'll want to make an APLRExplanation class that derives from FeatureValueExplanation. When the user wants to visualize the explanations, they call our show function, which in turn calls the visualize function on your class (example: https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/glassbox/_linear.py#L370-L432). APLRExplanation should then create a plotly object inside the visualize function and return it. Interpret will then do the work of displaying the plotly object inside the Jupyter notebook cell.
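As a minimal sketch of the shape of such a class, assuming (as _linear.py does) that FeatureValueExplanation comes from interpret.api.templates, and with illustrative payload keys:

import plotly.graph_objects as go
from interpret.api.templates import FeatureValueExplanation

class APLRExplanation(FeatureValueExplanation):
    # The serialized explanation state is passed to __init__ by the model;
    # visualize() turns the part selected by `key` into a plotly figure.

    def visualize(self, key=None):
        data_dict = self.data(key)  # illustrative: fetch the selected term
        fig = go.Figure(
            go.Scatter(x=data_dict["names"], y=data_dict["scores"])
        )
        fig.update_layout(title=data_dict.get("title", ""))
        return fig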

Here's another Explanation class you should look at, the one that handles the EBM explanations. For APLR, you'll need to make a plotly graph somewhere in between the LinearExplanation and the EBMExplanation: https://github.com/interpretml/interpret/blob/716b8ccb95c5065109ee729a3ca3a29e675db07d/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L79

Happy to go into more detail if you get stuck or have other questions.

mathias-von-ottenbreit commented 4 months ago

Hi @paulbkoch.

I have forked the develop branch (is that the correct branch to start from?). But how do I build and install the package? I tried running "pip install ." from the python/interpret-core folder, but that gave the error message below.

error: [Errno 2] No such file or directory: '/home/mathiaso/Documents/mathias/git_projects/interpret/python/interpret-core/../../shared/vis/dist/interpret-inline.js'

Thanks in advance.

paulbkoch commented 4 months ago

Hi @mathias-von-ottenbreit, yes, develop is the right branch to start from.

cd to the "shared/vis" directory and run:

npm install
npm run build-prod

To build the native library, run build.sh or build.bat in the root directory.

paulbkoch commented 4 months ago

After those are built, install with these flags:

pip install -e .[debug,notebook,plotly,lime,sensitivity,shap,linear,skoperules,treeinterpreter,dash,testing]

Docs at: https://interpret.ml/docs/installation-guide.html

mathias-von-ottenbreit commented 4 months ago

Hi again @paulbkoch.

I am almost ready to submit the wrapper from this fork.

The only thing that remains is writing documentation, for example something comparable to the documentation for the linear model.

Thanks in advance.

paulbkoch commented 4 months ago

Hi @mathias-von-ottenbreit, that’s great! Your PR will be the first one to add a new model to the InterpretML visualization system.

The places you identified are the same ones where I would suggest putting the docs.

mathias-von-ottenbreit commented 4 months ago

I have submitted this pull request. However, it fails the pytest checks in tests/glassbox/test_aplr.py, regardless of OS and Python version configuration. The tests run fine on my computer, but in the pytest jobs they fail on the first line that invokes the fit() method in APLR (the native APLR, not the wrapper). Based on the error messages, this could be, for example, a memory issue, with too little memory on the machines that run the pytests. @paulbkoch, what can I do here to figure this out? Could you perhaps try to run the test functions in test_aplr.py to see if they work on your end? Thanks in advance.

paulbkoch commented 4 months ago

I tried installing it on my laptop (Windows with 32 GB ram). It seems to crash in the APLR native code here too:

pytest test_aplr.py
================================================= test session starts =================================================
platform win32 -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0
rootdir: C:\src\interpretaplr\interpret\python\interpret-core
configfile: pytest.ini
plugins: anyio-4.4.0, dash-2.17.1, cov-5.0.0, xdist-3.6.1
collected 2 items

test_aplr.py Windows fatal exception: access violation

Current thread 0x00003554 (most recent call first):
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\aplr\aplr.py", line 200 in fit
  File "C:\src\interpretaplr\interpret\python\interpret-core\tests\glassbox\test_aplr.py", line 21 in test_regression
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\python.py", line 162 in pytest_pyfunc_call
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_callers.py", line 103 in _multicall
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_hooks.py", line 513 in __call__
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\python.py", line 1632 in runtest
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 173 in pytest_runtest_call
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_callers.py", line 103 in _multicall
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_hooks.py", line 513 in __call__
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 241 in <lambda>
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 341 in from_call
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 240 in call_and_report
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 135 in runtestprotocol
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\runner.py", line 116 in pytest_runtest_protocol
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_callers.py", line 103 in _multicall
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_hooks.py", line 513 in __call__
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\main.py", line 364 in pytest_runtestloop
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_callers.py", line 103 in _multicall
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_hooks.py", line 513 in __call__
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\main.py", line 339 in _main
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\main.py", line 285 in wrap_session
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\main.py", line 332 in pytest_cmdline_main
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_callers.py", line 103 in _multicall
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\pluggy\_hooks.py", line 513 in __call__
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\config\__init__.py", line 178 in main
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\site-packages\_pytest\config\__init__.py", line 206 in console_main
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\Scripts\pytest.exe\__main__.py", line 7 in <module>
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\runpy.py", line 86 in _run_code
  File "C:\Users\paulkoch\AppData\Local\anaconda3\envs\interpret_pypi\lib\runpy.py", line 196 in _run_module_as_main

paulbkoch commented 4 months ago

In terms of suggestions:

1) I would probably first try installing APLR on a clean new machine that isn't your development system. Perhaps use a VM on your existing system if you don't have a second computer. During development you may have introduced some kind of system-specific requirement that isn't present on fresh systems.

2) If you can't replicate it on a local system, where debugging would be easier, you can use Azure Pipelines to try different things. This will of course be slower, since you have to wait a few minutes per attempt. If you modify the following lines, you can run your own script commands to do things like clone APLR from a GitHub testing branch that you control, build APLR in the Azure runners, and then install it using "pip install .". I'll squash your changes later, so feel free to make the commits messy in your PR until you're done: https://github.com/mathias-von-ottenbreit/interpret/blob/develop/azure-pipelines.yml#L184-L193

3) Add logging to your public APLR package and publish an APLR release with logging available. Then you can isolate where the crash is happening. We had similar issues early on in interpret, so I added a combined Python/native logging system. Here's our logging code; you could copy it into your repo if you like, or use something else:

https://github.com/interpretml/interpret/blob/develop/shared/libebm/unzoned/logging.cpp
https://github.com/interpretml/interpret/blob/develop/shared/libebm/unzoned/logging.h
https://github.com/interpretml/interpret/blob/73634c8c1e11c6c91486e228c3d0b01803a41b00/python/interpret-core/interpret/utils/_native.py#L75
https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/utils/_native.py#L167-L211
https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/utils/_native.py#L962-L972

The last option is obviously quite a bit more work, but a good logging system is essential, IMHO, for debugging segfaults in GitHub OSS projects, where you're lucky if a user will reply 2-3 times to help you diagnose a crash.
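If it helps, a cheap first step toward this is to bracket the native call with Python-side log lines, so a crash report at least shows the last stage reached. A minimal sketch (the wrapper function is illustrative, not part of aplr):

import logging

log = logging.getLogger("aplr")
logging.basicConfig(level=logging.INFO)

def fit_with_logging(model, X, y):
    # If the process dies inside model.fit, the first message is the last
    # thing visible in the captured test output, which localizes the crash.
    log.info("fit: entering native code")
    model.fit(X, y)
    log.info("fit: returned from native code")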

paulbkoch commented 4 months ago

Closing this issue since the APLR PR has been merged.