lsorber / neo-ls-svm

Neo LS-SVM is a modern Least-Squares Support Vector Machine implementation
MIT License


Neo LS-SVM

Neo LS-SVM is a modern Least-Squares Support Vector Machine implementation in Python that offers several benefits over sklearn's classic sklearn.svm.SVC classifier and sklearn.svm.SVR regressor:

  1. ⚡ Linear complexity in the number of training examples with Orthogonal Random Features.
  2. 🚀 Hyperparameter free: zero-cost optimization of the regularisation parameter γ and kernel parameter σ.
  3. 🏔️ Adds a new tertiary objective that minimizes the complexity of the prediction surface.
  4. 🎁 Returns the leave-one-out residuals and error for free after fitting.
  5. 🌀 Learns an affine transformation of the feature matrix to optimally separate the target's bins.
  6. 🪞 Can solve the LS-SVM both in the primal and dual space.
  7. 🌡️ Isotonically calibrated predict_proba.
  8. ✅ Conformally calibrated predict_quantiles and predict_interval.
  9. 🔔 Bayesian estimation of the predictive standard deviation with predict_std.
  10. 🐼 Pandas DataFrame output when the input is a pandas DataFrame.
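
To give some intuition for the fourth point: least-squares models are linear smoothers, and for those the leave-one-out residuals follow from a single fit through the classic identity `residual_i / (1 - h_ii)`, where `h_ii` is the diagonal of the hat matrix. This README does not show how Neo LS-SVM derives its own leave-one-out residuals, so the sketch below only illustrates the principle on a plain one-dimensional least-squares line. The helper names and the toy data are assumptions made for illustration and are not part of the `neo_ls_svm` API.

```python
# Sketch: leave-one-out (LOO) residuals of a least-squares line without refitting.
# Assumption: a plain 1-D least-squares fit stands in for the LS-SVM here.


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def loo_residuals_closed_form(x, y):
    """LOO residuals from a single fit: residual_i / (1 - h_ii)."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    intercept, slope = fit_line(x, y)
    residuals = []
    for xi, yi in zip(x, y):
        h_ii = 1 / n + (xi - x_mean) ** 2 / sxx  # diagonal of the hat matrix
        residuals.append((yi - (intercept + slope * xi)) / (1 - h_ii))
    return residuals


def loo_residuals_brute_force(x, y):
    """LOO residuals by refitting n times, leaving one example out each time."""
    residuals = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        residuals.append(y[i] - (intercept + slope * x[i]))
    return residuals


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
fast = loo_residuals_closed_form(x, y)
slow = loo_residuals_brute_force(x, y)
# The two approaches agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The closed form costs one fit instead of n refits, which is the sense in which the residuals are "free".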

Using

Installing

First, install this package with:

pip install neo-ls-svm

Classification and regression

Then, you can import neo_ls_svm.NeoLSSVM as an sklearn-compatible binary classifier and regressor. Example usage:

from neo_ls_svm import NeoLSSVM
from pandas import get_dummies
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Binary classification example:
X, y = fetch_openml("churn", version=3, return_X_y=True, as_frame=True, parser="auto")
X_train, X_test, y_train, y_test = train_test_split(get_dummies(X), y, test_size=0.15, random_state=42)
model = NeoLSSVM().fit(X_train, y_train)
model.score(X_test, y_test)  # 93.1% (compared to sklearn.svm.SVC's 89.6%)

# Regression example:
X, y = fetch_openml("ames_housing", version=1, return_X_y=True, as_frame=True, parser="auto")
X_train, X_test, y_train, y_test = train_test_split(get_dummies(X), y, test_size=0.15, random_state=42)
model = NeoLSSVM().fit(X_train, y_train)
model.score(X_test, y_test)  # 82.4% (compared to sklearn.svm.SVR's -11.8%)

Predicting quantiles

Neo LS-SVM implements conformal prediction with a Bayesian nonconformity estimate to compute quantiles and prediction intervals for both classification and regression. Example usage:

# Predict the house prices and their quantiles.
ŷ_test = model.predict(X_test)
ŷ_test_quantiles = model.predict_quantiles(X_test, quantiles=(0.025, 0.05, 0.1, 0.9, 0.95, 0.975))

When the input data is a pandas DataFrame, the output is also a pandas DataFrame. For example, printing the head of ŷ_test_quantiles yields:

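| house_id | 0.025 | 0.05 | 0.1 | 0.9 | 0.95 | 0.975 |
|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| 1357 | 114283.0 | 124767.6 | 133314.0 | 203162.0 | 220407.5 | 245655.3 |
| 2367 | 85518.3 | 91787.2 | 93709.8 | 107464.3 | 108472.6 | 114482.3 |
| 2822 | 147165.9 | 157462.8 | 167193.1 | 243646.5 | 263324.4 | 291963.3 |
| 2126 | 81788.7 | 88738.1 | 91367.4 | 111944.9 | 114800.7 | 122874.5 |
| 1544 | 94507.1 | 108288.2 | 120184.3 | 222630.5 | 248668.2 | 283703.4 |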
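
For readers new to conformal prediction, here is a minimal sketch of one common way to turn a predictive mean and standard deviation into calibrated quantiles: normalize the residuals on held-out calibration data by the predicted standard deviation, take empirical quantiles of those scores, and rescale them for each new example. Whether Neo LS-SVM uses exactly this construction is an assumption, since this README does not spell it out. The function names and numbers below are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Smallest value v such that at least a fraction q of `values` is <= v."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]


def conformal_quantiles(mu_cal, std_cal, y_cal, mu_new, std_new, quantiles):
    """Quantiles for a new example from normalized calibration residuals."""
    # Nonconformity score: signed residual measured in units of predicted std.
    scores = [(y - m) / s for y, m, s in zip(y_cal, mu_cal, std_cal)]
    return {q: mu_new + std_new * empirical_quantile(scores, q) for q in quantiles}


# Hypothetical calibration data: predicted means, predicted stds, and true targets.
mu_cal = [10.0, 20.0, 30.0, 40.0, 50.0]
std_cal = [1.0, 2.0, 1.0, 2.0, 1.0]
y_cal = [11.0, 18.0, 30.5, 43.0, 49.0]

result = conformal_quantiles(mu_cal, std_cal, y_cal, mu_new=100.0, std_new=4.0, quantiles=(0.1, 0.5, 0.9))
print(result)  # {0.1: 96.0, 0.5: 102.0, 0.9: 106.0}
```

Examples with a larger predicted standard deviation get wider quantile spreads, which matches the varying widths in the table above.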

Let's visualize the predicted quantiles on the test set:

Code that generates the graph:

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Jupyter/IPython magic: render figures at retina resolution.
%config InlineBackend.figure_format = "retina"
plt.rcParams["font.size"] = 8
idx = (-ŷ_test.sample(50, random_state=42)).sort_values().index
y_ticks = list(range(1, len(idx) + 1))
plt.figure(figsize=(4, 5))
for j in range(3):
    end = ŷ_test_quantiles.shape[1] - 1 - j
    coverage = round(100 * (ŷ_test_quantiles.columns[end] - ŷ_test_quantiles.columns[j]))
    plt.barh(
        y_ticks,
        ŷ_test_quantiles.loc[idx].iloc[:, end] - ŷ_test_quantiles.loc[idx].iloc[:, j],
        left=ŷ_test_quantiles.loc[idx].iloc[:, j],
        label=f"{coverage}% Prediction interval",
        color=["#b3d9ff", "#86bfff", "#4da6ff"][j],
    )
plt.plot(y_test.loc[idx], y_ticks, "s", markersize=3, markerfacecolor="none", markeredgecolor="#e74c3c", label="Actual value")
plt.plot(ŷ_test.loc[idx], y_ticks, "s", color="blue", markersize=0.6, label="Predicted value")
plt.xlabel("House price")
plt.ylabel("Test house index")
plt.xlim(0, 500e3)
plt.yticks(y_ticks, y_ticks)
plt.tick_params(axis="y", labelsize=6)
plt.grid(axis="x", color="lightsteelblue", linestyle=":", linewidth=0.5)
plt.gca().xaxis.set_major_formatter(ticker.StrMethodFormatter("${x:,.0f}"))
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.legend()
plt.tight_layout()
plt.show()
```

Predicting intervals

In addition to quantile prediction, you can use predict_interval to predict conformally calibrated prediction intervals. Compared to quantiles, these focus on reliable coverage over quantile accuracy. Example usage:

# Compute prediction intervals for the houses in the test set.
ŷ_test_interval = model.predict_interval(X_test, coverage=0.95)

# Measure the coverage of the prediction intervals on the test set
coverage = ((ŷ_test_interval.iloc[:, 0] <= y_test) & (y_test <= ŷ_test_interval.iloc[:, 1])).mean()
print(coverage)  # 94.3%

When the input data is a pandas DataFrame, the output is also a pandas DataFrame. For example, printing the head of ŷ_test_interval yields:

| house_id | 0.025 | 0.975 |
|---------:|---------:|---------:|
| 1357 | 114283.0 | 245849.2 |
| 2367 | 85518.3 | 114411.4 |
| 2822 | 147165.9 | 292179.2 |
| 2126 | 81788.7 | 122838.1 |
| 1544 | 94507.1 | 284062.6 |
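
To illustrate what "reliable coverage" means, the sketch below implements textbook split conformal prediction with absolute residuals as the nonconformity score. The interval half-width is the `ceil((n + 1) * coverage)`-th smallest calibration residual, which is what gives the finite-sample coverage guarantee. This is a generic construction rather than Neo LS-SVM's implementation, and the simulated data, the stand-in "model", and the function names are all assumptions.

```python
import math
import random


def conformal_half_width(abs_residuals, coverage):
    """Half-width of a split conformal interval with the finite-sample correction."""
    n = len(abs_residuals)
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        # Too few calibration examples to guarantee the requested coverage.
        return math.inf
    return sorted(abs_residuals)[rank - 1]


rng = random.Random(42)


def simulate(n):
    """Hypothetical data: the 'model' predicts 2 * x, the target adds Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(2 * x, 2 * x + rng.gauss(0, 1)) for x in xs]  # (prediction, target) pairs


calibration = simulate(500)
test = simulate(2000)

half_width = conformal_half_width([abs(y - p) for p, y in calibration], coverage=0.9)
covered = [p - half_width <= y <= p + half_width for p, y in test]
empirical_coverage = sum(covered) / len(covered)
print(f"half-width={half_width:.2f}, empirical coverage={empirical_coverage:.1%}")
```

The empirical coverage lands near the requested 90% but not exactly on it, for the same reason the 95% intervals above cover 94.3% of the test set: the guarantee holds on average over calibration and test draws.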

Benchmarks

We select all binary classification and regression datasets below 1M entries from the AutoML Benchmark. Each dataset is split into 85% for training and 15% for testing. We apply skrub.TableVectorizer as a preprocessing step for neo_ls_svm.NeoLSSVM, sklearn.svm.SVC, and sklearn.svm.SVR to vectorize the pandas DataFrame training data into a NumPy array. Models are fitted only once on each dataset, with their default settings and no hyperparameter tuning.

Binary classification ROC-AUC on 15% test set:

| dataset | LGBMClassifier | NeoLSSVM | SVC |
|---------------------------------:|-----------------:|-----------------:|-----------------:|
| ada | 🥈 90.9% (0.1s) | 🥇 90.9% (1.9s) | 83.1% (4.5s) |
| adult | 🥇 93.0% (0.5s) | 🥈 89.0% (15.7s) | / |
| amazon_employee_access | 🥇 85.6% (0.5s) | 🥈 64.5% (9.0s) | / |
| arcene | 🥈 78.0% (0.6s) | 70.0% (6.3s) | 🥇 82.0% (4.0s) |
| australian | 🥇 88.3% (0.2s) | 79.9% (1.7s) | 🥈 81.9% (0.1s) |
| bank-marketing | 🥇 93.5% (0.5s) | 🥈 91.0% (11.8s) | / |
| blood-transfusion-service-center | 62.0% (0.3s) | 🥇 71.0% (2.2s) | 🥈 69.7% (0.1s) |
| churn | 🥇 91.7% (0.6s) | 🥈 81.0% (2.1s) | 70.6% (2.9s) |
| click_prediction_small | 🥇 67.7% (0.5s) | 🥈 66.6% (10.9s) | / |
| jasmine | 🥇 86.1% (0.3s) | 79.5% (1.9s) | 🥈 85.3% (7.4s) |
| kc1 | 🥇 78.9% (0.3s) | 🥈 76.6% (1.4s) | 45.7% (0.6s) |
| kr-vs-kp | 🥇 100.0% (0.6s) | 99.2% (1.6s) | 🥈 99.4% (2.3s) |
| madeline | 🥇 93.1% (0.5s) | 65.6% (1.9s) | 🥈 82.5% (19.8s) |
| ozone-level-8hr | 🥈 91.2% (0.4s) | 🥇 91.6% (1.7s) | 72.9% (0.6s) |
| pc4 | 🥇 95.3% (0.3s) | 🥈 90.9% (1.5s) | 25.7% (0.3s) |
| phishingwebsites | 🥇 99.5% (0.5s) | 🥈 98.9% (3.6s) | 98.7% (10.0s) |
| phoneme | 🥇 95.6% (0.3s) | 🥈 93.5% (2.1s) | 91.2% (2.0s) |
| qsar-biodeg | 🥇 92.7% (0.4s) | 🥈 91.1% (5.2s) | 86.8% (0.3s) |
| satellite | 🥈 98.7% (0.2s) | 🥇 99.5% (1.9s) | 98.5% (0.4s) |
| sylvine | 🥇 98.5% (0.2s) | 🥈 97.1% (2.0s) | 96.5% (3.8s) |
| wilt | 🥈 99.5% (0.2s) | 🥇 99.8% (1.8s) | 98.9% (0.5s) |

Regression R² on 15% test set:

| dataset | LGBMRegressor | NeoLSSVM | SVR |
|------------------------------:|----------------:|-----------------:|-----------------:|
| abalone | 🥈 56.2% (0.1s) | 🥇 59.5% (2.5s) | 51.3% (0.7s) |
| boston | 🥇 91.7% (0.2s) | 🥈 89.6% (1.1s) | 35.1% (0.0s) |
| brazilian_houses | 🥈 55.9% (0.3s) | 🥇 88.4% (3.7s) | 5.4% (7.0s) |
| colleges | 🥇 58.5% (0.4s) | 🥈 42.2% (6.6s) | 40.2% (15.1s) |
| diamonds | 🥇 98.2% (0.3s) | 🥈 95.2% (13.7s) | / |
| elevators | 🥇 87.7% (0.5s) | 🥈 82.6% (6.5s) | / |
| house_16h | 🥇 67.7% (0.4s) | 🥈 52.8% (6.0s) | / |
| house_prices_nominal | 🥇 89.0% (0.3s) | 🥈 78.3% (2.1s) | -2.9% (1.2s) |
| house_sales | 🥇 89.2% (0.4s) | 🥈 77.8% (5.9s) | / |
| mip-2016-regression | 🥇 59.2% (0.4s) | 🥈 34.9% (5.8s) | -27.3% (0.4s) |
| moneyball | 🥇 93.2% (0.3s) | 🥈 91.3% (1.1s) | 0.8% (0.2s) |
| pol | 🥇 98.7% (0.3s) | 🥈 74.9% (4.6s) | / |
| quake | -10.7% (0.2s) | 🥇 -1.0% (1.6s) | 🥈 -10.7% (0.1s) |
| sat11-hand-runtime-regression | 🥇 78.3% (0.4s) | 🥈 61.7% (2.1s) | -56.3% (5.1s) |
| sensory | 🥇 29.2% (0.1s) | 3.0% (1.6s) | 🥈 16.4% (0.0s) |
| socmob | 🥇 79.6% (0.2s) | 🥈 72.5% (6.6s) | 30.8% (0.1s) |
| space_ga | 🥇 70.3% (0.3s) | 🥈 43.6% (1.5s) | 35.9% (0.2s) |
| tecator | 🥈 98.3% (0.1s) | 🥇 99.4% (0.9s) | 78.5% (0.0s) |
| us_crime | 🥈 62.8% (0.6s) | 🥇 63.0% (2.3s) | 6.7% (0.8s) |
| wine_quality | 🥇 45.6% (0.2s) | 🥈 36.5% (2.8s) | 16.4% (1.6s) |
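
As a reading aid for the two metrics reported in the benchmark: ROC-AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, and R² compares a model's squared error to that of always predicting the mean, so it turns negative when the model does worse than that baseline. The plain-Python sketch below spells out both definitions. The function names are hypothetical, and in practice you would likely use `sklearn.metrics.roc_auc_score` and `sklearn.metrics.r2_score` instead.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: 1 is perfect, 0 matches predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
r2_perfect = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
r2_mean = r_squared([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
r2_reversed = r_squared([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
print(auc, r2_perfect, r2_mean, r2_reversed)  # 0.75 1.0 0.0 -3.0
```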

Contributing

Prerequisites

1. Set up Git to use SSH
    1. [Generate an SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent#generating-a-new-ssh-key) and [add the SSH key to your GitHub account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
    1. Configure SSH to automatically load your SSH keys:
        ```sh
        cat << EOF >> ~/.ssh/config
        Host *
          AddKeysToAgent yes
          IgnoreUnknown UseKeychain
          UseKeychain yes
        EOF
        ```
2. Install Docker
    1. [Install Docker Desktop](https://www.docker.com/get-started).
        - Enable _Use Docker Compose V2_ in Docker Desktop's preferences window.
        - _Linux only_:
            - Export your user's user id and group id so that [files created in the Dev Container are owned by your user](https://github.com/moby/moby/issues/3206):
                ```sh
                cat << EOF >> ~/.bashrc
                export UID=$(id --user)
                export GID=$(id --group)
                EOF
                ```
3. Install VS Code or PyCharm
    1. [Install VS Code](https://code.visualstudio.com/) and [VS Code's Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers). Alternatively, install [PyCharm](https://www.jetbrains.com/pycharm/download/).
    2. _Optional:_ install a [Nerd Font](https://www.nerdfonts.com/font-downloads) such as [FiraCode Nerd Font](https://github.com/ryanoasis/nerd-fonts/tree/master/patched-fonts/FiraCode) and [configure VS Code](https://github.com/tonsky/FiraCode/wiki/VS-Code-Instructions) or [configure PyCharm](https://github.com/tonsky/FiraCode/wiki/Intellij-products-instructions) to use it.

Development environments

The following development environments are supported:

1. ⭐️ _GitHub Codespaces_: click on _Code_ and select _Create codespace_ to start a Dev Container with [GitHub Codespaces](https://github.com/features/codespaces).
2. ⭐️ _Dev Container (with container volume)_: click on [Open in Dev Containers](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/lsorber/neo-ls-svm) to clone this repository in a container volume and create a Dev Container with VS Code.
3. _Dev Container_: clone this repository, open it with VS Code, and run Ctrl/⌘ + ⇧ + P → _Dev Containers: Reopen in Container_.
4. _PyCharm_: clone this repository, open it with PyCharm, and [configure Docker Compose as a remote interpreter](https://www.jetbrains.com/help/pycharm/using-docker-compose-as-a-remote-interpreter.html#docker-compose-remote) with the `dev` service.
5. _Terminal_: clone this repository, open it with your terminal, and run `docker compose up --detach dev` to start a Dev Container in the background, and then run `docker compose exec dev zsh` to open a shell prompt in the Dev Container.

Developing

- This project follows the [Conventional Commits](https://www.conventionalcommits.org/) standard to automate [Semantic Versioning](https://semver.org/) and [Keep A Changelog](https://keepachangelog.com/) with [Commitizen](https://github.com/commitizen-tools/commitizen).
- Run `poe` from within the development environment to print a list of [Poe the Poet](https://github.com/nat-n/poethepoet) tasks available to run on this project.
- Run `poetry add {package}` from within the development environment to install a run time dependency and add it to `pyproject.toml` and `poetry.lock`. Add `--group test` or `--group dev` to install a CI or development dependency, respectively.
- Run `poetry update` from within the development environment to upgrade all dependencies to the latest versions allowed by `pyproject.toml`.
- Run `cz bump` to bump the package's version, update the `CHANGELOG.md`, and create a git tag.