rclement / sqlite-ml

An SQLite extension for machine learning
Apache License 2.0
47 stars 2 forks source link

As a loadable extension (with Rust) #1

Open asg017 opened 1 year ago

asg017 commented 1 year ago

Hey @rclement ! Sorry for the delay, but here's a continuation of the discussions from https://github.com/rclement/datasette-ml/issues/3

Supporting queries in sqlite-loadable-rs

I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can make queries like CREATE TABLE / INSERT INTO in the extension itself, which I know is a blocker for this work. Not sure if I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension

In Pure Rust, no Python?

The native-ext branch uses PyO3 for ML algorithms, which will be great to get started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which can be useful here, but can come with problems:

There's the linfa project that could help us move to a pure-Rust extension. It's the more complete scikit-like Rust crate I can find, so we can use those algorithms in sqlite-ml to remove the Python dependency.

I played around with it a bit and it seems pretty advanced, most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit has, but probably enough for this?

Defining the SQL API

I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, but definitely not complete:

ml_experiments and friends

I think all these tables can be shadow tables that are read-only to users, since I dont think users will ever need to insert/update rows in these directly:

ml_train

ml_train can be an eponymous virtual table that users can INSERT into, to create new experiments/models.

insert into ml_train 
    values (
      'Iris prediction',  -- name of experiemnt
      'classification',  -- prediction type
      'logistic_regression',  -- algorithm
      'ml_datasets.iris',  -- source data. can be a table/view name, optional schema
      'target' -- target column
    );

ml_predict

A table function that takes in a JSON array of values and predicts the target column:

select 
  iris.*, 
  prediction.prediction
from ml_datasets.iris as iris
from ml_predict(
  'Iris prediction', 
  json_array( 
    iris.sepal_length, 
    iris.sepal_width, 
    iris.petal_length, 
    iris.petal_width
  )
) from prediction;

ml_load_dataset

If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:

select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;

Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:

-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesnt exust
attach database ml_datasets_path() as ml_datasets;

select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;

Again, some very loose thoughts and notes, feel free to ask about anything!

rclement commented 1 year ago

Thanks @asg017 for all your thoughts and inputs!

Pure-Rust native extension

I do agree that going a pure-Rust route should be the way to go:

With all these, the pure-Rust route seems to be easily achievable in the near future!