rclement / datasette-ml

A Datasette plugin providing an MLOps platform to train, eval and predict machine learning models
Apache License 2.0

As a SQLite extension #3

Closed · asg017 closed this 1 year ago

asg017 commented 1 year ago

I love the idea of doing ML tasks in SQLite, and would love to see this work as a SQLite extension! That way it's portable between programming languages, and usable outside of Datasette.

Though finding the right SQL API to use will be a challenge: a PostgresML-inspired option would involve a lot of special one-off scalar functions like your sqml_load_dataset() or sqml_train(), but I think we can take advantage of SQLite's virtual table mechanism to get a nicer API that plays well with Datasette.

For example, instead of:

SELECT sqml_train(
  'Iris prediction',
  'classification',
  'logistic_regression',
  'dataset_iris',
  'target'
) AS training;

We could instead have a virtual table module like ml_classification, that makes virtual tables like so:


-- defines the label + target columns, but not the data
create virtual table iris_prediction using ml_classification(
  algorithm="logistic_regression",
  sepal_length float,
  sepal_width float,
  petal_length float,
  petal_width float,
  class integer TARGET
);

-- insert training data to virtual table, stored in a shadow table
insert into iris_prediction select * from iris;

-- train the predictor
insert into iris_prediction(iris_prediction) values ('train');

-- use the predictor, passing in label columns with table function syntax
select class from iris_prediction(5.1, 3.5, 1.4, 0.2);

-- run predictor on entire table, and compare predicted/actual
select 
  prediction.class as predicted,
  iris.class as actual
from iris
join iris_prediction(
  iris.sepal_length, 
  iris.sepal_width, 
  iris.petal_length, 
  iris.petal_width
) as prediction;

So instead of making predictors with scalar functions, they're instead created with virtual tables.

Though looking at the above, maybe it's nicer to use a JSON array as input to the table function predictor:


-- syntax alternative: a single json array input column, instead of N inputs for N label columns
select class from iris_prediction(json_array(5.1, 3.5, 1.4, 0.2));

-- using generated columns to simplify syntax
create table iris(
  sepal_length float, 
  sepal_width float, 
  petal_length float, 
  petal_width float,
  class integer,
  -- generated column, creates JSON array of label columns
  labels as (
    json_array(sepal_length, sepal_width, petal_length, petal_width)
  )
);

-- run predictor on all rows of a table
select 
  prediction.class as predicted,
  iris.class as actual
from iris
join iris_prediction(iris.labels) as prediction
where predicted != actual;

Some other random thoughts:

  - Not sure how to handle splitting here: maybe as an argument in the virtual table constructor, and the virtual table handles it?
  - sqlite-loadable-rs would be great here, but it's a bit lacking: there's no good shadow table support there yet, and I'd imagine we'd need that here. Also not sure how many ML algorithms are available in Rust. Could also do it in C++, which probably has all the ML algorithms we need, and shadow tables are easier to use there.
  - Not sure if there's an easy way to "serialize" a model after it's trained. It'd be great for predictors to maintain the same results after you disconnect + reconnect. For sqlite-vss, the Faiss library has an API for serializing an index to a blob, which I store in a shadow table so it can live on across reconnects.

Would love to hear your thoughts! Also happy to do this work in a separate repository (with your guidance!), since it'll be a significant amount of non-Python code and I don't wanna override your work

rclement commented 1 year ago

Thanks @asg017 for the detailed feedback and suggestions!

> I love the idea of doing ML tasks in SQLite, and would love to see this work as a SQLite extension! That way it's portable between programming languages, and usable outside of Datasette.

Yes, this is the intended end goal, but we had to start somewhere!

> Though finding the right SQL API to use will be a challenge: a PostgresML-inspired option would involve a lot of special one-off scalar functions like your sqml_load_dataset() or sqml_train(), but I think we can take advantage of SQLite's virtual table mechanism to get a nicer API that plays well with Datasette.

Actually PostgresML uses both scalar functions (pgml.predict, pgml.predict_batch) and "eponymous virtual table"-like functions (pgml.train, pgml.load_dataset). The current implementation in datasette-ml uses scalar functions combined with JSON output when more than a scalar value is required (sqml_load_dataset, sqml_train, sqml_predict_batch).
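
For reference, here is roughly what that scalar + JSON style looks like for the batch case; the exact argument and output shapes below are assumptions for illustration, not the definitive API:

-- JSON in, JSON out: a scalar function returns a single value,
-- so batch predictions come back packed in a JSON array
select sqml_predict_batch(
  'Iris prediction',
  json_group_array(
    json_array(sepal_length, sepal_width, petal_length, petal_width)
  )
) as predictions
from dataset_iris;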

But I do want to go the virtual-table way! This would allow getting rid of the convoluted JSON output for the above-mentioned functions. FYI, I tried to use sqlite-vtfunc for a pure-Python implementation but could not manage to get it to execute any vtable code; no idea why (as it is a compiled Cython module, there might be some incompatibility with the SQLite runtime version).

> We could instead have a virtual table module like ml_classification, that makes virtual tables like so:

The named virtual-table syntax is kinda convoluted; couldn't we get by with eponymous virtual tables, since all the work needs to be done in shadow tables behind the scenes anyway? I need to think about it, as there is an experiment tracker behind the scenes.
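
To make the contrast concrete, here is a quick sketch; the eponymous variant simply reuses the sqml_train signature from above as a table-valued function:

-- named module: an explicit create step per predictor
create virtual table iris_prediction using ml_classification(
  algorithm="logistic_regression",
  class integer TARGET
);

-- eponymous module: no create step, callable directly like a table function,
-- with shadow tables doing the bookkeeping behind the scenes
select * from sqml_train(
  'Iris prediction',
  'classification',
  'logistic_regression',
  'dataset_iris',
  'target'
);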

> Though looking at the above, maybe it's nicer to use a JSON array as input to the table function predictor:

Wow! I did not know we could have generated columns like this in SQLite! This library will never cease to amaze me...

> Not sure how to handle splitting here: maybe as an argument in the virtual table constructor, and the virtual table handles it?

Yes, it makes sense for this to be an argument of the vtable, as with the current sqml_train function. We need to be able to handle different splitting strategies (shuffle, time-series) and ratios.
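
For instance, the virtual table constructor could take the strategy and ratio directly; split_strategy and split_ratio are hypothetical argument names here:

create virtual table iris_prediction using ml_classification(
  algorithm="logistic_regression",
  -- splitting handled by the virtual table itself
  split_strategy="shuffle",  -- or "time_series"
  split_ratio="0.8",
  sepal_length float,
  sepal_width float,
  petal_length float,
  petal_width float,
  class integer TARGET
);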

> sqlite-loadable-rs would be great here, but it's a bit lacking: there's no good shadow table support there yet, and I'd imagine we'd need that here. Also not sure how many ML algorithms are available in Rust. Could also do it in C++, which probably has all the ML algorithms we need, and shadow tables are easier to use there.

This is currently my biggest roadblock with sqlite-loadable-rs: there is no way to read from and write back to the database from within the registered functions. However, sqlite3_ext allows doing that (with some constraints).

Concerning the availability of ML algorithms: I think we should stick with the Python route for now, using pyo3 (this is what PostgresML does), even if it requires the ML packages to be available on the PYTHONPATH (whether in a virtual environment or system-wide).

> Not sure if there's an easy way to "serialize" a model after it's trained. It'd be great for predictors to maintain the same results after you disconnect + reconnect. For sqlite-vss, the Faiss library has an API for serializing an index to a blob, which I store in a shadow table so it can live on across reconnects.

The serialization mechanism is already implemented in datasette-ml: the sqml_models table contains every trained model, with the serialized model stored as a binary blob. This indeed allows performing predictions afterwards, as you would with a standard MLOps system.
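
So a trained predictor survives a disconnect and can be picked up again straight from the database; a quick sketch, where the column names (other than the table name) are assumptions for illustration:

-- fetch the most recent persisted model; its blob can be deserialized
-- to make predictions on a brand-new connection
select id, algorithm, length(model) as model_size_bytes
from sqml_models
order by id desc
limit 1;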

rclement commented 1 year ago

To give you some more background, the current implementation is heavily influenced by PostgresML and MLflow.

rclement commented 1 year ago

Here is my train of thought to start working on this:

  1. Extract the sqml module into a sqlite-ml Python package
  2. datasette-ml will require this module but stay unchanged for now
  3. Start experimenting again with Rust code to implement a native sqlite-ml extension, still distributed as a sqlite-ml Python package for easy deployment

I will dig up and clean up my sqlite-ml repo, where I already experimented with a native Rust SQLite extension.

rclement commented 1 year ago

I've extracted the sqml module into its own package sqlite-ml.

datasette-ml >= 0.1.1 now depends on sqlite-ml and is just a small wrapper to enable the SQLite extension within Datasette.

rclement commented 1 year ago

@asg017 I've pushed my initial experimentation building a native sqlite-ml extension using Rust + PyO3 + sqlite3_ext on the native-ext branch

I'm going to add you as a collaborator on the repo, feel free to experiment from there!

rclement commented 1 year ago

Closing this issue in favor of rclement/sqlite-ml#1