Closed asg017 closed 1 year ago
Thanks @asg017 for the detailed feedback and suggestions!
I love the idea of doing ML tasks in SQLite, and would love to see this work as a SQLite extension! That way it's portable between programming languages, and usable outside of Datasette.
Yes this is the intended end-goal but we had to start somewhere!
Though finding the right SQL API to use will be a challenge: a PostgresML-inspired option would involve a lot of special one-off scalar functions like your sqml_load_dataset() or sqml_train(), but I think we can take advantage of SQLite's virtual table mechanism to get a nicer API that plays well with Datasette.
Actually PostgresML uses both scalar functions (pgml.predict
, pgml.predict_batch
) and "eponymous virtual table"-like functions (pgml.train
, pgml.load_dataset
). The current implementation in datasette-ml
uses scalar functions combined with JSON output when more than a scalar value is required (sqml_load_dataset
, sqml_train
, sqml_predict_batch
).
But I do want to go the virtual-table way! This would allow to get rid of convoluted JSON output for the above mentioned functions. FYI, I tried to use sqlite-vtfunc
for a pure-Python implementation but could not manage to get it to execute any vtable code, no idea why (as it is a compiled Cython module, there might be some issue with SQLite runtime version).
We could instead have a virtual table module like ml_classification, that makes virtual tables like so:
The named virtual-table syntax is kinda convoluted, couldn't we get by using eponymous virtual-tables as all the work need to be done in shadow tables behind the scenes anyway? I need to think about it as there is an experiment tracker behind the scenes.
Though looking at the above, maybe it's nicer to use JSON array as inputs to the table function predictor:
Wow! I did not know we could have generated columns like this in SQLite! This library will never cease to amaze me...
Not sure how to handle splitting here: maybe as an argument in the virtual table constructor, and the virtual table handles it?
Yes this makes sens to be an argument of the vtable, as the current sqml_train
function. We need to be able to handle different splitting strategies (shuffle, time-series) and ratios.
sqlite-loadable-rs would be great here, but it's a bit lacking: There's not good shadow table support there yet, and I'd imagine we'd need that here. Also not sure how many ML algorithms are available in Rust. Could also do it in C++, which probably has all the ML algorithms we need, and easier to use shadow tables there
This is currently my biggest roadblock with sqlite-loadable-rs
: there is no way to read and write back to the database from the registered functions mechanism. However, sqlite3_ext
allows to do that (with some constraints).
Concerning the availability of ML algorithms: I think we should stick to the Python route for now, using pyo3
(this is what PostgresML does), even if it requires ML packages to be available in PYTHONPATH
(whether a virtual environment or the system).
Not sure if there's an easy way to "serialize" a model after it's trained. It'd be great for predictors to maintain the same results after you disconnect + reconnect. For sqlite-vss the Faiss library has an API for serializing an index to a blob, which I store in a shadow table so it can live on across reconnects
The serialization mechanism is already implemented in datasette-ml
: the sqml_models
table contains every trained model with its serialized model as binary blob. This indeed allows to perform predictions afterwards as you would with a standard MLOps system.
To give you some more background, the current implementation is heavily influenced by PostgresML and MLFlow:
Here is my train of thought to start working on this:
sqml
module into a sqlite-ml
Python packagedatasette-ml
will require this module but stay unchanged for nowsqlite-ml
extension, still distributed as a sqlite-ml
Python package for easy deploymentI will dig-up and clean up my sqlite-ml
repo where I already experimented with a native Rust SQLite extension.
I've extracted the sqml
module into its own package sqlite-ml
.
datasette-ml >= 0.1.1
now depends on sqlite-ml
and is just a small wrapper to enable the SQLite extension within Datasette.
@asg017 I've pushed my initial experimentation building a native sqlite-ml
extension using Rust + PyO3 + sqlite3_ext on the native-ext
branch
I'm going to add you as a collaborator on the repo, feel free to experiment from there!
Closing this issue in favor of rclement/sqlite-ml#1
I love the idea of doing ML tasks in SQLite, and would love to see this work as a SQLite extension! That way it's portable between programming languages, and usable outside of Datasette.
Though finding the right SQL API to use will be a challenge: a PostgresML-inspired option would involve a lot of special one-off scalar functions like your
sqml_load_dataset()
orsqml_train()
, but I think we can take advantage of SQLite's virtual table mechanism to get a nicer API that plays well with Datasette.For example, instead of:
We could instead have a virtual table module like
ml_classification
, that makes virtual tables like so:So instead of making predictors with scalar functions, they're instead created with virtual tables.
Though looking at the above, maybe it's nicer to use JSON array as inputs to the table function predictor:
Some other random thoughts:
sqlite-loadable-rs
would be great here, but it's a bit lacking: There's not good shadow table support there yet, and I'd imagine we'd need that here. Also not sure how many ML algorithms are available in Rust. Could also do it in C++, which probably has all the ML algorithms we need, and easier to use shadow tables theresqlite-vss
the Faiss library has an API for serializing an index to a blob, which I store in a shadow table so it can live on across reconnectsWould love to hear your thoughts! Also happy to do this work in a separate repository (with your guidance!), since it'll be a significant amount of non-python code and I don't wanna override your work