Open asg017 opened 1 year ago
Thanks @asg017 for all your thoughts and inputs!
I do agree that going a pure-Rust route should be the way to go:
Fiddling with PyO3 is definitely not fun beyond simple examples, and it requires a complete Python environment with the proper dependencies outside the native binary, which is cumbersome (PostgresML does this, and its documentation is not very forthcoming about it).
I quickly checked out Linfa a few weeks ago and I also think this library might be a good pure-Rust alternative to Scikit-Learn. The only immediate things not provided by Linfa are the built-in datasets to get started, but those can be directly embedded in CSV format within the native extension.
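A rough sketch of the embedding idea. In the Rust extension the CSV text could be compiled straight into the binary (e.g. with `include_str!`); plain Python stands in here, and the iris-style columns are just illustrative:

```python
import csv
import io

# Hypothetical built-in dataset embedded as a CSV string constant.
# In the Rust extension this text would be compiled into the binary;
# the column names below are illustrative, not a fixed schema.
IRIS_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.3,3.3,6.0,2.5,virginica
"""

def load_embedded_dataset(text: str) -> list:
    """Parse an embedded CSV dataset into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_embedded_dataset(IRIS_CSV)
print(len(rows), "rows,", len(rows[0]), "columns")
```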
In the future, there is even the Rust TensorFlow library to enable deep learning integration.
With all these, the pure-Rust route seems to be easily achievable in the near future!
Hey @rclement ! Sorry for the delay, but here's a continuation of the discussions from https://github.com/rclement/datasette-ml/issues/3
## Supporting queries in `sqlite-loadable-rs`

I created a new issue in `sqlite-loadable-rs` to track adding querying support to that library. That way we can run queries like `CREATE TABLE`/`INSERT INTO` in the extension itself, which I know is a blocker for this work. Not sure if I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension.

## In Pure Rust, no Python?
The `native-ext` branch uses PyO3 for ML algorithms, which will be great to get started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which could be useful here, but it can come with its own problems.

There's the linfa project that could help us move to a pure-Rust extension. It's the most complete scikit-learn-like Rust crate I can find, so we could use those algorithms in `sqlite-ml` to remove the Python dependency. I played around with it a bit and it seems pretty advanced: most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit-learn has, but probably enough for this?
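To make the persistence idea concrete, here's a rough sketch with Python's built-in `sqlite3` standing in for the extension. The blob is a stand-in for whatever byte array a serialized linfa model would produce (not the crate's actual output):

```python
import os
import sqlite3
import tempfile

# Stand-in bytes for a serialized trained model; in the real extension
# these would come from linfa's serialization support, where available.
model_bytes = b"\x01fake-serialized-model"

path = os.path.join(tempfile.mkdtemp(), "ml.db")

# Connection 1: "train" and persist the model as a BLOB.
db = sqlite3.connect(path)
db.execute("CREATE TABLE ml_models(name TEXT PRIMARY KEY, model BLOB)")
db.execute("INSERT INTO ml_models VALUES (?, ?)", ("demo", model_bytes))
db.commit()
db.close()

# Connection 2: load the same trained model back.
db2 = sqlite3.connect(path)
restored = db2.execute(
    "SELECT model FROM ml_models WHERE name = ?", ("demo",)
).fetchone()[0]
print(restored == model_bytes)
```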
## Defining the SQL API
I've been thinking about a few different ways to express `sqlite-ml` operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, but definitely not complete.

### `ml_experiments` and friends

I think all these tables can be shadow tables that are read-only to users, since I don't think users will ever need to insert/update rows in these directly.
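A loose sketch of what those read-only bookkeeping tables could contain (column names are my guesses, not a settled schema), runnable here with plain `sqlite3`:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical shadow-table schemas; in the extension these would be
# created and maintained internally, read-only from the user's side.
db.executescript("""
CREATE TABLE ml_experiments(
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  created_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE ml_models(
  id INTEGER PRIMARY KEY,
  experiment_id INTEGER REFERENCES ml_experiments(id),
  algorithm TEXT,       -- e.g. 'linear_regression'
  model BLOB            -- serialized trained model
);
""")
tables = {row[0] for row in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))
```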
### `ml_train`

`ml_train` can be an eponymous virtual table that users can `INSERT` into, to create new experiments/models.
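Roughly this call pattern, with a plain table standing in for the eponymous virtual table (all column names are hypothetical; in the real extension the `INSERT` would trigger training, not just store a row):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Stand-in for the ml_train eponymous virtual table.
db.execute("""
CREATE TABLE ml_train(
  experiment TEXT,   -- hypothetical: experiment name
  algorithm  TEXT,   -- hypothetical: which algorithm to train
  dataset    TEXT,   -- hypothetical: source table of training data
  target     TEXT    -- hypothetical: column to predict
)""")
db.execute(
    "INSERT INTO ml_train VALUES (?, ?, ?, ?)",
    ("exp1", "linear_regression", "my_data", "price"),
)
row = db.execute("SELECT algorithm, target FROM ml_train").fetchone()
print(row)
```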
### `ml_predict`

A table function that takes in a JSON array of values and predicts the target column.
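As a shape check, here's the call mocked with `sqlite3` (a scalar function rather than a full table function, and a trivial stand-in "model" that just sums the features):

```python
import json
import sqlite3

def ml_predict(model_name: str, features_json: str) -> float:
    # Stand-in inference: sum the features. A real implementation
    # would look up the named trained model and run it on the array.
    features = json.loads(features_json)
    return float(sum(features))

db = sqlite3.connect(":memory:")
db.create_function("ml_predict", 2, ml_predict)
(prediction,) = db.execute(
    "SELECT ml_predict('demo', '[1.0, 2.0, 3.5]')"
).fetchone()
print(prediction)
```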
### `ml_load_dataset`

If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one.
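The usage could look something like this; a pre-populated table stands in for the per-dataset eponymous virtual table, and `ml_dataset_iris` is a hypothetical name:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Stand-in for a built-in eponymous virtual table, one per dataset.
db.execute("CREATE TABLE ml_dataset_iris(sepal_length REAL, species TEXT)")
db.executemany(
    "INSERT INTO ml_dataset_iris VALUES (?, ?)",
    [(5.1, "setosa"), (6.3, "virginica")],
)
rows = db.execute(
    "SELECT * FROM ml_dataset_iris ORDER BY sepal_length"
).fetchall()
print(rows)
```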
Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves.
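That flow is already plain SQLite, so it can be sketched directly (the file name and `iris` table are hypothetical):

```python
import os
import sqlite3
import tempfile

# Build the hypothetical pre-built datasets file we would ship.
path = os.path.join(tempfile.mkdtemp(), "ml_datasets.db")
ds = sqlite3.connect(path)
ds.execute("CREATE TABLE iris(sepal_length REAL, species TEXT)")
ds.execute("INSERT INTO iris VALUES (5.1, 'setosa')")
ds.commit()
ds.close()

# A user attaches it next to their own database and queries it.
db = sqlite3.connect(":memory:")
db.execute("ATTACH DATABASE ? AS ml_datasets", (path,))
rows = db.execute("SELECT * FROM ml_datasets.iris").fetchall()
print(rows)
```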
Again, some very loose thoughts and notes, feel free to ask about anything!