This project can be used as the basis for an operational ML training and prediction capability in SQL Server and Azure SQL. This functionality is bundled up in a database project so that it can be deployed against different targets and managed under source control.
As everyone has their own set of requirements, this can be used as-is or as a starting point for discussing what your requirements are and what the resulting data model should look like.
Example use
The [samples] schema shows how the base [ml] components can be used in a realistic scenario: training a feature engineering model and a predictive model, storing these, and then using them to serve predictions.
The train_iris stored procedure takes a bulk of data, partitions it into training and testing data, performs feature engineering, generates a model, applies the feature engineering model to the test data, predicts for the test data, and then writes all the results to the [model] table. This stored procedure could be run on a schedule or inside an SSIS workload to regularly retrain on the latest data.
The score_iris stored procedure takes one or more rows of data, applies the feature engineering transformation model, and then uses the prediction model to score the row(s). Whilst it's doing this, it logs to the database and outputs all the scores and associated information, so the application requesting the predictions gets them back as the result of the stored procedure and we also have a record in the database for monitoring our solution.
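A single-row scoring request might then look like the sketch below; the iris measurement parameter names are hypothetical, as the procedure's signature isn't spelled out here:

-- Score one iris observation; the result set comes back to the caller
-- and a copy is logged to the database. Parameter names are hypothetical.
EXEC [samples].[score_iris]
    @sepal_length = 5.1,
    @sepal_width  = 3.5,
    @petal_length = 1.4,
    @petal_width  = 0.2;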
Key components

[model]
Store any reusable object, like a predictive model, along with information about how the object was created
The [model] table is a temporal table that contains some sort of model and information about how that model was developed.
A temporal table is a SQL Server 2016+ feature that maintains a queryable record of changes to the values stored in the table. This means we can have a model like churn_likelihood that we update on a regular basis, while older versions of the model object and their associated measures of fit remain retrievable for performing what-if analysis, tracking model or feature drift, or providing a record of how automated decisions (see Note 1) were made at a given point in time. Temporal queries give insight into model changes over time, as sketched below.
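For example, a drift or what-if analysis could pull back every version of a model with standard temporal syntax. This is a sketch that assumes the table sits in the [ml] schema and has a name column:

-- Retrieve the full history of one model, including superseded versions.
-- FOR SYSTEM_TIME is standard SQL Server 2016+ temporal syntax;
-- the [ml] schema and [name] column are assumptions.
SELECT *
FROM [ml].[model] FOR SYSTEM_TIME ALL
WHERE [name] = 'churn_likelihood';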
The table contains columns for storing items such as the output of serialize(lm()), the serialized, storable version of a linear regression model that would otherwise sit in memory.

To help people publish models from different capabilities (see Note 2), there is an upsert_model stored procedure that takes the relevant information as inputs and will handle inserting the model if a new name is provided, or updating the model if the name already exists.
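A publishing call might look like this sketch; the procedure name comes from above, but the schema and parameter names are assumptions:

-- Insert a new model by name, or update it if the name already exists.
-- The @model payload would normally come from e.g. R's serialize();
-- here it is a placeholder.
DECLARE @model varbinary(max) = 0x;

EXEC [ml].[upsert_model]
    @name  = 'churn_likelihood',
    @model = @model;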
[prediction]
Store the results of any prediction request here with information about how the result was derived
The [prediction] table is designed as a flexible log store for the results of prediction workloads. You can also write predictions directly to other tables, instead of or as well as this table.
The table contains columns for storing each prediction along with information about how the result was derived.
To make it easier to implement prediction logging, there is an insert_prediction stored procedure to use when developing your own prediction stored procedures.
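Inside a scoring procedure, each result could then be logged with a call along these lines; only the procedure's name is given above, so the schema and parameters are assumptions:

-- Log one prediction with enough context to trace how it was derived.
-- Schema and parameter names are hypothetical.
DECLARE @input nvarchar(max) = N'{"sepal_length": 5.1}';
DECLARE @score float = 0.97;

EXEC [ml].[insert_prediction]
    @model_name = 'iris',
    @input      = @input,
    @prediction = @score;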
[prediction_log]
Track prediction requests
The [prediction_log] table is for storing prediction requests and associating them with any errors or the batch of inserted predictions.
The table contains columns for storing the request details, any errors raised, and a reference to the batch of predictions that were inserted.
This log table should be useful for assessing prediction workload patterns, tying predictions back to requests, and tracking problems with the request process.
To facilitate use of the log table, two stored procedures have been created to ease prediction stored procedure development and reduce code duplication: begin_prediction_log and end_prediction_log top and tail prediction processes, as sketched below.
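Putting it together, a prediction stored procedure might be topped and tailed like this; beyond the two procedure names, everything in the sketch (schemas, parameters, the OUTPUT id) is an assumption:

-- Skeleton of a prediction procedure wrapped in the logging pair.
CREATE OR ALTER PROCEDURE [samples].[score_example]
    @input nvarchar(max)
AS
BEGIN
    DECLARE @log_id int;

    -- Top: record the incoming request and get an id to associate results with.
    EXEC [ml].[begin_prediction_log] @request = @input, @log_id = @log_id OUTPUT;

    -- ... apply the feature engineering model and score the input here ...

    -- Tail: close the log entry, tying any errors or inserted predictions back to it.
    EXEC [ml].[end_prediction_log] @log_id = @log_id;
END;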
General

As this is intended to surround the ML Services components, you can only deploy to SQL Server 2016 and beyond, but due to the evolving nature of the feature set in SQL Server, the version you have determines the variety of workloads and languages you will be able to deploy.
To run the samples and similar stored procedures, any R dependencies need to be installed. One of the easiest ways is with the sqlmlutils package.
git clone git@github.com:microsoft/sqlmlutils.git
Then open RStudio and run:
install.packages("remotes")
remotes::install_local("sqlmlutils/R/")
library(sqlmlutils)
connection <- connectionInfo(
  server   = "rinsqldb.database.windows.net",
  database = "rinsqldb",
  uid      = "",
  pwd      = "")
sql_install.packages(connectionString = connection,
                     pkgs = c("tidyverse", "recipes", "sessioninfo", "rsample", "broom", "jsonlite"),
                     verbose = TRUE, scope = "PUBLIC")
Thanks for reading this section! Hopefully you're interested in contributing to this project. Whether you're a SQL person or a data science person, this is a chance to pick up or deepen skills in these areas. We've put together a bunch of help wanted issues that you could pick up to learn about unit testing, stored procedure writing, model development, and more.
If you're ready to go and you'd like to get started, hit the Fork button now.
If you'd like to help but aren't quite sure how to get started you can also chat to us on Slack and we'll mentor you through the process.
Note: you may need to change the default SQL Server driver when connecting, as varbinary representations of models can have problems being transported to the DB using this driver.