lockedata / mssql_ml_elf

Execution and Logging Framework (ELF) šŸ§ā€ā™€ļø for ML models in SQL Server
MIT License
logging monitoring mssql python rstats sql-server

Project Status: WIP – Initial development is in progress, but there
has not yet been a stable, usable release suitable for the
public.



Execution and Logging Framework (ELF) šŸ§ā€ā™€ļø

This project can be used as the basis for an operational ML training and prediction capability in SQL Server and Azure SQL. This functionality is bundled up in a database project so that it can be deployed against different targets and managed under source control.

As everyone has their own set of requirements, this can be used as-is or as a starting point for a discussion about what your requirements are and what the resulting data model should look like.

Example use | Key components | General

Example use

The [samples] schema shows how the base [ml] components can be used in a realistic scenario: training a feature engineering model and a predictive model, storing them, and then using them to generate predictions.

Key components

[model]

Store any reusable object like a predictive model along with information about how the object was created

The [model] table is a temporal table that contains a model object of some kind and information about how that model was developed.

A temporal table is a SQL Server 2016+ feature that maintains a queryable record of changes to the values stored in the table. This means we can have a model like churn_likelihood that we update on a regular basis, while older versions of the model object and the measures of fit associated with it remain retrievable for performing what-if analysis, tracking model or feature drift, or providing a record of how automated decisions (see Note 1) were made at a given point in time. Temporal queries (FOR SYSTEM_TIME) give insight into how a model has changed over time.
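As a rough sketch of what such a query could look like (the schema and column names below, including the period columns, are assumptions rather than the table's actual definition):

-- Hypothetical queries: column names such as [name], [model_object], [valid_from]
-- and [valid_to] are illustrative; use the columns defined in the project's table.
SELECT [name], [model_object], [valid_from], [valid_to]
FROM [ml].[model]
    FOR SYSTEM_TIME AS OF '2019-06-01T00:00:00'
WHERE [name] = 'churn_likelihood';

-- Full change history for the same model.
SELECT [name], [model_object], [valid_from], [valid_to]
FROM [ml].[model]
    FOR SYSTEM_TIME ALL
WHERE [name] = 'churn_likelihood'
ORDER BY [valid_from];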

The table contains columns for storing:

To help people publish models from different capabilities (see Note 2), there is an upsert_model stored procedure that takes the relevant information as inputs and handles inserting the model if a new name is provided, or updating the existing model if the name already exists.
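For illustration, a call might look something like the following; the parameter names are assumptions only, so check the procedure definition in the project for the actual signature:

-- Hypothetical call: the parameter names are assumptions, not the real signature.
DECLARE @serialised_model varbinary(max);  -- e.g. a model serialised by an R or Python training run

EXEC [ml].[upsert_model]
    @name        = 'churn_likelihood',
    @model       = @serialised_model,
    @description = 'Churn model retrained on the latest data';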

[prediction]

Store the results of any prediction request here with information about how the result was derived

The [prediction] table is designed as a flexible log store for the results of prediction workloads. You can also write predictions directly to other tables instead of, or as well as, this table.

The table contains columns for storing:

To make prediction logging easier to implement, there is an insert_prediction stored procedure that can be called from your own prediction stored procedures.
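A hypothetical call is sketched below; the parameter names are illustrative only and the actual signature should be taken from the procedure definition:

-- Hypothetical call: parameter names are illustrative only.
EXEC [ml].[insert_prediction]
    @model_name = 'churn_likelihood',
    @entity_id  = 42,      -- the customer the prediction relates to
    @prediction = 0.83;    -- the predicted churn probability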

[prediction_log]

Track prediction requests

The [prediction_log] table is for storing prediction requests and associating them with any errors or with the batch of predictions that was inserted.

The table contains columns for storing:

This log table should be useful for assessing prediction workload patterns, tying predictions back to requests, and tracking problems with the request process.

To facilitate use of the log table, two stored procedures have been created to ease prediction stored procedure development and reduce code duplication: begin_prediction_log and end_prediction_log top and tail a prediction process.
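As an illustration of that top-and-tail pattern, a prediction stored procedure might wrap its work roughly as follows; all names and parameters here are assumptions, so refer to the procedure definitions for the actual signatures:

-- Hypothetical wrapper: names and parameters are illustrative only.
DECLARE @log_id bigint;

-- Record that a prediction request has started.
EXEC [ml].[begin_prediction_log]
    @model_name = 'churn_likelihood',
    @log_id     = @log_id OUTPUT;

-- Run the prediction workload and insert the resulting predictions here,
-- e.g. via the insert_prediction procedure described above.

-- Record that the request has finished, tying the predictions back to this log entry.
EXEC [ml].[end_prediction_log]
    @log_id = @log_id;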

General

Compatibility

As this framework is intended to surround the ML Services components, you can only deploy it to SQL Server 2016 and later. Because the feature set has grown across releases (for example, 2016 supports R only, while 2017 adds Python), the version you have determines the variety of workloads and languages you will be able to deploy.

Installing R packages

To run the samples and similar stored procedures, any R dependencies need to be installed. One of the easiest ways is with the sqlmlutils package.

Command line

git clone git@github.com:microsoft/sqlmlutils.git

Then open RStudio

R

install.packages("remotes")
remotes::install_local("sqlmlutils/R/")
library(sqlmlutils)
connection <- connectionInfo(
  server= "rinsqldb.database.windows.net",
  database = "rinsqldb",
  uid = "",
  pwd = "")

sql_install.packages(connectionString = connection, 
    pkgs =c("tidyverse","recipes","sessioninfo","rsample","broom","jsonlite"), 
    verbose = TRUE, scope = "PUBLIC")

Contributing

Thanks for reading this section! Hopefully you're interested in contributing to this project. Whether you're a SQL person or a data science person, this is a chance to pick up or deepen skills in these areas. We've put together a bunch of help wanted issues that you could pick up to learn about unit testing, stored procedure writing, model development and more.

If you're ready to go and you'd like to get started, hit the Fork button now.

If you'd like to help but aren't quite sure how to get started you can also chat to us on Slack and we'll mentor you through the process.

Notes

  1. GDPR requires us to be able to tell people how we arrived at a decision, so we need to be able to retrieve models and values from different points in time.
  2. From experience, we recommend using the ODBC drivers for interacting with SQL Server from R rather than the Windows native SQL Server driver, as varbinary representations of models can have problems being transported to the database with that driver.
  3. I have structured things to work on the basis that either single predictions or batches of predictions will be made with consecutive IDs. This means that if you do batches of, say, 1,000 customers at a time, I'm expecting IDs 1-1000 first, then 1001-2000, and so on, rather than some other selection mechanism for batching customers.
  4. All interactions in the project are governed by our Code of Conduct and work is contributed under the MIT license.