lockedata / mssql_ml_elf

Execution and Logging Framework (ELF) šŸ§ā€ā™€ļø for ML models in SQL Server
MIT License
logging monitoring mssql python rstats sql-server

Project Status: WIP – Initial development is in progress, but there
has not yet been a stable, usable release suitable for the
public.



Execution and Logging Framework (ELF) šŸ§ā€ā™€ļø

This project can be used as the basis for an operational ML training and prediction capability in SQL Server and Azure SQL. This functionality is bundled up in a database project so that it can be deployed against different targets and managed under source control.

As everyone has their own set of requirements, this can be used as-is or as a starting point for a discussion about what your requirements are and what the resulting data model should look like.

Example use | Key components | General

Example use

The [samples] schema shows how the base [ml] components can be used in a realistic scenario: training a feature engineering model and a predictive model, storing them, and then using them to generate predictions.

Key components

[model]

Store any reusable object like a predictive model along with information about how the object was created

The [model] table is a temporal table that contains a model object of some kind and information about how that model was developed.

A temporal table is a SQL Server 2016+ feature that maintains a queryable record of changes to the values stored in the table. This means we can have a model like churn_likelihood that we update on a regular basis, while older versions of the model object and the measures of fit associated with it remain retrievable for performing what-if analysis, tracking model or feature drift, or providing a record of how automated decisions (see Note 1) were made at a given point in time. Temporal queries (FOR SYSTEM_TIME) give insight into how a model has changed over time.
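As a rough sketch of what such a query could look like (the schema and column names below, including the period columns, are assumptions rather than the table's actual definition):

-- Hypothetical queries: column names such as [name], [model_object], [valid_from]
-- and [valid_to] are illustrative; use the columns defined in the project's table.
SELECT [name], [model_object], [valid_from], [valid_to]
FROM [ml].[model]
    FOR SYSTEM_TIME AS OF '2019-06-01T00:00:00'
WHERE [name] = 'churn_likelihood';

-- Full change history for the same model.
SELECT [name], [model_object], [valid_from], [valid_to]
FROM [ml].[model]
    FOR SYSTEM_TIME ALL
WHERE [name] = 'churn_likelihood'
ORDER BY [valid_from];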

The table contains columns for storing:

To help people publish models from different capabilities (see Note 2), there is an upsert_model stored procedure that takes the relevant information as inputs and handles inserting the model if a new name is provided, or updating the existing model if the name already exists.
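For illustration, a call might look something like the following; the parameter names are assumptions only, so check the procedure definition in the project for the actual signature:

-- Hypothetical call: the parameter names are assumptions, not the real signature.
DECLARE @serialised_model varbinary(max);  -- e.g. a model serialised by an R or Python training run

EXEC [ml].[upsert_model]
    @name        = 'churn_likelihood',
    @model       = @serialised_model,
    @description = 'Churn model retrained on the latest data';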

[prediction]

Store the results of any prediction request here with information about how the result was derived

The [prediction] table is designed as a flexible log store for the results of prediction workloads. You can also write predictions directly to other tables instead of, or as well as, this table.

The table contains columns for storing:

To make prediction logging easier to implement, there is an insert_prediction stored procedure that can be called from your own prediction stored procedures.
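A hypothetical call is sketched below; the parameter names are illustrative only and the actual signature should be taken from the procedure definition:

-- Hypothetical call: parameter names are illustrative only.
EXEC [ml].[insert_prediction]
    @model_name = 'churn_likelihood',
    @entity_id  = 42,      -- the customer the prediction relates to
    @prediction = 0.83;    -- the predicted churn probability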

[prediction_log]

Track prediction requests

The [prediction_log] table is for storing prediction requests and associating them with any errors or with the batch of predictions that was inserted.

The table contains columns for storing:

This log table should be useful for assessing prediction workload patterns, tying predictions back to requests, and tracking problems with the request process.

To facilitate use of the log table, two stored procedures have been created to ease prediction stored procedure development and reduce code duplication: begin_prediction_log and end_prediction_log top and tail a prediction process.
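As an illustration of that top-and-tail pattern, a prediction stored procedure might wrap its work roughly as follows; all names and parameters here are assumptions, so refer to the procedure definitions for the actual signatures:

-- Hypothetical wrapper: names and parameters are illustrative only.
DECLARE @log_id bigint;

-- Record that a prediction request has started.
EXEC [ml].[begin_prediction_log]
    @model_name = 'churn_likelihood',
    @log_id     = @log_id OUTPUT;

-- Run the prediction workload and insert the resulting predictions here,
-- e.g. via the insert_prediction procedure described above.

-- Record that the request has finished, tying the predictions back to this log entry.
EXEC [ml].[end_prediction_log]
    @log_id = @log_id;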

General

Compatibility

As this framework is intended to surround the ML Services components, you can only deploy it to SQL Server 2016 and later. Because the feature set has grown across releases (for example, 2016 supports R only, while 2017 adds Python), the version you have determines the variety of workloads and languages you will be able to deploy.

Installing R packages

To run the samples and similar stored procedures, any R dependencies need to be installed. One of the easiest ways is with the sqlmlutils package.

Command line

git clone git@github.com:microsoft/sqlmlutils.git

Then open RStudio

R

install.packages("remotes")
remotes::install_local("sqlmlutils/R/")
library(sqlmlutils)
connection <- connectionInfo(
  server= "rinsqldb.database.windows.net",
  database = "rinsqldb",
  uid = "",
  pwd = "")

sql_install.packages(connectionString = connection, 
    pkgs =c("tidyverse","recipes","sessioninfo","rsample","broom","jsonlite"), 
    verbose = TRUE, scope = "PUBLIC")

Contributing

Thanks for reading this section! Hopefully you're interested in contributing to this project. Whether you're a SQL person or a data science person, this is a chance to pick up or deepen skills in these areas. We've put together a bunch of help wanted issues that you could pick up to learn about unit testing, stored procedure writing, model development and more.

If you're ready to go and you'd like to get started, hit the Fork button now.

If you'd like to help but aren't quite sure how to get started you can also chat to us on Slack and we'll mentor you through the process.

Notes

  1. GDPR requires us to be able to tell people how we arrived at a decision, so we need to be able to retrieve models and values from different points in time.
  2. From experience, we recommend using the ODBC drivers for interacting with SQL Server from R rather than the Windows native SQL Server driver, as varbinary representations of models can have problems being transported to the database with that driver.
  3. I have structured things to work on the basis that either single predictions or batches of predictions will be made with consecutive IDs. This means that if you do batches of, say, 1,000 customers at a time, I'm expecting IDs 1-1000 first, then 1001-2000, and so on, rather than some other selection mechanism for batching customers.
  4. All interactions in the project are governed by our Code of Conduct and work is contributed under the MIT license.