Pipeline TODOs

[x] data created out of repo, versioned somehow
[x] split data into train, validate, test
[x] train model, persist model artifacts
[x] evaluate model, persist (and version) metrics; iterate, e.g. a la https://github.com/rasbt/machine-learning-book/tree/main/ch06 diagram
[x] choose best model (non-automated)
[x] train best model on train+validate data
[x] evaluate on hold out set for generalization error (confusion matrix? -> business impact?)
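A minimal sketch of the last step, going from a confusion matrix on the hold-out set to a rough business-impact number. It assumes binary labels with 1 meaning default, and the unit costs in COST_PER_OUTCOME are made up for illustration; they are not taken from the workshop data.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count confusion-matrix cells for binary labels (assumes 0/1 labels, 1 = default)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 1 and predicted == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "fn", "tn")}


# Hypothetical unit costs: a missed default (fn) hurts more than a wrongly
# rejected customer (fp). Replace with numbers the business actually agrees on.
COST_PER_OUTCOME = {"tp": 0.0, "fp": 1.0, "fn": 5.0, "tn": 0.0}


def business_cost(counts, cost_per_outcome=COST_PER_OUTCOME):
    """Weight each confusion-matrix cell by its unit cost and sum."""
    return sum(counts[key] * cost_per_outcome[key] for key in counts)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
total_cost = business_cost(counts)
```

The point of the cost table is that two models with the same accuracy can have very different business impact once false negatives and false positives are priced differently.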

Desiderata

[x] Deterministic, e.g. fixed seeds, example of bank regulator and market risk scenario simulations
[x] Transparent parametrization
[x] As automated as makes sense
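A rough sketch of what "deterministic" and "transparent parametrization" can look like together, using only the standard library. The PARAMS dict is hypothetical; its key names just mirror the prepare.* parameter names passed to dvc run later in these notes. This is not the workshop's prepare script.

```python
import random

# Hypothetical parameter block: everything that influences the split lives here,
# not scattered through the code.
PARAMS = {"seed": 42, "train_split": 0.6, "test_split": 0.2}


def split_indices(n_rows, seed, train_split, test_split):
    """Shuffle row indices with a fixed seed and cut them into train/validate/test."""
    indices = list(range(n_rows))
    # A private Random instance keeps the result independent of global random state.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_split)
    n_test = round(n_rows * test_split)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validate = indices[n_train + n_test:]  # whatever is left over
    return train, validate, test


# Same parameters in, same split out, every time.
first = split_indices(100, **PARAMS)
second = split_indices(100, **PARAMS)
```

Running the split twice with the same parameters gives identical index lists, which is what a regulator-style rerun of a scenario simulation would need.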

Resources

https://christophergs.com/python/2020/04/12/python-tox-why-use-it-and-tutorial/

Automated testing

Why test?

Risk management! See SRE book.

Dopamine hits

pre-ci-badge

post-ci-badge

Testing what?

For us, the "application" is a repository with slides and example notebooks that should just work. This means a notebook that does not run given the code in the versioned repository is a bug.

Beware false positives and confidence

There is no such thing as robust, fully automated testing: you have to understand what the application is for, and 100% test coverage is not enough.

The green check above was a false positive: our "application" includes notebooks that should just work, and that was not the case.

--> Add test that notebooks actually run without error.
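One way such a test could look, as a sketch rather than the test actually used in the repo. It assumes the jupyter nbconvert command-line tool is installed and that the notebooks live in a notebooks directory; the function names are mine.

```python
import subprocess
import tempfile
from pathlib import Path


def find_notebooks(root):
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        path
        for path in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


def nbconvert_command(notebook, output_dir):
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def failing_notebooks(root):
    """Execute every notebook and collect (path, stderr) for those that fail."""
    failures = []
    with tempfile.TemporaryDirectory() as output_dir:
        for notebook in find_notebooks(root):
            result = subprocess.run(
                nbconvert_command(notebook, output_dir),
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                failures.append((notebook, result.stderr))
    return failures


def test_notebooks_run():
    """pytest-style test: a notebook that does not run is a bug."""
    assert failing_notebooks("notebooks") == []
```

Executed copies go to a temporary directory so the versioned notebooks are not overwritten by the test run.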

true-negative-notebook-tests-fail

Examples

from running notebooks-run test

```console
      9 from graphviz import Digraph, Graph
     11 # The below are packages I used to solve these exercises, but may be safely removed
---> 12 from causalgraphicalmodels import CausalGraphicalModel
     13 from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork

ModuleNotFoundError: No module named 'causalgraphicalmodels'
ModuleNotFoundError: No module named 'causalgraphicalmodels'
```


```console
import os
from pathlib import Path

import numpy as np
import pandas as pd

# Only needed to generate graphs, may be safely ommitted
# once you comment out relevant cells below
from graphviz import Digraph, Graph

# The below are packages I used to solve these exercises, but may be safely removed
from causalgraphicalmodels import CausalGraphicalModel
from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork
------------------

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [1], in <cell line: 13>()
     11 # The below are packages I used to solve these exercises, but may be safely removed
     12 from causalgraphicalmodels import CausalGraphicalModel
---> 13 from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork

ImportError: cannot import name 'BayesianNodeRV' from 'fake_data_for_learning' (/Users/pauldev/.virtualenvs/rl/lib/python3.9/site-packages/fake_data_for_learning/__init__.py)
ImportError: cannot import name 'BayesianNodeRV' from 'fake_data_for_learning' (/Users/pauldev/.virtualenvs/rl/lib/python3.9/site-packages/fake_data_for_learning/__init__.py)
```

Other benefits

Risk management: can make changes / improvements with confidence.

e.g. removing unneeded dependencies that slow down build
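To find candidates for removal, it helps to know what the code actually imports. A minimal sketch using the standard library ast module; it takes Python source as a string, so for notebooks the cell sources would first have to be pulled out of the .ipynb JSON, which is not shown here.

```python
import ast


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            modules.add(node.module.split(".")[0])
    return modules


cell = """
import os
from pathlib import Path
import numpy as np
from graphviz import Digraph, Graph
from causalgraphicalmodels import CausalGraphicalModel
"""

used = imported_modules(cell)
```

Anything in the requirements that never shows up in this set is a candidate for removal, and the notebooks-run test says whether removing it broke something.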

Psychology aspect: keep your amygdala calm, and your creative juices flowing.

With DVC

Adapting https://dvc.org/doc/start/data-pipelines

Also have TW code from

One attempt along the way ...

dvc run -n prepare \
          -p prepare.seed,prepare.split \
          -d model-selection/prepare.py -d notebooks/data/default.csv\
          -o notebooks/data/prepared \
          python model-selection/prepare.py notebooks/data/default.csv

Surprise, surprise, bits missing that are already there in the example repo ... like

> model-selection/prepare.py notebooks/data/default.csv
/bin/bash: model-selection/prepare.py: Permission denied
ERROR: failed to run: model-selection/prepare.py notebooks/data/default.csv, exited with 126
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ cd model-selection/
(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ ls -al
total 0
drwxr-xr-x   3 pauldev  staff    96 Mar 19 16:28 .
drwxr-xr-x  33 pauldev  staff  1056 Mar 19 16:31 ..
-rw-r--r--   1 pauldev  staff     0 Mar 19 16:23 prepare.py
(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ chmod u+x prepare.py

(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ dvc run -n prepare           -p prepare.seed,prepare.split           -d model-selection/prepare.py -d notebooks/data/default.csv          -o data/prepared           model-selection/prepare.py notebooks/data/default.csv
WARNING: 'model-selection/prepare.py' is empty.
WARNING: 'model-selection/prepare.py' is empty.
Running stage 'prepare':
> model-selection/prepare.py notebooks/data/default.csv
WARNING: 'model-selection/prepare.py' is empty.
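Exit code 126 is the shell's way of saying the command was found but could not be executed, which fits the missing execute bit shown by ls -al. A small pre-flight check along these lines would have caught both problems (not executable, empty file) before calling dvc run. The function name is hypothetical, and this is a sketch assuming a POSIX file system.

```python
import stat
import tempfile
from pathlib import Path


def stage_script_problems(script):
    """Return human-readable reasons why running `script` directly would fail."""
    path = Path(script)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    info = path.stat()
    if info.st_size == 0:
        problems.append(f"{path} is empty")
    if not info.st_mode & stat.S_IXUSR:
        problems.append(f"{path} is not executable by its owner (try chmod u+x)")
    return problems


# Reproduce the situation from the log: a freshly touched, empty prepare.py.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "prepare.py"
    script.touch()
    problems_before = stage_script_problems(script)

    # Give it some content and the owner execute bit, then check again.
    script.write_text("print('prepare')\n")
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    problems_after = stage_script_problems(script)
```

Calling the script as python prepare.py, as in the first dvc run attempt, sidesteps the execute bit entirely, but not the empty file.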

Getting it to work

Run from the model-selection directory, as I think the params file needs to be in the directory that dvc run is called from ...

prepare

dvc run -f -n prepare \
    -p prepare.seed,prepare.train_split,prepare.test_split,prepare.target_col \
    -d prepare.py -d ../notebooks/data/default.csv\
    -o ../notebooks/data/prepared \
    python prepare.py ../notebooks/data/default.csv

Once I set the random_state param in train_test_split, I got

(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ dvc repro
'../notebooks/data/default.csv.dvc' didn't change, skipping           
Stage 'prepare' didn't change, skipping
Data and pipelines are up to date.

There is a chicken-and-egg problem with (local) folders for outputs. I create them in the Python script being called, but it seems they have to exist before dvc run:

dvc run -f -n train \
    -p prepare.seed \
    -d train -d ../notebooks/data/prepared \
    -o ../notebooks/data/train \
    python train.py

ERROR: unexpected error - [Errno 2] No such file or directory: '/Users/pauldev/delo/projects/risk-ai-workshop/model-selection/train'

NOPE, it is actually the path that dvc was told to track that needs to exist before the run (here the -d train dependency); see the path in the error.

DX issues

I didn't see the data file outputs from the first stage on the file system, so I couldn't iterate on the 2nd stage. Maybe this was a caching thing??? I just set dvc config core.autostage true and now I see them. Maybe a coincidence.

DVC spike, continued

I think the reason the pipeline outputs do not appear on-disk is because I have not set up a remote, and dvc likely (sensibly) does not guess for you or have a default of local file system.

NOPE:

(rl) (base) Pauls-MacBook-Air:model_selection pauldev$ dvc remote add -d $PROJECT_ROOT/notebooks/data
ERROR: the following arguments are required: url
usage: dvc remote add [-h] [--global | --system | --project | --local] [-q | -v] [-d] [-f]
                      name url

Add a new data remote.
Documentation: <https://man.dvc.org/remote/add>

positional arguments:
  name           Name of the remote
  url            Remote location. See full list of supported URLs at
                 <https://man.dvc.org/remote>

optional arguments:
  -h, --help     show this help message and exit
  --global       Use global config.
  --system       Use system config.
  --project      Use project config (.dvc/config).
  --local        Use local config (.dvc/config.local).
  -q, --quiet    Be quiet.
  -v, --verbose  Be verbose.
  -d, --default  Set as default remote.
  -f, --force    Force overwriting existing configs

GOTCHA

From https://dvc.org/doc/command-reference/run

Outputs are deleted from the workspace before executing the command (including at dvc repro) if their paths are found as existing files/directories (unless --outs-persist is used). This also means that the stage command needs to recreate any directory structures defined as outputs every time its executed by DVC.
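In practice this means the stage script cannot assume its output directory exists. A minimal sketch of a writer that recreates the directory on every run; the function name and the metric value are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(metrics, output_path):
    """Write metrics as JSON, recreating the output directory if it was removed."""
    output_path = Path(output_path)
    # The output directory may have been deleted before the stage command ran,
    # so create it (and any parents) before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(metrics, indent=4))
    return output_path


# Usage: the evaluate_fit_train directory does not exist yet when we write.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "evaluate_fit_train" / "metrics.json"
    written = write_metrics({"avg_precision": 0.83}, target)
    round_trip = json.loads(written.read_text())
```

With exist_ok=True the same call works both on the first run and on reruns where the directory is still there.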

Not-GOTCHA!!!

I was writing an empty metrics.json due to (ab)use of python generators; see 7bc06b98c8d43373a3bf07335c6d6815bf193e01
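For the record, one common way a generator leads to an empty JSON file is consuming it twice: the second pass sees nothing. The sketch below shows that failure mode in isolation. It is an illustration of the general pitfall, not a reconstruction of what the commit above fixed, and the metric values are made up.

```python
import json


def score_rows(rows):
    """Yield (name, score) pairs lazily; a generator can be consumed only once."""
    for name, value in rows:
        yield name, value


rows = [("avg_precision", 0.82), ("mean_male_score", 0.69)]

# Buggy pattern: the first pass (e.g. for logging) drains the generator ...
scores = score_rows(rows)
logged = list(scores)
# ... so the second pass has nothing left, and the metrics file would be "{}".
empty_metrics = json.dumps(dict(scores))

# Fix: materialize once, then reuse the list as often as needed.
scores = list(score_rows(rows))
full_metrics = json.dumps(dict(scores))
```

No exception is raised anywhere, which is what makes this easy to mistake for a tooling problem.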

ONE MORE THING

Kind of a gotcha: I had the following funny experience.

Have one terminal open in an output directory.
Run a data pipeline whose final stage outputs to that directory.
ls shows nothing in the output directory.
cd to the parent and back to the child (output dir).
ls shows the output.

(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ ls
(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ cd ..
(rl) (base) Pauls-MacBook-Air:data pauldev$ cd evaluate_fit_train/
(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ ls
metrics_decision_tree_classifier.json   metrics_logistic_regression.json

BEST GUESS AT EXPLANATION: https://itectec.com/superuser/linux-how-to-refresh-directory-in-bash/

Your script is most likely removing the directory, and not just the files which are there. So, when you have cd'd into it, and the directory is removed, you do ls on a directory which does not actually exist.

If true, maybe add as a FAQ / troubleshooting to dvc docs???
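The guess can be checked without DVC. The sketch below holds an open handle on a directory (standing in for the shell that has cd'd into it), removes and recreates the directory, and then lists it both through the old handle and through the path. It assumes a POSIX system where os.listdir accepts a file descriptor, and the directory and file names are only illustrative.

```python
import os
import shutil
import tempfile
from pathlib import Path


def list_after_recreate():
    """Return (listing via stale handle, listing via path) after a dir is recreated."""
    root = Path(tempfile.mkdtemp())
    out = root / "evaluate_fit_train"
    out.mkdir()
    # Stand-in for a terminal sitting inside the output directory.
    handle = os.open(out, os.O_RDONLY)
    try:
        # What the pipeline run is suspected to do: delete the output
        # directory, recreate it, and write results into the new one.
        shutil.rmtree(out)
        out.mkdir()
        (out / "metrics.json").write_text("{}")
        try:
            stale = os.listdir(handle)  # still refers to the deleted directory
        except OSError:
            stale = []  # some systems report an error instead of an empty listing
        fresh = os.listdir(out)  # resolving the path again finds the new directory
    finally:
        os.close(handle)
        shutil.rmtree(root)
    return stale, fresh


stale, fresh = list_after_recreate()
```

The old handle sees nothing while the path sees the new file, which matches the ls, cd .., cd back, ls sequence above.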

dvc add and git add example

As part of my attempts to figure out why I had no pipeline outputs on disk, I removed the default data from normal version control and put it instead under dvc's remit. This broke my existing test setup (FIXME, change ci test setup to not need default data in version control???).

To undo this swap of ownership:

(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ dvc remove notebooks/data/default.csv.dvc
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ git add notebooks/data/default.csv

Model selection

https://scikit-learn.org/stable/modules/tree.html, also visualization with graphviz.

DVC again

It seems dvc metrics show only supports nesting one level deep (which makes sense, otherwise it would have to guess where in the tree the right key-value pair was), so either

do one model family / hyperparameter combination per run per metric file, to keep the high-level dvc metrics show, OR

do multiple family / hyperparameter combinations per run and either

output to the same file, breaking all dvc metrics show functionality, or
output to different metric files and use dvc metrics show <target-path>, like

(rl) (base) Pauls-MacBook-Air:model_selection pauldev$ dvc metrics show /Users/pauldev/delo/projects/risk-ai-workshop/notebooks/data/evaluate_fit_train/metrics_decision_tree_classifier.json
Path                                                                           avg_prec
../../notebooks/data/evaluate_fit_train/metrics_decision_tree_classifier.json  0.45737
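A sketch of the "different metric files" option: one flat JSON file per model, named like the files above. The function name is hypothetical and the metric values in the usage are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_metrics_per_model(metrics_by_model, output_dir):
    """Write one flat metrics_<model>.json file per model and return the paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for model_name, metrics in sorted(metrics_by_model.items()):
        path = output_dir / f"metrics_{model_name}.json"
        # Each file holds a single level of key-value pairs, no nesting.
        path.write_text(json.dumps(metrics, indent=4, sort_keys=True))
        paths.append(path)
    return paths


# Placeholder values, only to show the resulting file names.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_metrics_per_model(
        {
            "logistic_regression": {"avg_precision": 0.8},
            "decision_tree_classifier": {"avg_precision": 0.5},
        },
        Path(tmp) / "evaluate_fit_train",
    )
    file_names = [path.name for path in paths]
```

Sorting the keys and the model names keeps the files byte-stable between runs, so any diff reflects a real change in the numbers.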

DVC metrics

Current opinion: not worth the effort. When I git added + committed the metric files and ran dvc repro, I got the error

> python evaluate.py --stage_name evaluate_fit_train
    gender default
    occupation default
    activity default
    default default
    Target column, not a feature. Skipping.
    ERROR: failed to reproduce 'evaluate_fit_train':  output '../../notebooks/data/evaluate_fit_train' is already tracked by SCM (e.g. Git).
        You can remove it from Git, then add to DVC.
            To stop tracking from Git:
                git rm -r --cached '../../notebooks/data/evaluate_fit_train'
                git commit -m "stop tracking ../../notebooks/data/evaluate_fit_train"

Instead, I will just version the metric files with git and use git diff:

diff --git a/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json b/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
index 05f925d..c98219d 100644
--- a/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
+++ b/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
@@ -1,5 +1,5 @@
 {
-    "avg_precision": 0.8248723078078876,
-    "mean_female_score": 0.4965220349225026,
-    "mean_male_score": 0.6862469945956717
+    "avg_precision": 0.8310431316123863,
+    "mean_female_score": 0.49637093232244045,
+    "mean_male_score": 0.6888864888752284
 }

Done at commit 8074b2d9398a199eda380f81a6d2f669aad5a432
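If reading raw diffs gets tedious, the same comparison is a few lines of Python. A minimal sketch, using the before and after values from the diff above; the function name is mine, and loading the two versions from git is left out.

```python
import json


def metric_deltas(before, after):
    """Return after - before for every metric present in both dicts."""
    return {
        key: after[key] - before[key]
        for key in sorted(before.keys() & after.keys())
    }


# The two versions of metrics_logistic_regression.json from the diff above.
before = json.loads("""{
    "avg_precision": 0.8248723078078876,
    "mean_female_score": 0.4965220349225026,
    "mean_male_score": 0.6862469945956717
}""")
after = json.loads("""{
    "avg_precision": 0.8310431316123863,
    "mean_female_score": 0.49637093232244045,
    "mean_male_score": 0.6888864888752284
}""")

deltas = metric_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.5f}")
```

The sign of each delta makes the direction of the change obvious at a glance, including the widening gap between the male and female mean scores.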

munichpavel / risk-ai-workshop

Examples of pipeline-approach (e.g. with MLOps) vs human-duct-tape approach #8