munichpavel closed 2 years ago
Risk management! See SRE book.
For us, the "application" is a repository with slides and example notebooks that should just work. This means a notebook that does not run given the code in the versioned repository is a bug.
There is no such thing as robust, fully automated testing: you have to understand what the application is meant to do, and getting 100% test coverage is not enough.
The green check above was a false positive: our "application" includes notebooks that should just work, and that was not the case.
--> Add a test that the notebooks actually run without error.
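A minimal sketch of such a test, assuming the notebooks are executed with `jupyter nbconvert --execute` through `subprocess`. The helper names (`find_notebooks`, `nbconvert_command`, `run_notebook`) are hypothetical and not taken from the repository, and this is not necessarily how the notebooks-run test here is implemented.

```python
import subprocess
from pathlib import Path


def find_notebooks(root):
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        path
        for path in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


def nbconvert_command(notebook, output_dir):
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def run_notebook(notebook, output_dir):
    """Execute a notebook; nbconvert exits non-zero if any cell raises."""
    return subprocess.run(
        nbconvert_command(notebook, output_dir),
        capture_output=True,
        text=True,
    )
```

A test runner could then loop over `find_notebooks("notebooks")`, call `run_notebook` with a temporary output directory, and assert that `returncode` is 0, printing `stderr` on failure so that import errors like the ones below show up in CI.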
from running notebooks-run test
```console
      9 from graphviz import Digraph, Graph
     11 # The below are packages I used to solve these exercises, but may be safely removed
---> 12 from causalgraphicalmodels import CausalGraphicalModel
     13 from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork

ModuleNotFoundError: No module named 'causalgraphicalmodels'
```

```console
import os
from pathlib import Path
import numpy as np
import pandas as pd
# Only needed to generate graphs, may be safely ommitted
# once you comment out relevant cells below
from graphviz import Digraph, Graph
# The below are packages I used to solve these exercises, but may be safely removed
from causalgraphicalmodels import CausalGraphicalModel
from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork
------------------
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [1], in <cell line: 13>()
     11 # The below are packages I used to solve these exercises, but may be safely removed
     12 from causalgraphicalmodels import CausalGraphicalModel
---> 13 from fake_data_for_learning import BayesianNodeRV, SampleValue, FakeDataBayesianNetwork

ImportError: cannot import name 'BayesianNodeRV' from 'fake_data_for_learning' (/Users/pauldev/.virtualenvs/rl/lib/python3.9/site-packages/fake_data_for_learning/__init__.py)
```
Risk management: can make changes / improvements with confidence,
e.g. removing unneeded dependencies that slow down the build.
Psychology aspect: keep your amygdala calm, and your creative juices flowing.
Adapting https://dvc.org/doc/start/data-pipelines
Also have TW code from
One attempt along the way ...
dvc run -n prepare \
-p prepare.seed,prepare.split \
-d model-selection/prepare.py -d notebooks/data/default.csv \
-o notebooks/data/prepared \
python model-selection/prepare.py notebooks/data/default.csv
Surprise, surprise, bits missing that are already there in the example repo ... like
> model-selection/prepare.py notebooks/data/default.csv
/bin/bash: model-selection/prepare.py: Permission denied
ERROR: failed to run: model-selection/prepare.py notebooks/data/default.csv, exited with 126
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ cd model-selection/
(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ ls -al
total 0
drwxr-xr-x 3 pauldev staff 96 Mar 19 16:28 .
drwxr-xr-x 33 pauldev staff 1056 Mar 19 16:31 ..
-rw-r--r-- 1 pauldev staff 0 Mar 19 16:23 prepare.py
(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ chmod u+x prepare.py
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ dvc run -n prepare -p prepare.seed,prepare.split -d model-selection/prepare.py -d notebooks/data/default.csv -o data/prepared model-selection/prepare.py notebooks/data/default.csv
WARNING: 'model-selection/prepare.py' is empty.
WARNING: 'model-selection/prepare.py' is empty.
Running stage 'prepare':
> model-selection/prepare.py notebooks/data/default.csv
WARNING: 'model-selection/prepare.py' is empty.
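Both failures in the transcript above (exit code 126 with "Permission denied", then the "is empty" warning) could be caught before calling dvc. A minimal sketch, assuming a hypothetical helper `check_stage_script` that is not part of DVC or of this repository:

```python
import os
from pathlib import Path


def check_stage_script(path, run_directly=True):
    """Return a list of problems that would make a stage command fail or do nothing."""
    script = Path(path)
    if not script.is_file():
        return [f"{script} does not exist"]
    problems = []
    if script.stat().st_size == 0:
        problems.append(f"{script} is empty")
    # The executable bit only matters when the script itself is the command,
    # i.e. when it is not passed as an argument to `python`.
    if run_directly and not os.access(script, os.X_OK):
        problems.append(f"{script} is not executable (try chmod u+x)")
    return problems


if __name__ == "__main__":
    for problem in check_stage_script("model-selection/prepare.py"):
        print(problem)
```

Prefixing the stage command with `python`, as in the first attempt, sidesteps the executable bit altogether; `run_directly=False` models that case.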
From the `model-selection` directory, as I think the params file needs to be in the same directory as the one where `dvc run` is called from ...
dvc run -f -n prepare \
-p prepare.seed,prepare.train_split,prepare.test_split,prepare.target_col \
-d prepare.py -d ../notebooks/data/default.csv \
-o ../notebooks/data/prepared \
python prepare.py ../notebooks/data/default.csv
Once I set the `random_state` param in `train_test_split`, I got
(rl) (base) Pauls-MacBook-Air:model-selection pauldev$ dvc repro
'../notebooks/data/default.csv.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Data and pipelines are up to date.
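Independently of how DVC decides to skip a stage, a fixed seed is what makes rerunning the stage give the same split. A standard-library sketch of the idea; the pipeline itself uses scikit-learn's `train_test_split` with `random_state`, and `split_rows` below is a hypothetical stand-in for it:

```python
import random


def split_rows(rows, test_fraction, seed):
    """Shuffle rows with a dedicated seeded generator and split them in two."""
    rng = random.Random(seed)  # local generator, global random state untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train_a, test_a = split_rows(rows, test_fraction=0.25, seed=42)
train_b, test_b = split_rows(rows, test_fraction=0.25, seed=42)

# Same seed, same input: both runs produce the same split.
print(train_a == train_b and test_a == test_b)  # True
```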
There is a chicken-and-egg problem with (local) folders for outputs: I create them in the Python script being called, but it seems they have to exist before `dvc run`:
dvc run -f -n train \
-p prepare.seed \
-d train -d ../notebooks/data/prepared \
-o ../notebooks/data/train \
python train.py
ERROR: unexpected error - [Errno 2] No such file or directory: '/Users/pauldev/delo/projects/risk-ai-workshop/model-selection/train'
NOPE, it is actually the dvc-required folder that needs to exist before doing the run (???); see the path in the error.
After `dvc config core.autostage true`, I now see them. Maybe a coincidence.
I think the reason the pipeline outputs do not appear on disk is that I have not set up a remote, and dvc likely (sensibly) does not guess for you or default to the local file system.
NOPE:
(rl) (base) Pauls-MacBook-Air:model_selection pauldev$ dvc remote add -d $PROJECT_ROOT/notebooks/data
ERROR: the following arguments are required: url
usage: dvc remote add [-h] [--global | --system | --project | --local] [-q | -v] [-d] [-f]
name url
Add a new data remote.
Documentation: <https://man.dvc.org/remote/add>
positional arguments:
name Name of the remote
url Remote location. See full list of supported URLs at
<https://man.dvc.org/remote>
optional arguments:
-h, --help show this help message and exit
--global Use global config.
--system Use system config.
--project Use project config (.dvc/config).
--local Use local config (.dvc/config.local).
-q, --quiet Be quiet.
-v, --verbose Be verbose.
-d, --default Set as default remote.
-f, --force Force overwriting existing configs
GOTCHA
From https://dvc.org/doc/command-reference/run
Outputs are deleted from the workspace before executing the command (including at dvc repro) if their paths are found as existing files/directories (unless --outs-persist is used). This also means that the stage command needs to recreate any directory structures defined as outputs every time its executed by DVC.
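In practice the quoted passage means the stage script cannot assume its output directory exists. A minimal sketch of a script-side guard; `write_output` is a hypothetical helper, not code from this repository:

```python
import json
from pathlib import Path


def write_output(out_dir, filename, payload):
    """Recreate the output directory if needed, then write payload as JSON."""
    out_path = Path(out_dir)
    # DVC may have deleted this directory before running the stage,
    # so create it (and any missing parents) on every run.
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / filename
    target.write_text(json.dumps(payload, indent=2))
    return target
```

With `exist_ok=True` the call is harmless when the directory is already there, so the same script works both inside and outside of dvc.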
Not-GOTCHA!!!
I was writing an empty `metrics.json` due to (ab)use of Python generators; see 7bc06b98c8d43373a3bf07335c6d6815bf193e01
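The commit is not reproduced here. As a hypothetical illustration of one common way a generator leads to an empty output file: a generator can only be consumed once, so a second pass over it sees nothing. The function name `metric_items` is invented for this sketch.

```python
import json


def metric_items(scores):
    """Yield (name, value) pairs lazily."""
    for name, value in scores.items():
        yield name, value


scores = {"avg_precision": 0.8248723078078876, "mean_male_score": 0.6862469945956717}

items = metric_items(scores)
n_metrics = len(list(items))          # first pass consumes the generator
empty_json = json.dumps(dict(items))  # second pass sees nothing

# Fix: materialize once, then reuse the concrete dict.
metrics = dict(metric_items(scores))
good_json = json.dumps(metrics)

print(empty_json)  # {}
print(good_json)
```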
AND ONE MORE THING
Kind of a gotcha: I had the funny experience of having

1. `ls` show nothing in the output directory
2. `cd` to parent and back to child (output dir)
3. `ls` show the output

(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ ls
(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ cd ..
(rl) (base) Pauls-MacBook-Air:data pauldev$ cd evaluate_fit_train/
(rl) (base) Pauls-MacBook-Air:evaluate_fit_train pauldev$ ls
metrics_decision_tree_classifier.json metrics_logistic_regression.json
BEST GUESS AT EXPLANATION: https://itectec.com/superuser/linux-how-to-refresh-directory-in-bash/
Your script is most likely removing the directory, and not just the files which are there. So, when you have cd'd into it, and the directory is removed, you do ls on a directory which does not actually exist.
If true, maybe add as a FAQ / troubleshooting to dvc docs???
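The explanation quoted above can be checked without dvc. A sketch, assuming a Unix-like system: hold a handle on the directory (as a shell does for its current directory), remove and recreate the directory, then compare the handle with what the path now points to.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
out_dir = os.path.join(base, "evaluate_fit_train")
os.mkdir(out_dir)

# Keep a handle on the directory, like a shell that has cd'd into it.
handle = os.open(out_dir, os.O_RDONLY)

# A script that removes the directory and writes its outputs afresh.
shutil.rmtree(out_dir)
os.mkdir(out_dir)
with open(os.path.join(out_dir, "metrics_logistic_regression.json"), "w") as f:
    f.write("{}")

# Same path, but is it still the same directory as the one the handle refers to?
same_directory = os.path.samestat(os.fstat(handle), os.stat(out_dir))
new_listing = os.listdir(out_dir)

os.close(handle)
shutil.rmtree(base)
print(same_directory, new_listing)
```

If `same_directory` comes out as `False`, the old handle refers to the removed directory while the path refers to a new one, which would match the transcript: `cd ..` followed by `cd evaluate_fit_train/` resolves the path again and lands in the new directory.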
As part of my attempts to figure out why I had no pipeline outputs on disk, I removed the default data from normal version control and put it instead under dvc's remit. This broke my existing test setup (FIXME, change ci test setup to not need default data in version control???).
To undo this swap of ownership:
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ dvc remove notebooks/data/default.csv.dvc
(rl) (base) Pauls-MacBook-Air:risk-ai-workshop pauldev$ git add notebooks/data/default.csv
https://scikit-learn.org/stable/modules/tree.html, also visualization with graphviz.
It seems `dvc metrics show` only supports nesting one level deep (which makes sense, otherwise it would have to guess where in the tree the right key-value pair was), so either

- `dvc metrics show`, OR
- do multiple family / hyperparameter combinations per run and
  - `dvc metrics show` functionality, or
  - `dvc metrics show <target-path>`, like

(rl) (base) Pauls-MacBook-Air:model_selection pauldev$ dvc metrics show /Users/pauldev/delo/projects/risk-ai-workshop/notebooks/data/evaluate_fit_train/metrics_decision_tree_classifier.json
Path avg_prec
../../notebooks/data/evaluate_fit_train/metrics_decision_tree_classifier.json 0.45737
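One way to keep each metrics file one level deep when a run covers several model families is to write one flat JSON file per model, following the `metrics_<model>.json` naming seen above. A minimal sketch; `write_flat_metrics` is a hypothetical helper, and the depth limit itself is only my reading of the behaviour, not something confirmed from the DVC docs.

```python
import json
from pathlib import Path


def write_flat_metrics(results, out_dir):
    """Write one flat JSON file per model from {model_name: {metric: value}}."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for model_name, metrics in results.items():
        target = out_path / f"metrics_{model_name}.json"
        target.write_text(json.dumps(metrics, indent=2, sort_keys=True))
        written.append(target)
    return written
```

Each resulting file can then be passed on its own as the target path, whichever of the options above is used.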
Current opinion: not worth the effort. When I git added + committed the metric files and ran `dvc repro`, I got the error
> python evaluate.py --stage_name evaluate_fit_train
gender default
occupation default
activity default
default default
Target column, not a feature. Skipping.
ERROR: failed to reproduce 'evaluate_fit_train': output '../../notebooks/data/evaluate_fit_train' is already tracked by SCM (e.g. Git).
You can remove it from Git, then add to DVC.
To stop tracking from Git:
git rm -r --cached '../../notebooks/data/evaluate_fit_train'
git commit -m "stop tracking ../../notebooks/data/evaluate_fit_train"
Instead, I will just version the metric files with git and use `git diff`:
diff --git a/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json b/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
index 05f925d..c98219d 100644
--- a/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
+++ b/notebooks/data/evaluate_fit_train/metrics_logistic_regression.json
@@ -1,5 +1,5 @@
{
- "avg_precision": 0.8248723078078876,
- "mean_female_score": 0.4965220349225026,
- "mean_male_score": 0.6862469945956717
+ "avg_precision": 0.8310431316123863,
+ "mean_female_score": 0.49637093232244045,
+ "mean_male_score": 0.6888864888752284
}
Done at commit 8074b2d9398a199eda380f81a6d2f669aad5a432
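A raw diff of long floats is hard to read at a glance. A small sketch that turns two versions of a metrics file into per-metric deltas, using the values from the diff above; `metric_deltas` is a hypothetical helper, not part of the repository.

```python
def metric_deltas(old, new):
    """Return {metric: new - old} for metrics present in both versions."""
    return {key: new[key] - old[key] for key in old if key in new}


old_metrics = {
    "avg_precision": 0.8248723078078876,
    "mean_female_score": 0.4965220349225026,
    "mean_male_score": 0.6862469945956717,
}
new_metrics = {
    "avg_precision": 0.8310431316123863,
    "mean_female_score": 0.49637093232244045,
    "mean_male_score": 0.6888864888752284,
}

deltas = metric_deltas(old_metrics, new_metrics)
for name, delta in deltas.items():
    # Signed, fixed-width output makes improvements and regressions easy to spot.
    print(f"{name}: {delta:+.4f}")
```

The two dicts could be loaded with `json.loads` from the working copy and from the output of `git show <commit>:<path>` for the older version.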
Pipeline TODOs
Desiderata
Resources