About: Short document summarising Ian's thoughts on successful ways to ship working, maintainable and understandable data science products, and ways to avoid falling into dark holes of despair. This is based on my experience; your experience may be very different - if so, file a Bug for me on GitHub and give me something to chew on.
Very roughly these are "notes from me to my younger self"; I hope you find some of this useful. I've put this together based on my talks, teaching and coaching, along with feedback from chats at our PyDataLondon monthly meetup.
Put your email in here for updates, I'll only mail about updates to this doc.
By: Ian Ozsvald (http://ianozsvald.com - LinkedIn) of ModelInsight (do get in contact if consulting and coaching might be useful)
License: Creative Commons By Attribution
Location: https://github.com/ianozsvald/data_science_delivered
Aimed at: Existing data scientists, both for those who are engineers and those who are researchers.
Notes on the associated Jupyter Notebooks:
overview of the major stages in a data project (from conception to maintenance)
outline the projects
take a look at the data
determine what's feasible, define milestones, get buy-in
deliver a working prototype
deliver a deployable, maintainable and testable system
support the solution
early stages
sources of data
discovery - what's feasible? what's valuable? how far might this project go?
dirty data
cleaning data is necessary, nothing else works if the data is not clean
cleaning data is an on-going process
low quality data breaks everything
don’t patch over bad data - you'll forget about it and it never gets better; then you end up relying on those hidden assumptions and things go bad
write validators that check for bad data - run them regularly, report exceptions to the specification, treat this as a red flag event and try to get to the source of the problem (and patch up broken data before you forget about it)
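A minimal sketch of the kind of validator described above, using pandas - the column names and rules are made up, adapt them to your own data:

```python
# Minimal data-validator sketch (hypothetical columns and rules) - run it
# regularly and treat any report as a red-flag event.
import pandas as pd

def validate(df):
    """Return a list of human-readable problems found in this DataFrame."""
    problems = []
    if df["user_id"].isnull().any():
        problems.append("user_id contains nulls")
    if (df["age"] < 0).any() or (df["age"] > 120).any():
        problems.append("age outside the expected 0-120 range")
    if not df["country_code"].str.match(r"^[A-Z]{2}$", na=False).all():
        problems.append("country_code is not always a 2-letter upper-case code")
    return problems

if __name__ == "__main__":
    df = pd.read_csv("latest_extract.csv")  # hypothetical input file
    for problem in validate(df):
        print("BAD DATA:", problem)
```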
exploration
delivering data products
How a project can evolve
Get your data (and expect to have to fetch more, it is fine to start with a sensible subset of the possible data - aim for Small Data that fits in RAM at the start to keep your exploration speed high)
Hacking, scripts (knock out lots of ideas)
Reporting results (tell a story, get someone to validate your assumptions, do a code review as a sanity check)
Using source-control (e.g. github.com), don't be slow to keep a history of edits and to sync off-site
Pickles (you'll have partial results you want to store, a pickle is a reasonable way to store early-stage results)
Configuration (make it run on other machines e.g. for colleagues or for deployment or other configurations like test, staging and deployment)
Modules (use __init__.py in folders to group code into logical units)
Packages (make your code distributable using setup.py)
Testing (check unittest and py.test)
py.test will run your unittest tests; use py.test -s to see stdout (which otherwise is hidden unless the tests fail) and py.test -s --pdb to drop into the Python debugger on a failure (so you get dropped into pdb to do live debugging)
Data integrity tests (make sure your data actually contains what you expect - it is easy to skip this step and end up with a solution that lies to you - be very wary of skipping this; see the sketch after this list)
Speed of iteration (it gets slower as the code gets bigger, it is also likely that more bugs slip in which occasionally will really slow you down)
Repeatable processes
Online data (we've gone from a static lump of data to needing to update our models with new data - how do we do this and deploy it within our unique operating constraints?)
Big Data (maybe you really have a Big Data problem - now you've figured out how to solve the problem on a subset of the data you need to figure out how to deploy this in a cluster)
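The data integrity tests mentioned above can be ordinary py.test tests run against a sample of your data; here's a minimal sketch - the file name and the expectations are hypothetical, adapt them to whatever your data is supposed to contain:

```python
# test_data_integrity.py - run with: py.test -s
# Hypothetical checks against a sample extract.
import pandas as pd

def load_sample():
    return pd.read_csv("sample_extract.csv")  # hypothetical sample file

def test_no_duplicate_ids():
    df = load_sample()
    assert not df["user_id"].duplicated().any()

def test_prices_are_positive_and_finite():
    df = load_sample()
    prices = pd.to_numeric(df["price"], errors="coerce")
    assert prices.notnull().all(), "non-numeric or missing prices found"
    assert (prices > 0).all(), "non-positive prices found"
```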
the cost of bad data
we all have bad data, we sweep it under the carpet
bad data should be treated like bad logic - it’ll torpedo your project so it must be fixed
bad data will keep creeping back in from new data sources and from existing data sources and from legacy data sources, if you don't monitor for it then you won't see this (and later it'll just be painful to deal with)
you have to actively monitor it, report on it and fix it
bad data will add a "development tax" to your work, it'll incrementally slow you down and over time it can cause your project to stall when the whole team switches for months just to clean the data so it can be useful again
make it a red-light scenario when your data goes bad and fix it else you’ll keep running slower and slower (just like you spend more time fire-fighting if you don’t bother with a good unit-test and test management process)
O’Reilly’s Bad Data Handbook
Some general notes on cost of bad data:
dealing with disparate or hard-to-access data
dealing with dirty data
what clean data might look like (this is very problem specific!)
use null and 0 to represent missing and zero values
R&D on dirty data
talks on dirty data
dirty text data
cleaning broken text
ftfy to fix bad encodings
poor specification of encoding - requires some checking on your part to make sensible guesses - often text "looks like" UTF8 but might actually be encoded as Windows CP-1252, which writes smart-quotes to disk differently (see the cleaning sketch at the end of this list)
Chromium Compact Language Detector - identifies human language type
HTML entity decoding (Python's unescape does a sensible job for the basic entities)
normalising unicode variants (text, punctuation) e.g. ampersand variants and &, copyright © etc
normalising synonyms
normalising text terms
normalising weights and measures
astropy etc?
no tool to recognise these?
encoding text categories for machine learning
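A minimal text-cleaning sketch pulling together ftfy, HTML unescaping and unicode normalisation as discussed above - the example string is made up:

```python
# Minimal text-cleaning sketch: fix bad encodings, decode HTML entities and
# normalise unicode variants before any further processing.
import unicodedata
from html import unescape

import ftfy

def clean_text(raw):
    text = ftfy.fix_text(raw)                   # repair mojibake / bad encodings
    text = unescape(text)                       # decode &amp; &copy; etc
    text = unicodedata.normalize("NFKC", text)  # fold unicode variants together
    return text.strip()

print(clean_text("CafÃ© &amp; bar â€“ open late"))
```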
dirty numeric data
numeric outliers for normally distributed numbers - checking the values outside of a couple of standard deviations is a sane starting point
checking for outliers in a non-stationary dataset (e.g. a time series of prices) is trickier (READER - what's a good starting point?)
check for text-encoded numbers like NaN and INFINITY, they confuse things when parsed!
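A minimal sketch of those numeric checks - the data is made up; it coerces text-encoded values and flags points a couple of standard deviations from the mean:

```python
# Minimal numeric-sanity sketch: catch text-encoded NaN/INFINITY values and
# flag outliers beyond a couple of standard deviations (a starting point only).
import numpy as np
import pandas as pd

raw = pd.Series(["9.7", "10.1", "10.4", "9.9", "10.2", "9.8",
                 "NaN", "INFINITY", "250.0"])
values = pd.to_numeric(raw, errors="coerce")  # unparseable text becomes NaN

bad = ~np.isfinite(values)                    # catches NaN and +/-inf alike
print("unparseable or non-finite:", raw[bad].tolist())

clean = values[~bad]
z_scores = (clean - clean.mean()) / clean.std()
print("possible outliers:", clean[z_scores.abs() > 2].tolist())
```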
solving problems with data
O'Reilly's notes on Evaluating Machine Learning Models
a rough process for machine learning:
classification
diagnosis tips:
Andrew Ng's notes http://cs229.stanford.edu/materials/ML-advice.pdf
"A Few Useful Things to Know about Machine Learning", Pedro Domingos - really useful paper to clear up some ML assumptions
metrics http://www.win-vector.com/blog/2015/09/willyourmodelworkpart4/
Unboxing the Random Forest Classifier: The Threshold Distributions
Visualise a confusion matrix for the resulting classifications.
Visualise a correlation matrix for your features if you have a small number (e.g. <20 features). You might also visualise the similarities as a force network graph (photo) using NetworkX or Gephi.
Use a Dummy classifier which classifies based on e.g. the most frequent class seen in the training set; this should reflect the underlying distribution of your classes, and any classifier and features you build must outperform it (else they're not exploiting any information that may exist!) - see the sketch after these tips
What classifications are always wrong? Train on your training set and then use either your train or your test set to diagnose which labels it incorrectly predicts (e.g. for a binary classification task take a highly confident wrong-class answer from your test set). What's missing? Poor features? Maybe the model is too simplistic? Maybe you have bad labels?
Which classifications always sit on the decision boundary (e.g. items with a 50/50 probability of being in one of two classes)? Why can't the model confidently move the examples to the right class?
With many training runs you can plot the coefficients inside a classifier like Logistic Regression (using a boxplot per feature) to see the distribution of the weights. This should show if e.g. some of your feature data has a poor distribution (e.g. if it is non-Normal) and how variable the weights can be.
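A minimal sketch of the dummy-baseline and confusion-matrix checks from the tips above - the dataset here is synthetic, not from any real project:

```python
# Minimal sketch: compare a real classifier against a most-frequent-class
# Dummy baseline and inspect the confusion matrix (synthetic data only).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
print(confusion_matrix(y_test, model.predict(X_test)))
```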
regression
diagnosis tips:
consider transforming heavily-skewed target values (e.g. with sqrt or log). With a heavy skew OLS can be skewed in favour of trying to fit the few outliers rather than the body of mostly-not-skewed values. Remember to measure your training error on the un-transformed data (else you can't compare it to the error on the non-transformed previous version) - see the sketch below
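A minimal sketch of that transform-and-invert pattern with a log transform - the data here is synthetic:

```python
# Minimal sketch: fit on a log-transformed skewed target, then invert the
# transform so the error is measured on the un-transformed scale.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = np.expm1(3 * X[:, 0] + rng.normal(scale=0.1, size=200))  # skewed target

model = LinearRegression().fit(X, np.log1p(y))  # train against log1p(y)
predictions = np.expm1(model.predict(X))        # invert back to original units
print("MAE on un-transformed scale:", mean_absolute_error(y, predictions))
```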
feature diagnosis
(Unit) Testing on scikit-learn models
natural language processing
NLTK useful for ngrams and tokenisation
it is faster to tokenise on whitespace yourself
SpaCy looks good
keep features human-readable to ease debugging
tokenising a lower-cased string with whitespace tokenisation and unigrams gets you a long way
for two-class text classification (e.g. spam, "is user of typeX") the following configuration is a sane base-case starting point (this is a Pattern that you might want to follow):
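One reasonable reading of that base case as a scikit-learn pipeline (the exact original configuration isn't reproduced here, and the training data named in the comments is hypothetical): lower-casing, whitespace tokenisation, unigrams, then a simple linear classifier.

```python
# Minimal sketch of the two-class text-classification base case: lower-case,
# whitespace tokenisation, unigrams, then a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vectorise", TfidfVectorizer(lowercase=True,
                                  tokenizer=str.split,   # plain whitespace tokenisation
                                  ngram_range=(1, 1))),  # unigrams only
    ("classify", LogisticRegression()),
])

# Hypothetical usage:
# pipeline.fit(train_texts, train_labels)   # labels e.g. spam / not-spam
# predictions = pipeline.predict(test_texts)
```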
delivering working products
reporting
delivering repeatable read-only datasets
delivering read-only webservices
delivering online data products
KISS - explainable answers probably beat having the best score
deployment options to machines:
deployment technologies:
building a web-based API:
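As an illustration only (not from the original notes): a minimal read-only web API sketch, assuming Flask and a pre-trained pickled model - the file name, route and model interface are hypothetical.

```python
# Minimal read-only scoring API sketch (assumes Flask and a pickled,
# pre-trained model at model.pkl - both are hypothetical here).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score")
def score():
    text = request.args.get("text", "")
    prediction = model.predict([text])[0]   # read-only: no state is modified
    return jsonify({"input": text, "prediction": str(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```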
storing data
add constraints to your datastore whenever possible. The constraints you can express are probably coarser than you'd like (e.g. text-only with N characters, ints-only, Nulls allowed) when you really want something more specific (e.g. lower-case hex-like ASCII UUID strings only, or positive numbers within a certain range for addresses), but some constraints are much better than no constraints. Constraints between key fields in tables are also very sensible.
pickles of dicts are good for persisting python objects - you can keep metadata in the dict along with your data (which might be numpy arrays, dataframes etc) - see the sketch at the end of this list
MySQL uses utf8 by default (so it only encodes the Basic Multilingual Plane and not the Supplementary Planes) and so silently loses data that doesn't look "Western-like" - use utf8mb4 instead
MongoDB
dense and sparse matrices
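A minimal sketch of the pickles-of-dicts idea above - the keys and file name are just illustrative:

```python
# Minimal sketch: persist results plus their metadata in one dict so the
# context travels with the data (keys and file name are illustrative).
import datetime
import pickle

import numpy as np
import pandas as pd

payload = {
    "created_at": datetime.datetime.utcnow().isoformat(),
    "source": "notes on where this data came from",
    "features": np.arange(10),
    "results": pd.DataFrame({"score": [0.1, 0.7, 0.3]}),
}

with open("partial_results.pickle", "wb") as f:
    pickle.dump(payload, f)

with open("partial_results.pickle", "rb") as f:
    restored = pickle.load(f)
print(restored["created_at"], restored["results"].shape)
```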
engineering concerns (how stuff goes wrong)
not logging the right data
not automating your data-edit process for production data (you mustn’t edit this stuff by hand as you’ll have multiple environments over time)
not automating the logging/scraping of data
schemas changing without a good change-management approach
field name changes
datatype changes
not having a sensible schema
mixing "", None, Id(0), "none" and "NOTPRESENT" to all indicate a Null condition in the same dataset is a really bad idea and will lead to confusion - fix on using only one Null value (see the sketch after this list)
having different applications (e.g. MongoDB driven by C# and Python with different Legacy UUIDs) write data in similar but incompatible ways
data lakes are probably a better idea than never-finished perfect schemas
lack of monitoring and lack of acting on the monitored result
write data validator scripts and run them religiously
when data goes wrong, make it a high priority else it’ll poison later work
use testing to limit problems
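A minimal sketch of collapsing several ad-hoc Null markers into a single missing value - the marker list and column name are illustrative:

```python
# Minimal sketch: collapse several ad-hoc "missing" markers into one Null
# value (np.nan) so downstream code only has to handle a single case.
import numpy as np
import pandas as pd

df = pd.DataFrame({"customer": ["", "none", "NOTPRESENT", "Alice", None]})
NULL_MARKERS = ["", "none", "NOTPRESENT"]   # illustrative marker list

df["customer"] = df["customer"].replace(NULL_MARKERS, np.nan)
print(df["customer"].isnull().sum(), "missing values after normalisation")
```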
coding sensible practices
avoid magic numbers - define constants like OFFSET=53.9 and use OFFSET throughout your code (rather than a raw 53.9). This is especially useful if you have two different magic numbers that mean different things (e.g. 10 used in two different contexts); having a spelt-out variable makes the intent much clearer when you return to this code months later.
useful python tools
anaconda environments
ipython - interactive Python shell
jupyter notebook (and consider %connect_info - sort-of example - to connect a Notebook's kernel with a console interface)
argparse and os.environ to get default configurations from the environment (see the sketch after this list)
ipython_memory_usage - diagnose line-by-line RAM usage in an IPython session
jq - excellent JSON parser
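A minimal sketch of the argparse-plus-environment pattern above - the environment variable and flag names are hypothetical:

```python
# Minimal sketch: take a default from the environment, let the command line
# override it (environment variable and flag names are hypothetical).
import argparse
import os

parser = argparse.ArgumentParser(description="example configuration loading")
parser.add_argument("--db-url",
                    default=os.environ.get("APP_DB_URL", "sqlite:///local.db"),
                    help="database URL (defaults to $APP_DB_URL if set)")
args = parser.parse_args()
print("using database:", args.db_url)
```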
IPython Notebook working practices
the IPython shell (ipython) is great for rapid R&D, typically I use an IPython shell and a VIM terminal as my only editor (VIM tends to exist on all platforms and over the wire and I like being old-skool)
run %qtconsole (e.g. %qtconsole --style=linux for unix colouring) and it'll open a QTConsole in a new window which shares the same kernel, so you can prototype (and get tooltips and tab completions) in the shell, then copy/paste the useful lines into your Notebook. Plots will work in both the shell and the Notebook - see details for QTConsole
use %matplotlib inline to get static inline plots in the Notebook (as PNGs) or %matplotlib notebook to get interactive graphics (e.g. tooltips, zooms), but note that each interactive graphic holds a heavyweight object; by default the Notebook limits you to 20 (IIRC) plots before it gives you warnings
useful data cleaning tools
higher performance python
numba compiles numpy operations (particularly for-loops on numpy arrays) using LLVM, this is a very sane first thing to try
multiprocessing lets you parallelise stuff on a single machine - the exact same code will run on your laptop which eases your dev/debug cycle (see the sketch below)
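A minimal multiprocessing sketch of the single-machine parallelisation mentioned above - the work function is just a placeholder:

```python
# Minimal sketch: parallelise an embarrassingly-parallel job across local
# cores with multiprocessing.Pool (the work function is a placeholder).
from multiprocessing import Pool

def expensive_work(x):
    return sum(i * i for i in range(x))   # stand-in for real per-item work

if __name__ == "__main__":                # required guard on Windows
    jobs = [100_000, 200_000, 300_000, 400_000]
    with Pool() as pool:                  # defaults to one worker per core
        results = pool.map(expensive_work, jobs)
    print(results)
```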
Building data teams
The business side of delivering data science products