Initial comments from @jphall663

Initial comments from @jphall663 (received by email). I may break these down into several GitHub issues later, but I'm posting them here now to make sure proper credit is given.

Here are some comments and some materials:

"Categorical variables: some modeling tools require transformation to numeric (e.g. one-hot encoding)" ... this is something to think about - should you use software (h2o) or models (e.g. Decision Tree) that work well with categorical data natively? ... my instinct is probably yes in most business scenarios.
For unbiased evaluation I always like to point out you can over train on test data as well ... see what happens when a kaggle competition ends and people fall way down the leader board when their model is scored on a new test set ... i think this happens because people inadvertently inject too much human intelligence into their model which is not available when the model sees truly new data. For a more formal description see: https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html
Under model deployment: mention docker? mention real-time updates/online models? Also I would say some tasks are better for REST and some are better for batch. You want to know what coupons to mail this week - that's batch and in-DB, typically apps/applications are what need REST. So I personally would phrase it as "think about your scoring paradigm before you get started."
What about business rules? Very common to have business rules on top of predictions ... to prevent sending prescription discounts to dead people etc.
In evaluate and monitor, people need to be realistic about how many models they have. If you have one or two - this can be a manual process or using some home-baked code etc. But what if you have thousands of models? You can't realistically use git to manage AND monitor analytically. Things get really crazy when you have hundreds of people, using/changing hundreds of models, on hundreds of different data sets ... I personally feel that people always underestimate this part of the project ... Because there needs to be detailed admin work done here just like on the DB side AND the models have to be monitored analytically, not just for code changes. I hear "we will just use git" all the time ...
"ML needs to be "sold" to the business side" - personally I think it's better when ML is requested and supported by the business, i.e. driven from the business side.
I think another thing to point out is there are 2 different kinds of companies who have this down to a science - one is of course the big web companies, but the other group is banks and insurers - and what you are describing may work there, but they have all these proprietary systems [omitted] set up to help with all the regulation, documentation, model management/monitoring needs. Especially in the risk domain, things are just waaay different than in web company X. I do think they all want to move toward using more open source though ...

Here are some things I might also reference:

This booklet is about doing ML in BIG companies ... current best practices - nothing sexy: http://www.oreilly.com/data/free/the-evolution-of-analytics.csp [omitted]
Evaluating Machine Learning Models: http://www.oreilly.com/data/free/evaluating-machine-learning-models.csp
"11 Clever Methods of Overfitting and how to avoid them" - http://www.kdnuggets.com/2015/01/clever-methods-overfitting-avoid.html
Another one by me where I rant about people obsessing over test error instead of measuring business impact: https://www.oreilly.com/ideas/the-preoccupation-with-test-error-in-applied-machine-learning

@jphall663 Re your comments:

"Categorical variables: some modeling tools require transformation to numeric 
(e.g. one-hot encoding)" ... this is something to think about - should you use
 software (h2o) or models (e.g. Decision Tree) that work well with categorical 
data natively? ... my instinct is probably yes in most business scenarios.

+1. It's less code for the modeler to maintain, and the tool devs can make many optimizations based on that info. The next best option is when the modeler does the one-hot encoding but the tool supports a sparse matrix representation. The worst case is when it does not and you have to use a dense representation. That might cause a 10x slowdown in training and a 10x larger RAM footprint.
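
To make the sparse vs dense point concrete, here is a minimal pure-Python sketch. It is not any particular library's API: the function names and the toy column are made up for illustration. It just counts how many values each representation has to store for one categorical column.

```python
def one_hot_dense(values):
    """Dense one-hot: one full row of 0/1 flags per record."""
    categories = sorted(set(values))
    index = {cat: j for j, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def one_hot_sparse(values):
    """Sparse one-hot: store only the column index of the single 1 per record."""
    categories = sorted(set(values))
    index = {cat: j for j, cat in enumerate(categories)}
    return categories, [index[v] for v in values]


# Hypothetical column: 1,000 records spread over 50 distinct categories.
column = [f"cat_{i % 50}" for i in range(1000)]

_, dense_rows = one_hot_dense(column)
_, sparse_cols = one_hot_sparse(column)

dense_cells = sum(len(row) for row in dense_rows)  # records x categories
sparse_cells = len(sparse_cols)                    # one entry per record
print(dense_cells, sparse_cells, dense_cells // sparse_cells)
```

The gap in stored values grows with the number of categories, so the actual slowdown and RAM overhead will depend on the data and on the tool.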

For unbiased evaluation I always like to point out you can over train on test data
 as well ... see what happens when a kaggle competition ends and people fall 
way down the leader board when their model is scored on a new test set ... 
i think this happens because people inadvertently inject too much human intelligence 
into their model which is not available when the model sees truly new data. For a 
more formal description see: https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html

Yep.
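
A small simulation can illustrate the effect. This is only a sketch under strong assumptions: the labels are pure coin flips and every "submission" is a random guesser, so nobody can truly beat 50%. It is not the reusable holdout mechanism from the linked post, just a demonstration of why picking the best of many attempts on a reused test set looks better than it is.

```python
import random


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


rng = random.Random(0)

# Labels are pure coin flips, so no model can truly beat 50% accuracy.
public_labels = [rng.randint(0, 1) for _ in range(100)]    # reused "leaderboard" set
fresh_labels = [rng.randint(0, 1) for _ in range(10_000)]  # truly new data

best_public, best_seed = -1.0, None
for seed in range(1, 201):  # 200 "submissions", each just a random guesser
    model_rng = random.Random(seed)
    preds = [model_rng.randint(0, 1) for _ in public_labels]
    acc = accuracy(preds, public_labels)
    if acc > best_public:
        best_public, best_seed = acc, seed

# Score the winning guesser on fresh data: it has no signal, so it keeps guessing.
winner_rng = random.Random(best_seed + 10_000)
fresh_preds = [winner_rng.randint(0, 1) for _ in fresh_labels]
fresh_acc = accuracy(fresh_preds, fresh_labels)

print(f"best public accuracy: {best_public:.2f}, on fresh data: {fresh_acc:.2f}")
```

The selected "winner" scores well above chance on the reused set purely through selection, and drops back to roughly 50% on new data.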

Under model deployment: mention docker? mention real-time updates/online 
models? Also I would say some tasks are better for REST and some are better
 for batch. You want to know what coupons to mail this week - that's batch and 
in-DB, typically apps/applications are what need REST. So I personally would
 phrase it as "think about your scoring paradigm before you get started."

Yes, agreed. So far this repo has stayed at a very high level, but as we develop the content these things will be important to mention.
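
As a rough sketch of the two scoring paradigms: the same scoring function can sit behind a batch job or behind a per-request handler. Everything here is hypothetical (the `score` formula stands in for a trained model, and `handle_request` only mimics what a REST endpoint handler would do, without any web framework).

```python
import json


def score(record):
    """Hypothetical stand-in for a trained model: returns a redemption probability."""
    return min(1.0, 0.1 + 0.02 * record["purchases_last_month"])


def score_batch(records):
    """Batch paradigm: score a whole table at once, e.g. a weekly coupon mailing."""
    return [{"customer_id": r["customer_id"], "score": score(r)} for r in records]


def handle_request(body):
    """Request paradigm: score one JSON payload, as a REST endpoint handler would."""
    record = json.loads(body)
    return json.dumps({"customer_id": record["customer_id"], "score": score(record)})


customers = [
    {"customer_id": 1, "purchases_last_month": 0},
    {"customer_id": 2, "purchases_last_month": 10},
]
batch_scores = score_batch(customers)
response = handle_request(json.dumps(customers[1]))
print(batch_scores)
print(response)
```

Keeping the model logic in one function and choosing the wrapper per use case is one way to decide on the scoring paradigm up front without duplicating the model code.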

What about business rules? Very common to have business rules on top of 
predictions ... to prevent sending prescription discounts to dead people etc.

Yeah, good point. This is why ML, biz and eng need to work together. Unfortunately, the downside is that it also creates dependencies/entanglement.
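
A minimal sketch of what a rule layer on top of predictions might look like. The field names (`deceased`, `opted_out`), the threshold and the scores are all made up for illustration; only the "deceased" rule comes from the comment above.

```python
def apply_business_rules(customer, model_score, threshold=0.5):
    """Return (send_offer, reason). Rules run first and override the model."""
    if customer.get("deceased"):
        return False, "rule: customer is deceased"
    if customer.get("opted_out"):  # hypothetical second rule
        return False, "rule: customer opted out of mailings"
    if model_score >= threshold:
        return True, "model: score above threshold"
    return False, "model: score below threshold"


# Hypothetical customers paired with hypothetical model scores.
customers = [
    ({"id": 1, "deceased": True}, 0.92),
    ({"id": 2, "opted_out": True}, 0.81),
    ({"id": 3}, 0.77),
    ({"id": 4}, 0.12),
]
decisions = [apply_business_rules(c, s) for c, s in customers]
for (customer, _), (send, reason) in zip(customers, decisions):
    print(customer["id"], send, reason)
```

Returning the reason alongside the decision makes it easier to see later whether a given outcome came from the model or from a rule, which helps a bit with the entanglement.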

In evaluate and monitor, people need to be realistic about how many models
 they have. If you have one or two - this can be a manual process or using some
 home-baked code etc. But what if you have thousands of models? You can't
 realistically use git to manage AND monitor analytically. Things get really crazy
 when you have hundreds of people, using/changing hundreds of models, on 
hundreds of different data sets ... I personally feel that people always underestimate 
this part of the project ... Because there needs to be detailed admin work done
 here just like on the DB side AND the models have to be monitored analytically, 
not just for code changes. I hear "we will just use git" all the time ...

Yes. I don't have much experience with this, but one option to consider: instead of having a model per country, say (~100 models), you could add country as a variable. A model such as GBM could "figure out what's best for each country", though if some countries have more data than others (which is often the case) you can run into the unbalanced class problem (maybe do stratified sampling to mitigate?).

There is a tradeoff: 1 global model vs N "local" models. @jphall663: Is there any literature on that? I can also imagine having both and taking a linear combination (maybe with the coefficient as a hyperparameter optimized via cross-validation).
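
Here is a toy sketch of that linear combination idea. The assumptions are heavy: the "models" are just means (overall vs per country), the data is invented, and the blending weight is picked on a single held-out split rather than with full cross-validation.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical toy data: (country, target). "US" has plenty of rows, "IS" very few.
train = [("US", 10.0), ("US", 12.0), ("US", 11.0), ("US", 9.0), ("IS", 20.0)]
valid = [("US", 10.5), ("US", 11.5), ("IS", 14.0)]

# One global "model": the overall mean of the training targets.
global_pred = mean([y for _, y in train])

# N local "models": one mean per country.
countries = {c for c, _ in train}
local_pred = {c: mean([y for cc, y in train if cc == c]) for c in countries}


def blended(country, alpha):
    """alpha = 1 is purely global, alpha = 0 is purely local."""
    return alpha * global_pred + (1 - alpha) * local_pred[country]


def mse(alpha, rows):
    return mean([(blended(c, alpha) - y) ** 2 for c, y in rows])


# Treat alpha as a hyperparameter and pick it on held-out data.
grid = [i / 10 for i in range(11)]
best_alpha = min(grid, key=lambda a: mse(a, valid))
print(best_alpha, mse(best_alpha, valid), mse(0.0, valid), mse(1.0, valid))
```

On this toy data the blend does better on the held-out rows than either the purely global or the purely local predictions, mostly because the single "IS" training row makes the local estimate noisy. Whether that carries over to real models is exactly the open question.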

"ML needs to be "sold" to the business side" - personally I think it's better 
when ML is requested and supported by the business, i.e. driven from the business side.

Well, this is the first point where maybe we don't agree 100% :) Maybe biz "drives" ML, but biz still does not have the knowledge to develop ML models (or even to judge what's feasible), so you need data scientists or whatever we call them. It can be argued that a DS/ML department (and please let's not call them AI :)) with biz affinity would be better placed in charge, in which case they would still need to "sell" their results, no?

I think another thing to point out is there are 2 different kinds of companies who
 have this down to a science - one is of course the big web companies, but the 
other group is banks and insurers - and what you are describing may work there, 
but they have all these proprietary systems [omitted] set up to help with all the 
regulation, documentation, model management/monitoring needs. Especially in 
the risk domain, things are just waaay different than in web company X. I do think 
they all want to move toward using more open source though ...

Yeah, there are also several companies providing "frameworks" for ML dev/deploy/manage etc. that would also be interesting to discuss here (I'm biased towards open source, and from what I see h2o.ai looks the most promising).
