Comments from Eszter Windhager-Pokol, Senior Data Scientist @Balabit https://www.balabit.com/blog/author/wpe/
I totally agree with @jphall663 [ https://github.com/szilard/ml-prod/issues/3 ] regarding the categorical variables. If there is a way to avoid converting them to a numeric type, don't do it. It is better to use tools and models that handle categorical variables properly.
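As a minimal sketch of the "don't convert" approach, assuming LightGBM and pandas are available (the data and column names are made up for illustration), categorical columns can be passed to the model as pandas `category` dtype instead of being encoded by hand:

```python
import pandas as pd
import lightgbm as lgb

# Made-up data; the point is keeping categorical columns as pandas
# 'category' dtype instead of one-hot or integer encoding them by hand.
df = pd.DataFrame({
    "country": pd.Categorical(["HU", "DE", "HU", "US"] * 25),
    "device":  pd.Categorical(["mobile", "desktop", "mobile", "tablet"] * 25),
    "amount":  [12.5, 40.0, 7.2, 99.9] * 25,
    "label":   [0, 1, 0, 1] * 25,
})
X, y = df.drop(columns="label"), df["label"]

# LightGBM splits on 'category' columns directly, so no manual
# numeric conversion of the categorical variables is needed.
model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X, y)
```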
Additional idea for the training/tuning part: feature selection is very important if you work with a large number of variables. Too many variables can cause running-time and memory problems, and they also add noise to your data, which can lead to overfitting. Even when using models with built-in variable selection (for example, random forest), it is worth reducing the number of input variables to a reasonable size. But do the feature selection carefully: simple heuristics examine the variables one by one, so you might miss important interactions between them.
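One way to score features in the context of the other variables, rather than one by one, is model-based selection; a rough sketch using scikit-learn's `SelectFromModel` wrapped around a random forest (the data here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with many noisy features.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

# The forest evaluates each variable in the presence of the others,
# so it can keep features that matter mainly through interactions.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="median",          # keep the better-scoring half of the features
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)           # fewer columns: less memory, less noise
```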
“Use same tool to deploy” - I completely agree with this one again, especially if you would otherwise have to rewrite your code in another language regularly, after every model refresh. In that case, bugs are guaranteed.
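A small illustration of scoring with the same stack used for training, assuming a scikit-learn model persisted with joblib (names and data are placeholders):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Train and persist the model with the same stack ...
model = RandomForestClassifier(random_state=0).fit(X, y)
joblib.dump(model, "model.joblib")

# ... and in the scoring service load the identical object back,
# instead of re-implementing the model by hand in another language.
scorer = joblib.load("model.joblib")
print(scorer.predict_proba(X[:1]))
```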
Model deployment - FE needs to be replicated: that was a problem in our system as well, but we modified the data pipeline to avoid it. When an input arrives from the live data, we do the FE and store its result before calculating the score. Later, at the model refresh stage, we can reuse the saved, preprocessed data, so there is no (or less) code duplication.
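A toy sketch of that pipeline shape; all function and file names here are hypothetical, and the FE step is just a placeholder:

```python
import json
import math

FEATURE_STORE = "features.jsonl"   # hypothetical append-only feature store

def feature_engineering(raw_event):
    # Placeholder FE step, shared by scoring and retraining.
    return {"amount_log": math.log1p(raw_event["amount"]),
            "is_mobile": int(raw_event["device"] == "mobile")}

def score(model, raw_event):
    features = feature_engineering(raw_event)
    # Persist the engineered features *before* scoring, so the next
    # model refresh can train on exactly this data instead of
    # re-running (or re-implementing) the FE code.
    with open(FEATURE_STORE, "a") as f:
        f.write(json.dumps(features) + "\n")
    return model.predict([list(features.values())])

def load_training_data():
    # At model refresh time, reuse the stored, already-preprocessed records.
    with open(FEATURE_STORE) as f:
        return [json.loads(line) for line in f]
```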
+ It would be interesting to extend all of these points to unsupervised learning algorithms as well. Model evaluation is much more difficult there because of the lack of labels, but if you use your models in production you should still have an idea of how well they work, so monitoring is even more critical in this case.
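One possible way to monitor an unsupervised model without labels is to track drift in its output score distribution; a sketch assuming anomaly scores are logged and SciPy is available (the data below is simulated):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated anomaly scores: a reference window from when the model was
# validated, and the scores produced in production this week.
reference_scores = rng.normal(loc=0.20, scale=0.05, size=1000)
current_scores   = rng.normal(loc=0.35, scale=0.05, size=1000)

# Without labels we cannot measure accuracy directly, but a shift in the
# score distribution is a cheap signal that the model or the data changed.
stat, p_value = ks_2samp(reference_scores, current_scores)
if p_value < 0.01:
    print(f"Score distribution drift detected (KS={stat:.3f}, p={p_value:.2g})")
```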