mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Add benchmark/dataset for classical ML algorithms #188

Open ksangeek opened 5 years ago

ksangeek commented 5 years ago

I don't see any datasets in MLPerf that can be solved with classical machine learning algorithms (e.g. linear or logistic regression, decision trees, random forests, etc.). Some examples of datasets I can reference here are:

  1. https://www.kaggle.com/c/criteo-display-ad-challenge/data for binary classification.
  2. https://www.kaggle.com/c/house-prices-advanced-regression-techniques for regression.

These would be useful for real-world scenarios where interpretability of the prediction is of utmost importance. Generalized linear models have a good share of real-world use for this very reason! I did not find a reference stating that MLPerf is only for deep learning problems, so I think this kind of benchmark/dataset should be added for the democratization of this suite of benchmarks. Thanks!
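For concreteness, here is a minimal sketch (my own illustration, not an official MLPerf reference) of what such a binary-classification workload could look like. It assumes scikit-learn and uses synthetic features as a stand-in for a pre-processed Criteo-style sample, timing the training step the way a benchmark might:

```python
# Minimal sketch of a classical-ML benchmark workload: logistic regression
# for binary classification. The synthetic data below is only a placeholder
# for a tabular dataset such as the Criteo display-ad sample.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 40))  # dense numeric features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

start = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
elapsed = time.perf_counter() - start

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train time: {elapsed:.2f}s, test AUC: {auc:.3f}")
```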

psyhtest commented 5 years ago

I totally agree that ML != DL, but do you have any data on how widely these models are used in production?

ksangeek commented 5 years ago

Well, I think they target different problem spaces (though they sometimes overlap). I can't confidently say much about actual usage in production, but based on the 2018 Kaggle survey (https://www.kaggle.com/paultimothymooney/2018-kaggle-machine-learning-data-science-survey) I still see sizable importance given by data science practitioners to sklearn, random forests and xgboost. There are also promising new players like snapML (https://www.zurich.ibm.com/snapml/) and cuML (https://rapids.ai/) which continue to invest in the classical machine learning space.
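As a rough illustration of the kind of workload these libraries target, here is a minimal sketch (again my own, assuming scikit-learn; the California housing data, fetched and cached by scikit-learn on first use, only stands in for a real benchmark dataset such as the house-prices one above) that times a random-forest regression fit:

```python
# Minimal sketch of a tabular regression workload trained with a random forest.
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

start = time.perf_counter()
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
elapsed = time.perf_counter() - start

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"train time: {elapsed:.2f}s, test MAE: {mae:.3f}")
```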

TheKanter commented 5 years ago

Facebook is quite public about using gradient-boosted decision trees for Sigma, their anomaly detector.

I would strongly support more traditional forms of ML.

David
