nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0

Add OBOE to Rafiki #149

Open Xiuyu-Li opened 5 years ago

Xiuyu-Li commented 5 years ago

Integrate OBOE for TABULAR_CLASSIFICATION task

In the rafiki/advisor/oboe folder:

Some possible changes to be made:

The license and credits for the original OBOE repository and documentation will be added later.

Xiuyu-Li commented 5 years ago

There are duplicate commits in this PR, and it seems this is due to merging upstream during development. I am not sure how to resolve this when merging this PR (use rebase?)

nginyc commented 5 years ago

Hi @lxywizard, sorry for getting to this PR late. Could you resolve the conflicts with dev first?

nudles commented 5 years ago

> There are duplicate commits in this PR, and it seems this is due to merging upstream during development. I am not sure how to resolve this when merging this PR (use rebase?)

The simplest way is as follows:

  1. Copy the files you changed from your repo to another folder.
  2. Fetch the dev branch into your local repo.
  3. Check out the dev branch.
  4. Copy those files back into the repo.
  5. Commit and send the PR to dev.

In the future, you can rebase your branch onto the latest dev branch.
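The rebase route can be sketched as a short sequence of git commands. The sketch below wraps them in Python so they can be printed as a dry run before anything is executed. The remote names `upstream` (nginyc/rafiki) and `origin` (the contributor's fork) and the branch name `oboe` are assumptions for illustration, not names taken from this PR. By default `git rebase` drops merge commits and skips commits whose changes are already in the target branch, which is typically what removes duplicates like the ones reported here.

```python
import subprocess
from typing import List


def rebase_onto_dev_commands(feature_branch: str, remote: str = "upstream",
                             base: str = "dev") -> List[List[str]]:
    """Return git commands (as argv lists) that rebase a feature branch onto remote/base.

    Remote and branch names are assumptions; adjust them to your own setup.
    """
    return [
        ["git", "fetch", remote, base],           # download the latest dev branch
        ["git", "checkout", feature_branch],      # switch to the PR branch
        ["git", "rebase", f"{remote}/{base}"],    # replay the PR commits on top of dev
        # Rewrite the PR branch on the fork; --force-with-lease refuses to
        # overwrite commits someone else pushed in the meantime.
        ["git", "push", "--force-with-lease", "origin", feature_branch],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError, e.g. when the rebase hits conflicts.
            subprocess.run(cmd, check=True)


run(rebase_onto_dev_commands("oboe"))  # hypothetical branch name; prints only
```

If the rebase stops on a conflict, resolve the files, `git add` them, and continue with `git rebase --continue` before pushing.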

pinpom commented 5 years ago

Hi @lxywizard, I tried to run your XgbClf model, but it failed. Error message: `2019-10-04 14:06:06,534 rafiki.utils.service INFO Starting worker "9b3218bda06b" for service of ID "a166b1cf-ad69-4666-be27-fdb763e51885"... 2019-10-04 14:06:07,193 rafiki.worker.train INFO Reading job info from meta store... 2019-10-04 14:06:07,202 rafiki.worker.train INFO Using model "XgbClf"... 2019-10-04 14:06:07,615 rafiki.redis.redis INFO Connecting to Redis at namespace TRAIN:157c0333-32b3-439e-86e2-87a984bdb241... 2019-10-04 14:06:07,615 rafiki.redis.redis INFO Connecting to Redis at namespace PARAMS:157c0333-32b3-439e-86e2-87a984bdb241... 2019-10-04 14:06:07,615 rafiki.worker.train INFO Starting worker for sub train job "157c0333-32b3-439e-86e2-87a984bdb241"... 2019-10-04 14:06:08,037 rafiki.worker.train INFO Starting trial 7859f769-b4e4-42ba-adc2-42a8741baf27 with proposal {'trial_no': 1, 'knobs': {'n_estimators': 122, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.570998379009159, 'subsample': 0.5633938164447414, 'colsample_bytree': 0.5589119180136076}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': '7859f769-b4e4-42ba-adc2-42a8741baf27'}... 2019-10-04 14:06:08,038 rafiki.worker.train INFO Marking trial as running in store... 2019-10-04 14:06:08,059 rafiki.worker.train INFO Creating model instance... 2019-10-04 14:06:08,066 rafiki.worker.train INFO Training model... 
2019-10-04 14:06:16,468 rafiki.worker.train ERROR Error while running trial: 2019-10-04 14:06:16,477 rafiki.worker.train ERROR Traceback (most recent call last): File "/root/rafiki/worker/train.py", line 113, in _perform_trial self._train_model(model_inst, proposal, shared_params) File "/root/rafiki/worker/train.py", line 177, in _train_model model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {})) File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score return accuracy_score(y, self.predict(X), sample_weight=sample_weight) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score y_type, y_true, y_pred = _check_targets(y_true, y_pred) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets raise ValueError("{0} is not supported".format(y_type)) ValueError: continuous is not supported

2019-10-04 14:06:16,484 rafiki.worker.train INFO Marking trial as errored in store... 2019-10-04 14:06:16,503 rafiki.redis.train_cache INFO Creating result "{'proposal': {'trial_no': 1, 'knobs': {'n_estimators': 122, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.570998379009159, 'subsample': 0.5633938164447414, 'colsample_bytree': 0.5589119180136076}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': '7859f769-b4e4-42ba-adc2-42a8741baf27'}, 'score': None}" for worker "9b3218bda06b"... 2019-10-04 14:06:16,504 rafiki.redis.train_cache INFO Deleting existing proposal for worker "9b3218bda06b"... 2019-10-04 14:06:16,605 rafiki.worker.train INFO Starting trial f6214ca4-6222-4836-8575-2a73cf3fe36c with proposal {'trial_no': 2, 'knobs': {'n_estimators': 59, 'min_child_weight': 6, 'max_depth': 5, 'gamma': 0.039364911210974414, 'subsample': 0.8211606352017686, 'colsample_bytree': 0.37145979412867836}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'f6214ca4-6222-4836-8575-2a73cf3fe36c'}... 2019-10-04 14:06:16,605 rafiki.worker.train INFO Marking trial as running in store... 2019-10-04 14:06:16,620 rafiki.worker.train INFO Creating model instance... 2019-10-04 14:06:16,626 rafiki.worker.train INFO Training model... 
2019-10-04 14:06:19,728 rafiki.worker.train ERROR Error while running trial: 2019-10-04 14:06:19,734 rafiki.worker.train ERROR Traceback (most recent call last): File "/root/rafiki/worker/train.py", line 113, in _perform_trial self._train_model(model_inst, proposal, shared_params) File "/root/rafiki/worker/train.py", line 177, in _train_model model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {})) File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score return accuracy_score(y, self.predict(X), sample_weight=sample_weight) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score y_type, y_true, y_pred = _check_targets(y_true, y_pred) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets raise ValueError("{0} is not supported".format(y_type)) ValueError: continuous is not supported

2019-10-04 14:06:19,738 rafiki.worker.train INFO Marking trial as errored in store... 2019-10-04 14:06:19,748 rafiki.redis.train_cache INFO Creating result "{'proposal': {'trial_no': 2, 'knobs': {'n_estimators': 59, 'min_child_weight': 6, 'max_depth': 5, 'gamma': 0.039364911210974414, 'subsample': 0.8211606352017686, 'colsample_bytree': 0.37145979412867836}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'f6214ca4-6222-4836-8575-2a73cf3fe36c'}, 'score': None}" for worker "9b3218bda06b"... 2019-10-04 14:06:19,748 rafiki.redis.train_cache INFO Deleting existing proposal for worker "9b3218bda06b"... 2019-10-04 14:06:19,848 rafiki.worker.train INFO Starting trial aef0a1a3-33cb-44f3-ba02-6606aad1afe4 with proposal {'trial_no': 3, 'knobs': {'n_estimators': 156, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.1988870869712323, 'subsample': 0.974919369835241, 'colsample_bytree': 0.6632012923087245}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'aef0a1a3-33cb-44f3-ba02-6606aad1afe4'}... 2019-10-04 14:06:19,849 rafiki.worker.train INFO Marking trial as running in store... 2019-10-04 14:06:19,863 rafiki.worker.train INFO Creating model instance... 2019-10-04 14:06:19,870 rafiki.worker.train INFO Training model... 
2019-10-04 14:06:28,211 rafiki.worker.train ERROR Error while running trial: 2019-10-04 14:06:28,217 rafiki.worker.train ERROR Traceback (most recent call last): File "/root/rafiki/worker/train.py", line 113, in _perform_trial self._train_model(model_inst, proposal, shared_params) File "/root/rafiki/worker/train.py", line 177, in _train_model model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {})) File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score return accuracy_score(y, self.predict(X), sample_weight=sample_weight) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score y_type, y_true, y_pred = _check_targets(y_true, y_pred) File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets raise ValueError("{0} is not supported".format(y_type)) ValueError: continuous is not supported`.

Please check.

Xiuyu-Li commented 4 years ago

@pinpom It seems this issue is caused by running classification models on a dataset meant for regression tasks. Could you try again with another dataset, such as titanic? Or could you tell me which dataset you used when you hit this issue?
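The `ValueError: continuous is not supported` in the traceback comes from scikit-learn's `accuracy_score`, which rejects targets that its `type_of_target` helper classifies as `"continuous"`, i.e. float values with a fractional part. A minimal pure-Python sketch of that rule, which could be used to sanity-check a target column before starting a classification job, is below. `looks_continuous` is a hypothetical helper, not part of Rafiki or scikit-learn, and it only inspects plain Python floats (NumPy `float64` values also qualify, since they subclass `float`).

```python
from typing import Iterable


def looks_continuous(values: Iterable) -> bool:
    """Approximate scikit-learn's "continuous" target rule.

    A target counts as continuous when it contains a float with a non-zero
    fractional part. Whole-number floats such as 1.0 still count as class
    labels, as they do in scikit-learn.
    """
    return any(isinstance(v, float) and not v.is_integer() for v in values)


# 0/1 labels (like a "Survived" column) are fine for classification.
print(looks_continuous([0, 1, 1, 0]))        # False
print(looks_continuous([0.0, 1.0, 1.0]))     # False
# Fare-like values would make accuracy_score raise "continuous is not supported".
print(looks_continuous([7.25, 71.28, 8.05]))  # True
```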

pinpom commented 4 years ago

> @pinpom It seems this issue is caused by running classification models on a dataset meant for regression tasks. Could you try again with another dataset, such as titanic? Or could you tell me which dataset you used when you hit this issue?

@lxywizard I used the titanic dataset. For your reference, here are the details of my train job:

`client.get_train_jobs_of_app(app='titanic_app')`

`[{'app': 'titanic_app', 'app_version': 1, 'budget': {'GPU_COUNT': 0, 'MODEL_TRIAL_COUNT': 3, 'TIME_HOURS': 0.1}, 'datetime_started': 'Fri, 04 Oct 2019 14:05:49 GMT', 'datetime_stopped': 'Fri, 04 Oct 2019 14:06:32 GMT', 'id': 'ae450d32-b6e5-4d6c-ad89-3eb42ea58ed7', 'status': 'STOPPED', 'task': 'TABULAR_CLASSIFICATION', 'train_args': None, 'train_dataset_id': '046b5c27-9896-4fbb-8442-fde092a0d3f3', 'val_dataset_id': 'efdbc98f-cb59-40ee-8b9a-511b3696bc6a'}]`

Xiuyu-Li commented 4 years ago

@pinpom It seems you did not pass anything as train_args when starting the training. Try using train_args={'model_selector': 'oboe', 'features': ['Pclass', 'Sex', 'Age'], 'target':'Survived'} and see whether the error still occurs. I have also provided a sample for testing.
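In this thread the train job was created with `'train_args': None` (visible in the job listing above), and supplying `features` and `target` fixed the run. A small pre-flight check on the dict could catch that before a job is submitted. `validate_train_args` is a hypothetical helper, not part of Rafiki, and it only checks the keys used in this thread; whatever else Rafiki's models accept in `train_args` is not covered.

```python
from typing import Any, Dict, List, Optional


def validate_train_args(train_args: Optional[Dict[str, Any]]) -> List[str]:
    """Return a list of problems with train_args; an empty list means it looks usable.

    Hypothetical check covering only the 'features' and 'target' keys
    discussed in this thread.
    """
    if not train_args:
        return ["train_args is empty; pass 'features' and 'target' explicitly"]
    problems = []
    for key in ("features", "target"):
        if key not in train_args:
            problems.append(f"missing '{key}'")
    # The label column should not also be fed to the model as an input feature.
    if train_args.get("target") in (train_args.get("features") or []):
        problems.append("target column must not also be a feature")
    return problems


# The train_args suggested in this thread for the titanic dataset.
titanic_args = {'model_selector': 'oboe', 'features': ['Pclass', 'Sex', 'Age'], 'target': 'Survived'}
print(validate_train_args(titanic_args))  # []
print(validate_train_args(None))          # ["train_args is empty; pass 'features' and 'target' explicitly"]
```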

pinpom commented 4 years ago

@lxywizard Oh yes, sorry, my mistake. Thanks for pointing it out. I managed to run it successfully.

Xiuyu-Li commented 4 years ago

@pinpom No worries. Let me know if there are any other issues or changes you want me to make.