Add OBOE to Rafiki

Xiuyu-Li commented 5 years ago

Integrate OBOE for the TABULAR_CLASSIFICATION task

OBOE is an AutoML algorithm that applies collaborative filtering to model selection and initial hyperparameter tuning (see #137). In Rafiki, it serves as a "warm start": it selects the models and part of the hyperparameter config for the first trial of Rafiki's model tuning.
To use OBOE, add 'model_selector': 'oboe' to train_args when creating a TABULAR_CLASSIFICATION train job.

In the rafiki/advisor/oboe folder:

The automl/defaults folder contains the configuration file classification.json for all current Rafiki TABULAR_CLASSIFICATION models, along with the matrix OBOE needs. It can be applied to any set of the current Rafiki TABULAR_CLASSIFICATION models.
If a model developer wants to add a new TABULAR_CLASSIFICATION model (with class name new_model) to Rafiki and use OBOE with it, the matrix and config need to be updated. The model should be imported at the top of automl/util as new_model, and automl/defaults/classification.json should be edited to include the new model and its selected hyperparameter config.
Note that OBOE uses every combination of the hyperparameter config during matrix generation, so the number of hyperparameter values should be chosen carefully based on the number of datasets in error_matrix_generation/dataset.
Running the error_matrix_generation/start_matrix_generation.sh script updates OBOE automatically.
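
To get a feel for how quickly the matrix generation grows, here is a minimal sketch that counts the configurations produced when every combination of a hyperparameter grid is used. The grid, the dataset count, and the helper names `count_configs` and `enumerate_configs` are all hypothetical and are not taken from classification.json or the OBOE code. It also assumes each (dataset, configuration) pair costs at least one evaluation.

```python
from itertools import product

# Hypothetical grid for a single model; the names and values are
# illustrative only and are NOT copied from classification.json.
grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "subsample": [0.5, 0.75, 1.0],
}


def count_configs(grid):
    """Number of configurations when every combination is taken."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total


def enumerate_configs(grid):
    """Yield each combination as a dict of hyperparameter name -> value."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


n_configs = count_configs(grid)    # 3 * 3 * 3 combinations
n_datasets = 40                    # hypothetical number of datasets
# Assumption: at least one evaluation per (dataset, configuration) pair.
n_evaluations = n_configs * n_datasets
print(n_configs, n_evaluations)
```

Adding one more hyperparameter with three values would triple both numbers, which is why the grid size should be weighed against the number of datasets.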

Some possible changes to be made:

Currently OBOE runs in the rafikiai/rafiki_admin Docker image. The workflow may be more logical if a model selection pipeline is developed in rafikiai/rafiki_worker.
The update procedure for newly created models should be simplified.
More models and support for TABULAR_REGRESSION should be added.

The license and credits to the original OBOE repository and documentation will be added later.

Xiuyu-Li commented 5 years ago

There are duplicate commits in this PR and it seems like this is due to merging upstream when doing the development. I am not sure how to solve this when merging this PR (use rebase?)

nginyc commented 5 years ago

hi @lxywizard, sorry for attending to this PR late. Are you able to resolve the conflicts with dev first?

nudles commented 5 years ago

> There are duplicate commits in this PR and it seems like this is due to merging upstream when doing the development. I am not sure how to solve this when merging this PR (use rebase?)

The simplest way is as follows:

1. Copy the files you changed from your repo to another folder.
2. Fetch the dev branch into your local repo.
3. Check out the dev branch.
4. Copy those files back into the repo.
5. Commit and send the PR to dev.

In the future, you can rebase your branch onto the latest dev branch.

pinpom commented 5 years ago

Hi @lxywizard, I tried to run your model XgbClf but it failed. Error message:

```
2019-10-04 14:06:06,534 rafiki.utils.service INFO Starting worker "9b3218bda06b" for service of ID "a166b1cf-ad69-4666-be27-fdb763e51885"...
2019-10-04 14:06:07,193 rafiki.worker.train INFO Reading job info from meta store...
2019-10-04 14:06:07,202 rafiki.worker.train INFO Using model "XgbClf"...
2019-10-04 14:06:07,615 rafiki.redis.redis INFO Connecting to Redis at namespace TRAIN:157c0333-32b3-439e-86e2-87a984bdb241...
2019-10-04 14:06:07,615 rafiki.redis.redis INFO Connecting to Redis at namespace PARAMS:157c0333-32b3-439e-86e2-87a984bdb241...
2019-10-04 14:06:07,615 rafiki.worker.train INFO Starting worker for sub train job "157c0333-32b3-439e-86e2-87a984bdb241"...
2019-10-04 14:06:08,037 rafiki.worker.train INFO Starting trial 7859f769-b4e4-42ba-adc2-42a8741baf27 with proposal {'trial_no': 1, 'knobs': {'n_estimators': 122, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.570998379009159, 'subsample': 0.5633938164447414, 'colsample_bytree': 0.5589119180136076}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': '7859f769-b4e4-42ba-adc2-42a8741baf27'}...
2019-10-04 14:06:08,038 rafiki.worker.train INFO Marking trial as running in store...
2019-10-04 14:06:08,059 rafiki.worker.train INFO Creating model instance...
2019-10-04 14:06:08,066 rafiki.worker.train INFO Training model...
2019-10-04 14:06:16,468 rafiki.worker.train ERROR Error while running trial:
2019-10-04 14:06:16,477 rafiki.worker.train ERROR Traceback (most recent call last):
  File "/root/rafiki/worker/train.py", line 113, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 177, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: continuous is not supported

2019-10-04 14:06:16,484 rafiki.worker.train INFO Marking trial as errored in store...
2019-10-04 14:06:16,503 rafiki.redis.train_cache INFO Creating result "{'proposal': {'trial_no': 1, 'knobs': {'n_estimators': 122, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.570998379009159, 'subsample': 0.5633938164447414, 'colsample_bytree': 0.5589119180136076}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': '7859f769-b4e4-42ba-adc2-42a8741baf27'}, 'score': None}" for worker "9b3218bda06b"...
2019-10-04 14:06:16,504 rafiki.redis.train_cache INFO Deleting existing proposal for worker "9b3218bda06b"...
2019-10-04 14:06:16,605 rafiki.worker.train INFO Starting trial f6214ca4-6222-4836-8575-2a73cf3fe36c with proposal {'trial_no': 2, 'knobs': {'n_estimators': 59, 'min_child_weight': 6, 'max_depth': 5, 'gamma': 0.039364911210974414, 'subsample': 0.8211606352017686, 'colsample_bytree': 0.37145979412867836}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'f6214ca4-6222-4836-8575-2a73cf3fe36c'}...
2019-10-04 14:06:16,605 rafiki.worker.train INFO Marking trial as running in store...
2019-10-04 14:06:16,620 rafiki.worker.train INFO Creating model instance...
2019-10-04 14:06:16,626 rafiki.worker.train INFO Training model...
2019-10-04 14:06:19,728 rafiki.worker.train ERROR Error while running trial:
2019-10-04 14:06:19,734 rafiki.worker.train ERROR Traceback (most recent call last):
  File "/root/rafiki/worker/train.py", line 113, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 177, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: continuous is not supported

2019-10-04 14:06:19,738 rafiki.worker.train INFO Marking trial as errored in store...
2019-10-04 14:06:19,748 rafiki.redis.train_cache INFO Creating result "{'proposal': {'trial_no': 2, 'knobs': {'n_estimators': 59, 'min_child_weight': 6, 'max_depth': 5, 'gamma': 0.039364911210974414, 'subsample': 0.8211606352017686, 'colsample_bytree': 0.37145979412867836}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'f6214ca4-6222-4836-8575-2a73cf3fe36c'}, 'score': None}" for worker "9b3218bda06b"...
2019-10-04 14:06:19,748 rafiki.redis.train_cache INFO Deleting existing proposal for worker "9b3218bda06b"...
2019-10-04 14:06:19,848 rafiki.worker.train INFO Starting trial aef0a1a3-33cb-44f3-ba02-6606aad1afe4 with proposal {'trial_no': 3, 'knobs': {'n_estimators': 156, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 0.1988870869712323, 'subsample': 0.974919369835241, 'colsample_bytree': 0.6632012923087245}, 'params_type': 'NONE', 'to_eval': True, 'to_cache_params': False, 'to_save_params': True, 'meta': {'proposal_type': 'SEARCH'}, 'trial_id': 'aef0a1a3-33cb-44f3-ba02-6606aad1afe4'}...
2019-10-04 14:06:19,849 rafiki.worker.train INFO Marking trial as running in store...
2019-10-04 14:06:19,863 rafiki.worker.train INFO Creating model instance...
2019-10-04 14:06:19,870 rafiki.worker.train INFO Training model...
2019-10-04 14:06:28,211 rafiki.worker.train ERROR Error while running trial:
2019-10-04 14:06:28,217 rafiki.worker.train ERROR Traceback (most recent call last):
  File "/root/rafiki/worker/train.py", line 113, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 177, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/XgbClf-05ff1472-5ff7-4b69-a1cd-898366d48b02.py", line 72, in train
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/base.py", line 357, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: continuous is not supported
```

Please check.

Xiuyu-Li commented 4 years ago

@pinpom It seems like this issue is caused by running classification models on a dataset meant for regression tasks. Can you try again with another dataset, such as titanic? Or can you specify which dataset you used when you hit this issue?
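
For context, scikit-learn raises "continuous is not supported" from accuracy_score when the labels it receives look like real-valued numbers instead of class labels. The sketch below is a simplified, standard-library approximation of that idea, useful as a quick sanity check on a target column before training. The helper name `looks_continuous` and the sample values are hypothetical, and this is not scikit-learn's actual implementation.

```python
def looks_continuous(labels):
    """Return True if any label is a float with a fractional part.

    Simplified approximation of the idea behind scikit-learn's
    target-type check; NOT the library's real implementation.
    """
    # float.is_integer() is False for values such as 7.25, NaN and inf.
    return any(isinstance(v, float) and not v.is_integer() for v in labels)


# Illustrative values only (hypothetical, not read from a real dataset).
class_labels = [0, 1, 1, 0]            # a binary class column
real_values = [7.25, 71.2833, 8.05]    # a real-valued column

print(looks_continuous(class_labels))
print(looks_continuous(real_values))
```

If the column used as the target fails a check like this, the classifier is being scored against regression-style values.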

pinpom commented 4 years ago

> @pinpom It seems like this issue is caused by running classification models on a dataset meant for regression tasks. Can you try again with another dataset, such as titanic? Or can you specify which dataset you used when you hit this issue?

@lxywizard I used the titanic dataset. For your reference, below are the details of my train job:

client.get_train_jobs_of_app(app='titanic_app')
[{'app': 'titanic_app',
  'app_version': 1,
  'budget': {'GPU_COUNT': 0, 'MODEL_TRIAL_COUNT': 3, 'TIME_HOURS': 0.1},
  'datetime_started': 'Fri, 04 Oct 2019 14:05:49 GMT',
  'datetime_stopped': 'Fri, 04 Oct 2019 14:06:32 GMT',
  'id': 'ae450d32-b6e5-4d6c-ad89-3eb42ea58ed7',
  'status': 'STOPPED',
  'task': 'TABULAR_CLASSIFICATION',
  'train_args': None,
  'train_dataset_id': '046b5c27-9896-4fbb-8442-fde092a0d3f3',
  'val_dataset_id': 'efdbc98f-cb59-40ee-8b9a-511b3696bc6a'}]

Xiuyu-Li commented 4 years ago

@pinpom It seems like you did not put anything into train_args when creating the train job. Try using train_args={'model_selector': 'oboe', 'features': ['Pclass', 'Sex', 'Age'], 'target': 'Survived'} and see if the error still occurs. I have also provided a sample you can use for testing.
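
As a minimal sketch of assembling those arguments, the helper below builds the same dict from a feature list and a target column. `build_train_args` is a hypothetical convenience function, not part of Rafiki, and how the resulting dict is passed to the client when the train job is created is not shown in this thread.

```python
def build_train_args(features, target, use_oboe=True):
    """Assemble a train_args dict (hypothetical helper, not a Rafiki API)."""
    if target in features:
        raise ValueError("the target column must not also be a feature")
    train_args = {"features": list(features), "target": target}
    if use_oboe:
        # Key and value as given in this thread for enabling OBOE.
        train_args["model_selector"] = "oboe"
    return train_args


train_args = build_train_args(["Pclass", "Sex", "Age"], "Survived")
print(train_args)
```

Leaving train_args as None, as in the failed job above, means neither the features nor the target column is specified.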

pinpom commented 4 years ago

@lxywizard oh yes, sorry my mistake. Thanks for pointing it out. I managed to run it successfully.

Xiuyu-Li commented 4 years ago

@pinpom No worries. Let me know if there are any other issues or changes you want me to make.

nginyc / rafiki

Add OBOE to Rafiki #149