
A few questions on running FLAML distributedly via Ray on compute clusters in AzureML #769

Closed · flippercy closed this issue 2 years ago

flippercy commented 2 years ago

Hi @sonichi:

My team is running FLAML distributed via Ray on compute clusters in AzureML. We have a few questions, since we have never used FLAML in this environment before, and hope you can provide some insights.

  1. In this setting, what is the best way to log and register the best model returned for each learner in AzureML using mlflow? Shall we simply do mlflow.sklearn.log_model(automl.best_model_for_estimator('LearnerA'), "BestModelLearnerA") and then mlflow.register_model(model_uri=f"{run.info.artifact_uri}/LearnerA", name='flaml-LearnerA')?

  2. Where is the log file, and how do we change its directory?

Thank you!

sonichi commented 2 years ago
  1. It looks good to me. @ruizhuanguw @prithvikannan @mshtelma what do you think?
  2. How did you submit the AzureML job? When you open an AzureML experiment, there is usually a tab [screenshot of the experiment tabs]

    under which you will find folders in the left panel. There is a folder named "outputs", and you can write your log file there.

    [screenshot of the "outputs" folder]

    For example, you can run automl.fit(log_file_name="outputs/your_log_file.log") in an AzureML job.
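
    For context, a minimal sketch of a training script that writes the FLAML search log into the "outputs" folder (the dataset and settings below are placeholders, not from this discussion):

    from flaml import AutoML
    from sklearn.datasets import load_breast_cancer

    # placeholder data; replace with your own training set
    X, y = load_breast_cancer(return_X_y=True)

    automl = AutoML()
    # files written under "outputs/" are uploaded with the AzureML run
    automl.fit(
        X_train=X,
        y_train=y,
        task="classification",
        time_budget=60,
        log_file_name="outputs/flaml.log",
    )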

ruizhuanguw commented 2 years ago
  1. I have not tried registering a model using mlflow before, but it looks right according to the Azure ML doc.
  2. I second @sonichi's suggestion.
prithvikannan commented 2 years ago

@sonichi agreed, that mlflow snippet should work. @flippercy let me know if you have any issues.

flippercy commented 2 years ago

@prithvikannan, @sonichi, Does the following code make sense?

import mlflow
from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

automl.fit(X_train=data_dev_balanced_B_X, y_train=data_dev_balanced_B_y, sample_weight=data_dev_balanced_B_w,
           X_val=data_val_balanced_B_X, y_val=data_val_balanced_B_y, sample_weight_val=data_val_balanced_B_w,
           **settings)

mlflow.sklearn.log_model(automl, "automl")
mlflow.sklearn.log_model(automl.best_model_for_estimator('MylightGBM'), "MylightGBM")
mlflow.sklearn.log_model(automl.best_model_for_estimator('MyXgboost'), "MyXgboost")

mlflow.register_model(model_uri=f"{run.info.artifact_uri}/MylightGBM", name='flaml-distributed-MylightGBM')
mlflow.register_model(model_uri=f"{run.info.artifact_uri}/MyXgboost", name='flaml-distributed-MyXgboost')

Thank you.

sonichi commented 2 years ago

Yes, it looks good to me.

flippercy commented 2 years ago

@sonichi, @prithvikannan:

Thank you very much for the reply. A few more questions:

  1. Usually we use several customized learners with FLAML. If we only use one learner, no matter which one, the automl process can run distributed and the model can be registered. However, if we use more than one learner, such as:

"estimator_list": ['LearnerA', 'LearnerB'],

the automl process still finishes with no issue, but when logging and registering the models there is always a strange error message:

[screenshot of the error message]

Any idea on what it means? Not sure which parameter it refers to.

  2. If we only use one learner, although the model can be built, logged and registered successfully, there is always an error at the end of the process stating that the pkl file already exists.

"message": "User program failed with Exception: UserError: Resource Conflict: ArtifactId ExperimentRun/dcid.distribute-automl-v2_1666366028_5f2255a6/automl/model.pkl already exists.",

Any idea on what causes this?

Appreciate your help.

prithvikannan commented 2 years ago

Re: the error with param length, params are bounded to be <255 bytes. My guess is that https://github.com/microsoft/FLAML/blob/58227a976bdbf91617a639831f1ed247bcc41538/flaml/automl.py#L3113 is the line that causes the issue, since it logs all of the automl params in a dict. @flippercy can you share the stack trace? @sonichi, since we are already logging params individually, can we remove this line?

Re: artifact logging, my guess is this is some sort of race condition. I am not familiar with Ray-FLAML interaction, so I'm not of much use here.
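
To illustrate the limit being discussed: the tracking backend rejects any single param value longer than 255 characters. A minimal sketch (the helper name is hypothetical) of truncating values before logging:

import mlflow

def log_params_truncated(params, max_len=255):
    # stringify and truncate each value so it stays under MLflow's per-value limit
    safe = {key: str(value)[:max_len] for key, value in params.items()}
    mlflow.log_params(safe)

# a long value (e.g. a full automl settings dict) would otherwise trigger
# "No more than 255 characters per params Value"
log_params_truncated({"automl_settings": {"estimator_list": ["LearnerA", "LearnerB"], "time_budget": 3600}})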

flippercy commented 2 years ago

@prithvikannan @sonichi

Here is the stack trace:

Warnings: AzureMLCompute job failed. JobFailed: Submitted script failed with a non-zero exit code; see the driver log file for details. Reason: Job failed with non-zero exit Code


ActivityFailedException                   Traceback (most recent call last)
Input In [9], in <cell line: 19>()
     16 run = exp.submit(config)
     18 print(run.get_portal_url()) # link to ml.azure.com
---> 19 run.wait_for_completion(show_output=True)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/run.py:843, in Run.wait_for_completion(self, show_output, wait_post_processing, raise_on_error)
    841 if show_output:
    842     try:
--> 843         self._stream_run_output(
    844             file_handle=sys.stdout,
    845             wait_post_processing=wait_post_processing,
    846             raise_on_error=raise_on_error)
    847         return self.get_details()
    848     except KeyboardInterrupt:

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/run.py:1096, in Run._stream_run_output(self, file_handle, wait_post_processing, raise_on_error)
   1094     file_handle.write("\n")
   1095 else:
-> 1096     raise ActivityFailedException(error_details=json.dumps(error, indent=4))
   1098 file_handle.write("\n")
   1099 file_handle.flush()

ActivityFailedException: ActivityFailedException:
  Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "User program failed with RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'No more than 255 characters per params Value. Request contains 1 of greater length.', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': '9f1081f11e8cf0a21df39f7f0a5b4b7a', 'request': '34711cbb5b6fc9b2'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2022-10-21T20:20:14.3227161+00:00', 'ComponentName': 'mlflow', 'error_code': 'INVALID_PARAMETER_VALUE'}",
        "messageParameters": {},
        "detailsUri": "https://aka.ms/azureml-run-troubleshooting",
        "details": []
    },
    "time": "0001-01-01T00:00:00.000Z"
}
  InnerException None
  ErrorResponse
{ "error": { "message": "Activity Failed:\n{\n \"error\": {\n \"code\": \"UserError\",\n \"message\": \"User program failed with RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'No more than 255 characters per params Value. Request contains 1 of greater length.', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': '9f1081f11e8cf0a21df39f7f0a5b4b7a', 'request': '34711cbb5b6fc9b2'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2022-10-21T20:20:14.3227161+00:00', 'ComponentName': 'mlflow', 'error_code': 'INVALID_PARAMETER_VALUE'}\",\n \"messageParameters\": {},\n \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n \"details\": []\n },\n \"time\": \"0001-01-01T00:00:00.000Z\"\n}" } }

sonichi commented 2 years ago

> Re: the error with param length, params are bounded to be <255 bytes. My guess is that https://github.com/microsoft/FLAML/blob/58227a976bdbf91617a639831f1ed247bcc41538/flaml/automl.py#L3113 is the line that causes the issue, since it logs all of the automl params in a dict. @flippercy can you share the stack trace? @sonichi, since we are already logging params individually, can we remove this line?

I doubt that's the cause because (1) there is no error when one learner is used, and (2) the error happens after automl.fit(). It happens when logging the model.

> Re: artifact logging, my guess is this is some sort of race condition. I am not familiar with Ray-FLAML interaction, so I'm not of much use here.

Does this error happen only when ray is used? @flippercy which line causes the error?

sonichi commented 2 years ago

@flippercy Another question: Does the logging and registering code work for non-flaml models?

ruizhuanguw commented 2 years ago

@flippercy It would be very helpful if you could provide a minimal reproducible example.

flippercy commented 2 years ago

@sonichi @ruizhuanguw @prithvikannan Thank you for the reply guys! I am off today and will give you some updates tomorrow.

flippercy commented 2 years ago

@sonichi , @ruizhuanguw and @prithvikannan:

I did a few tests with both the default learners and customized learners; the results are confusing:

  1. If I use customized learners without Ray on a VM, there is no issue at all with the logging and registering code.
  2. If I use customized learners with Ray on a compute cluster, I get the aforementioned error and the models cannot be logged.
  3. If I use default learners with Ray on a compute cluster, I still get the aforementioned error, but the models can be logged and registered successfully.
  4. If I use non-flaml models such as lightgbm, there is no issue with the logging and registering code.

I can confirm that the problem happens when trying to log models using mlflow.sklearn.log_model. My customized learners all look like the one below:

class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):

        super().__init__(task, **params)

        self.estimator_class = XGBClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'use_label_encoder': params['use_label_encoder'] if 'use_label_encoder' in params else False,
            'eval_metric': params['eval_metric'] if 'eval_metric' in params else 'auc',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints": params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):

        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=8), 'init_value': 8, 'low_cost_init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=500), 'init_value': 200, 'low_cost_init_value': 200},
            'min_child_weight': {'domain': tune.loguniform(lower=0.00001, upper=10), 'init_value': 0.001, 'low_cost_init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7, 'low_cost_init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8, 'low_cost_init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.00001, upper=1), 'init_value': 0.01, 'low_cost_init_value': 0.01},
            'gamma': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001, 'low_cost_init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.0001, upper=2), 'init_value': 1, 'low_cost_init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001, 'low_cost_init_value': 0.000000000001},
        }
        return space

automl.add_learner(learner_name = 'MyXgboost', learner_class = MyMonotonicXGBGBTreeClassifier)

sonichi commented 2 years ago

Does **settings of AutoML.fit() contain model_history=True?
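
For reference, model_history is passed to AutoML.fit through the settings dict; a minimal sketch (the other values are illustrative, not taken from the original run):

settings = {
    "time_budget": 3600,
    "metric": "roc_auc",
    "task": "classification",
    "estimator_list": ["MylightGBM", "MyXgboost"],
    "n_concurrent_trials": 2,   # distribute trials over the Ray cluster
    "model_history": True,      # keep the best model found for each estimator
    "log_file_name": "outputs/flaml.log",
}
automl.fit(X_train=X_train, y_train=y_train, **settings)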

flippercy commented 2 years ago

@sonichi Yes, model_history = True is specified.

Since the behaviors of the default learners and the customized learners are different (although the same error is returned for both, models built with default learners can at least be registered), I suspect that there is something wrong with my customized learners when used with Ray on compute clusters. Could you take a look, please?

Thank you.

sonichi commented 2 years ago

I tried to reproduce the error but couldn't. Things work fine when I use your example custom learner. My code is:

from ray_on_aml.core import Ray_On_AML
from flaml import AutoML, tune
from flaml.model import BaseEstimator
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import mlflow
from xgboost import XGBClassifier

data, target = load_breast_cancer(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
num_cores = 1
randomseed = 42
monotone = None

class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):

        super().__init__(task, **params)

        self.estimator_class = XGBClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'use_label_encoder': params['use_label_encoder'] if 'use_label_encoder' in params else False,
            'eval_metric': params['eval_metric'] if 'eval_metric' in params else 'auc',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel':params['colsample_bylevel'],
            'n_estimators':int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints": params['monotone_constraints'] if 'monotone_constraints' in params else monotone,        
        }   

    @classmethod
    def search_space(cls, data_size, task):

        space = {        
            'max_depth': {'domain': tune.uniform(lower=4, upper=8), 'init_value': 8, 'low_cost_init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
            'min_child_weight': {'domain': tune.loguniform(lower = 0.00001, upper = 10), 'init_value': 0.001, 'low_cost_init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower = 0.7, upper = 1), 'init_value': 0.7, 'low_cost_init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower = 0.6, upper = 1), 'init_value': 0.8, 'low_cost_init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower = 0.00001, upper = 1), 'init_value': 0.01, 'low_cost_init_value': 0.01},
            'gamma': {'domain': tune.loguniform(lower = 0.000000000001, upper = 0.001), 'init_value': 0.00001, 'low_cost_init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower = 0.0001, upper = 2), 'init_value': 1, 'low_cost_init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower = 0.000000000001, upper = 1), 'init_value': 0.000000000001, 'low_cost_init_value': 0.000000000001},
        }
        return space

def _test_ray_classification():
    # from sklearn.datasets import make_classification

    # ray.init(address="auto")
    # X, y = make_classification(1000, 10)
    automl = AutoML()
    automl.add_learner("myxgb", MyMonotonicXGBGBTreeClassifier)
    automl.fit(
        train_x, train_y, X_val=test_x, y_val=test_y, time_budget=60, metric="accuracy", task="classification", n_concurrent_trials=2,
        estimator_list=["lgbm", "myxgb"], model_history=True
    )
    print("best loss", automl.best_loss)
    print(automl.best_model_for_estimator("lgbm"))
    print(automl.best_model_for_estimator("myxgb"))
    mlflow.sklearn.log_model(automl.best_model_for_estimator("myxgb"), "MyXGB")
    mlflow.sklearn.log_model(automl.best_model_for_estimator("lgbm"), "LGBM")

if __name__ == "__main__":
    ray_on_aml = Ray_On_AML()
    ray = ray_on_aml.getRay()
    if ray:
        _test_ray_classification()
flippercy commented 2 years ago

@sonichi:

Thank you very much for your help. With your code, I've figured out that the issue is caused by the following lines:

with mlflow.start_run() as run:
    .......
mlflow.end_run()

If I add them to your example as:

with mlflow.start_run() as run:
    automl.fit(
        train_x, train_y, X_val=test_x, y_val=test_y, time_budget=60, metric="accuracy", task="classification", n_concurrent_trials=2,
        estimator_list=["lgbm", "myxgb"], model_history=True
    )
mlflow.end_run()

I get the same error mentioned earlier.

However, without these two lines, I do not know how to pull the URIs of the models in AzureML and register them. Any suggestions?

sonichi commented 2 years ago

mlflow.sklearn.log_model is executed outside the with clause, right? I don't think you need the with clause to get the uri. I'm not an expert on AzureML; if you are using AzureML SDK v2, consider https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-mlflow-models?tabs=fromjob%2Cmir%2Csdk#deploy-using-azure-ml-clisdk-v2
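
To make this concrete, a minimal sketch (adapted from the reproduction script above; names are illustrative) that keeps automl.fit outside any explicit mlflow run and reads the artifact URI afterwards:

import mlflow

# no explicit mlflow.start_run() wrapped around the fit call
automl.fit(train_x, train_y, X_val=test_x, y_val=test_y, time_budget=60,
           metric="accuracy", task="classification", n_concurrent_trials=2,
           estimator_list=["lgbm", "myxgb"], model_history=True)

# log_model logs to the active run (or starts one if none is active)
mlflow.sklearn.log_model(automl.best_model_for_estimator("myxgb"), "MyXGB")

# the artifact URI of the current run can be read without a with clause
artifact_uri = mlflow.get_artifact_uri()
mlflow.register_model(model_uri=f"{artifact_uri}/MyXGB", name="flaml-distributed-MyXGB")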

flippercy commented 2 years ago

@sonichi My team managed to use other ways to pull the model URI after training and save the model. Problem solved.

Thank you very much for your help with the debugging! I've learned a lot.