Closed: flippercy closed this issue 2 years ago
In an AzureML job, you will find folders in the left panel of the run page; there is a folder named "outputs", and you can write the log file there. For example, you can run automl.fit(log_file_name="outputs/your_log_file.log").
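For illustration, a minimal sketch of this (the dataset, file name, and settings values below are assumptions, not from this thread):

import os
from flaml import AutoML
from sklearn.datasets import load_breast_cancer

# Hedged sketch: dataset and settings values are illustrative only.
X, y = load_breast_cancer(return_X_y=True)

os.makedirs("outputs", exist_ok=True)  # AzureML uploads the "outputs" folder with the job
settings = {
    "time_budget": 60,                     # seconds; example value
    "task": "classification",
    "log_file_name": "outputs/flaml.log",  # FLAML writes its search log here
}

automl = AutoML()
automl.fit(X_train=X, y_train=y, **settings)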
@sonichi agreed, that mlflow snippet should work. @flippercy let me know if you have any issues.
@prithvikannan, @sonichi, does the following code make sense?

import mlflow
from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())  # ws is the AzureML Workspace

automl.fit(
    X_train=data_dev_balanced_B_X, y_train=data_dev_balanced_B_y, sample_weight=data_dev_balanced_B_w,
    X_val=data_val_balanced_B_X, y_val=data_val_balanced_B_y, sample_weight_val=data_val_balanced_B_w,
    **settings
)

mlflow.sklearn.log_model(automl, "automl")
mlflow.sklearn.log_model(automl.best_model_for_estimator('MylightGBM'), "MylightGBM")
mlflow.sklearn.log_model(automl.best_model_for_estimator('MyXgboost'), "MyXgboost")

mlflow.register_model(model_uri=f"{run.info.artifact_uri}/MylightGBM", name='flaml-distributed-MylightGBM')
mlflow.register_model(model_uri=f"{run.info.artifact_uri}/MyXgboost", name='flaml-distributed-MyXgboost')
Thank you.
Yes, it looks good to me.
@sonichi, @prithvikannan:
Thank you very much for the reply. A few more questions:
"estimator_list": ['LearnerA', 'LearnerB'],
The automl still can finish with no issue; when logging and registering the models, there is always a weird error message:
Any idea on what it means? Not sure which parameter it refers to.
"message": "User program failed with Exception: UserError: Resource Conflict: ArtifactId ExperimentRun/dcid.distribute-automl-v2_1666366028_5f2255a6/automl/model.pkl already exists.",
Any idea on what causes this?
Appreciate your help.
Re: the error with param length, we have bounded params to be <255 bytes. My guess is that https://github.com/microsoft/FLAML/blob/58227a976bdbf91617a639831f1ed247bcc41538/flaml/automl.py#L3113 is the line that causes the issue, since this will log all of the automl params in a dict. @flippercy can you share the stack trace? @sonichi, since we are already logging params individually, can we remove this line?
Re: artifact logging, my guess is this is some sort of race condition. I am not familiar with Ray-FLAML interaction, so I'm not of much use here.
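As an illustration of the limit described above (a hedged sketch, not from the thread): any single MLflow param value longer than 255 characters is rejected by this backend, so a very long value has to be shortened or logged as an artifact instead.

import mlflow

# Hedged sketch: demonstrates the per-value length limit; names and values are made up.
long_value = "x" * 300  # anything over 255 characters trips the validation error above

with mlflow.start_run():
    # mlflow.log_param("automl_settings", long_value)      # would fail with INVALID_PARAMETER_VALUE here
    mlflow.log_param("automl_settings", long_value[:255])  # truncating keeps the value within the limit
    mlflow.log_text(long_value, "automl_settings.txt")     # or log the full value as an artifact instead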
@prithvikannan @sonichi
Here is the stack trace:
Warnings: AzureMLCompute job failed. JobFailed: Submitted script failed with a non-zero exit code; see the driver log file for details. Reason: Job failed with non-zero exit Code
ActivityFailedException                   Traceback (most recent call last)
Input In [9], in <cell line: 19>()
     16 run = exp.submit(config)
     18 print(run.get_portal_url())  # link to ml.azure.com
---> 19 run.wait_for_completion(show_output=True)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/run.py:843, in Run.wait_for_completion(self, show_output, wait_post_processing, raise_on_error)
    841 if show_output:
    842     try:
--> 843         self._stream_run_output(
    844             file_handle=sys.stdout,
    845             wait_post_processing=wait_post_processing,
    846             raise_on_error=raise_on_error)
    847         return self.get_details()
    848     except KeyboardInterrupt:

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/run.py:1096, in Run._stream_run_output(self, file_handle, wait_post_processing, raise_on_error)
   1094     file_handle.write("\n")
   1095 else:
-> 1096     raise ActivityFailedException(error_details=json.dumps(error, indent=4))
   1098 file_handle.write("\n")
   1099 file_handle.flush()
ActivityFailedException: ActivityFailedException: Message: Activity Failed: { "error": { "code": "UserError", "message": "User program failed with RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'No more than 255 characters per params Value. Request contains 1 of greater length.', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': '9f1081f11e8cf0a21df39f7f0a5b4b7a', 'request': '34711cbb5b6fc9b2'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2022-10-21T20:20:14.3227161+00:00', 'ComponentName': 'mlflow', 'error_code': 'INVALID_PARAMETER_VALUE'}", "messageParameters": {}, "detailsUri": "https://aka.ms/azureml-run-troubleshooting", "details": [] }, "time": "0001-01-01T00:00:00.000Z" } InnerException None ErrorResponse { "error": { "message": "Activity Failed:\n{\n \"error\": {\n \"code\": \"UserError\",\n \"message\": \"User program failed with RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'No more than 255 characters per params Value. Request contains 1 of greater length.', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': '9f1081f11e8cf0a21df39f7f0a5b4b7a', 'request': '34711cbb5b6fc9b2'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2022-10-21T20:20:14.3227161+00:00', 'ComponentName': 'mlflow', 'error_code': 'INVALID_PARAMETER_VALUE'}\",\n \"messageParameters\": {},\n \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n \"details\": []\n },\n \"time\": \"0001-01-01T00:00:00.000Z\"\n}" } }
> Re: the error with param length, we have bounded params to be <255 bytes. My guess is that https://github.com/microsoft/FLAML/blob/58227a976bdbf91617a639831f1ed247bcc41538/flaml/automl.py#L3113 is the line that causes the issue, since this will log all of the automl params in a dict.

I doubt that's the cause because (1) there is no error when one learner is used, and (2) the error happens after automl.fit(); it happens when logging the model.
> Re: artifact logging, my guess is this is some sort of race condition. I am not familiar with Ray-FLAML interaction, so I'm not of much use here.
Does this error happen only when ray is used? @flippercy which line causes the error?
@flippercy Another question: Does the logging and registering code work for non-flaml models?
@flippercy It would be very helpful if you can provide a minimal reproducible example.
@sonichi @ruizhuanguw @prithvikannan Thank you for the reply guys! I am off today and will give you some updates tomorrow.
@sonichi , @ruizhuanguw and @prithvikannan:
I did a few tests with both the default learners and customized learners; the results are confusing:

- If I use customized learners without Ray on a VM, there is no issue at all with the logging and registering code.
- If I use customized learners with Ray on a compute cluster, I get the aforementioned error and the models cannot be logged.
- If I use default learners with Ray on a compute cluster, I still get the aforementioned error, but the models can be logged and registered successfully.
- If I use non-flaml models such as lightgbm, there is no issue with the logging and registering code.

I can confirm that the problem happens when trying to log models using mlflow.sklearn.log_model. My customized learners all look like the one below:
class MyMonotonicXGBGBTreeClassifier(BaseEstimator):
    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'use_label_encoder': params['use_label_encoder'] if 'use_label_encoder' in params else False,
            'eval_metric': params['eval_metric'] if 'eval_metric' in params else 'auc',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints": params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=8), 'init_value': 8, 'low_cost_init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=500), 'init_value': 200, 'low_cost_init_value': 200},
            'min_child_weight': {'domain': tune.loguniform(lower=0.00001, upper=10), 'init_value': 0.001, 'low_cost_init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7, 'low_cost_init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8, 'low_cost_init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.00001, upper=1), 'init_value': 0.01, 'low_cost_init_value': 0.01},
            'gamma': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001, 'low_cost_init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.0001, upper=2), 'init_value': 1, 'low_cost_init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001, 'low_cost_init_value': 0.000000000001},
        }
        return space
automl.add_learner(learner_name = 'MyXgboost', learner_class = MyMonotonicXGBGBTreeClassifier)
Does **settings of AutoML.fit() contain model_history=True?
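(For reference, a **settings dict of the kind being asked about might look like the sketch below; all values are hypothetical.)

# Hedged sketch: hypothetical settings, roughly matching what the thread describes.
settings = {
    "time_budget": 3600,                            # example value, in seconds
    "metric": "roc_auc",                            # example value
    "task": "classification",
    "estimator_list": ["MylightGBM", "MyXgboost"],  # the custom learners registered via add_learner
    "n_concurrent_trials": 2,                       # example value for Ray-distributed search
    "model_history": True,                          # keep the best model per estimator for later logging
    "log_file_name": "outputs/flaml.log",
}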
@sonichi Yes, model_history = True is specified.

Since the behaviors of the default learners and the customized learners are different (although the same error is returned for both, models built with default learners can at least be registered), I suspect there is something wrong with my customized learners when used with Ray on compute clusters. Could you take a look at it, please?
Thank you.
I tried to reproduce the error but couldn't. Things work fine when I use your example custom learner. My code is:
from ray_on_aml.core import Ray_On_AML
from flaml import AutoML, tune
from flaml.model import BaseEstimator
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import mlflow
from xgboost import XGBClassifier
data, target = load_breast_cancer(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
num_cores = 1
randomseed = 42
monotone = None
class MyMonotonicXGBGBTreeClassifier(BaseEstimator):
    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'use_label_encoder': params['use_label_encoder'] if 'use_label_encoder' in params else False,
            'eval_metric': params['eval_metric'] if 'eval_metric' in params else 'auc',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            "monotone_constraints": params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=8), 'init_value': 8, 'low_cost_init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=500), 'init_value': 200, 'low_cost_init_value': 200},
            'min_child_weight': {'domain': tune.loguniform(lower=0.00001, upper=10), 'init_value': 0.001, 'low_cost_init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7, 'low_cost_init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8, 'low_cost_init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.00001, upper=1), 'init_value': 0.01, 'low_cost_init_value': 0.01},
            'gamma': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001, 'low_cost_init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.0001, upper=2), 'init_value': 1, 'low_cost_init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001, 'low_cost_init_value': 0.000000000001},
        }
        return space


def _test_ray_classification():
    # from sklearn.datasets import make_classification
    # ray.init(address="auto")
    # X, y = make_classification(1000, 10)
    automl = AutoML()
    automl.add_learner("myxgb", MyMonotonicXGBGBTreeClassifier)
    automl.fit(
        train_x, train_y, X_val=test_x, y_val=test_y, time_budget=60, metric="accuracy", task="classification", n_concurrent_trials=2,
        estimator_list=["lgbm", "myxgb"], model_history=True
    )
    print("best loss", automl.best_loss)
    print(automl.best_model_for_estimator("lgbm"))
    print(automl.best_model_for_estimator("myxgb"))
    mlflow.sklearn.log_model(automl.best_model_for_estimator("myxgb"), "MyXGB")
    mlflow.sklearn.log_model(automl.best_model_for_estimator("lgbm"), "LGBM")


if __name__ == "__main__":
    ray_on_aml = Ray_On_AML()
    ray = ray_on_aml.getRay()
    if ray:
        _test_ray_classification()
@sonichi:
Thank you very much for your help. With your code, I've figured out that the issue is caused by the following lines:

with mlflow.start_run() as run:
    .......
mlflow.end_run()

If I add them to your example as:

with mlflow.start_run() as run:
    automl.fit(
        train_x, train_y, X_val=test_x, y_val=test_y, time_budget=60, metric="accuracy", task="classification", n_concurrent_trials=2,
        estimator_list=["lgbm", "myxgb"], model_history=True
    )
mlflow.end_run()

I get the same error mentioned earlier.

However, without these two lines, I do not know how to pull the URIs of the models in AzureML and register them. Any suggestions?
mlflow.sklearn.log_model is executed outside of the with clause, right? I don't think you need the with clause to get the uri. I'm not an expert on AzureML. If you are using the AzureML SDK v2, consider https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-mlflow-models?tabs=fromjob%2Cmir%2Csdk#deploy-using-azure-ml-clisdk-v2
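For illustration, one possible pattern along these lines (a hedged sketch, not the exact solution used in this thread) is to start a run only around the logging step and register via a runs:/ URI:

import mlflow

# Hedged sketch: artifact paths and registered-model names follow the earlier snippets;
# automl.fit() has already completed outside of any mlflow run.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(automl.best_model_for_estimator("MylightGBM"), "MylightGBM")
    mlflow.sklearn.log_model(automl.best_model_for_estimator("MyXgboost"), "MyXgboost")

mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/MylightGBM", name="flaml-distributed-MylightGBM")
mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/MyXgboost", name="flaml-distributed-MyXgboost")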
@sonichi My team managed to use other ways to pull the model URI after training and save the model. Problem solved.
Thank you very much for your help with the debugging! I've learned a lot.
Hi @sonichi:
My team is running FLAML in a distributed fashion using Ray on compute clusters in AzureML. We have a few questions, since we've never used FLAML in this environment before, and hope you can provide some insights.

- With this setting, what is the best way to log and register the optimal model returned for each learner in AzureML using mlflow? Shall we simply do mlflow.sklearn.log_model(automl.best_model_for_estimator('LearnerA'), "BestModelLearnerA") and then mlflow.register_model(model_uri=f"{run.info.artifact_uri}/LearnerA", name='flaml-LearnerA')?
- Where is the log file, and how do we change its directory?
Thank you!