
[AIR] mlflow_setup() raises error on Databricks cluster #35431

Open JasperHG90 opened 1 year ago

JasperHG90 commented 1 year ago

What happened + What you expected to happen

I’m using Ray on a Databricks cluster and trying to run Tune against the built-in MLflow tracking server, following the instructions here (using the `setup_mlflow` approach).

I expect Ray to be able to connect to the MLflow tracking server. Instead, I get the following error:

raise InvalidConfigurationError.for_profile(None)
databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! Please configure by entering `/local_disk0/.ephemeral_nfs/envs/pythonEnv-c299d8bb-1817-4e6c-9b78-edc9ceb94e01/lib/python3.9/site-packages/ray/_private/workers/default_worker.py configure 

I’ve also tried passing `tracking_token` as an input, but to no avail.
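
Roughly, that attempt looked like the sketch below (how the token reaches the worker, here via an environment variable, is an assumption):

import os

from ray.air.integrations.mlflow import setup_mlflow

def objective(config: dict):
    # Sketch only: setup_mlflow also accepts tracking_uri / tracking_token
    # directly, in addition to reading the "mlflow" sub-dict of the Tune config.
    mlflow = setup_mlflow(
        config,
        tracking_token=os.environ.get("DATABRICKS_TOKEN"),  # assumed to be set on the worker
    )
    mlflow.log_metric("dummy", 0.0)
    return {"dummy": 0.0}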

The callback-based approach works, but it isn’t satisfactory because I cannot control what is logged.
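
For reference, the callback-based workaround mentioned above looks roughly like this (a sketch; the tracking URI value is an assumption):

from ray.air import RunConfig
from ray.air.integrations.mlflow import MLflowLoggerCallback

# Sketch of the callback approach: Ray logs reported trial results to MLflow
# automatically, but what gets logged cannot be controlled from inside the trainable.
run_config = RunConfig(
    name="Regressor",
    callbacks=[
        MLflowLoggerCallback(
            tracking_uri="databricks",  # assumption: Databricks-hosted tracking server
            experiment_name="/Users/EMAIL/regressor",
            save_artifact=True,
        )
    ],
)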

Versions / Dependencies

Databricks: DBR 2.12 LTS ML (Spark 3.3.2, Scala 2.12)
Ray: 2.4.0
Python: 3.9
MLflow: 2.1.1

Reproduction script

I can’t provide a fully self-contained reproduction right now (I’ll prepare one when I’m back from holiday), but this is the relevant code:

# Imports added for completeness; _create_pipeline, df_X, df_y, NUM_FEATURES,
# CAT_FEATURES and TARGET are defined elsewhere in my code.
import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

from ray import tune
from ray.air import RunConfig
from ray.air.integrations.mlflow import setup_mlflow
from ray.tune.search.hyperopt import HyperOptSearch


def objective(config: dict, X: pd.DataFrame, y: pd.DataFrame):
    # Connect this trial to the MLflow tracking server; the "mlflow" sub-dict
    # of the config is consumed here.
    mlflow = setup_mlflow(config)
    mlflow.sklearn.autolog(log_models=True)
    # Drop the "mlflow" key before passing the remaining hyperparameters
    # to the estimator.
    model_params = {k: v for k, v in config.items() if k != "mlflow"}
    model = HistGradientBoostingRegressor(**model_params)
    pipeline = _create_pipeline(model)  # user-defined preprocessing + model pipeline
    cross_validated = cross_val_score(
        pipeline, X, y, scoring="neg_mean_squared_error",
        cv=10
    )
    return {"rmse": np.mean(np.sqrt(cross_validated * -1))}


search_space = {
    "learning_rate": tune.loguniform(1e-4, 0.3),
    "max_depth": tune.choice([3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 18, 20]),
    "l2_regularization": tune.uniform(0, 1),
    "warm_start": tune.choice([True, False]),
    "max_iter": tune.choice([100, 200, 300, 400]),
    # Passed through to setup_mlflow() inside the trainable.
    "mlflow": {
        "experiment_name": "/Users/EMAIL/regressor",
        "tracking_uri": mlflow.get_tracking_uri(),
    },
}

algo = HyperOptSearch(
    metric="rmse",
    mode="min",
)

tuner = tune.Tuner(
    trainable=tune.with_parameters(
        objective,
        X=df_X.loc[:, NUM_FEATURES + CAT_FEATURES],
        y=df_y.loc[:, [TARGET]],
    ),
    param_space=search_space,
    tune_config=tune.TuneConfig(
        num_samples=1,
        search_alg=algo,  # metric/mode are already set on the searcher
    ),
    run_config=RunConfig(
        name="Regressor",
    ),
)
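
The search is then started with a plain fit call on the driver; the error above is raised from inside the trials:

results = tuner.fit()
best = results.get_best_result(metric="rmse", mode="min")
print(best.config)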

Issue Severity

Medium: It is a significant difficulty but I can work around it.

zshuyinggg commented 2 months ago

I am having the same issue. Can you share your workaround please?

zshuyinggg commented 2 months ago

I fixed this by following the "Set up credentials" section of this page: https://docs.ray.io/en/latest/train/user-guides/experiment-tracking.html#set-up-credentials
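
For anyone else hitting this: the underlying problem is that the Ray worker processes do not inherit the Databricks credentials available in the driver notebook, so `databricks_cli` cannot authenticate there. One common way to propagate them (a sketch; host and token values are placeholders, and the exact mechanism the docs page recommends may differ) is via environment variables in the runtime environment:

import ray

# Sketch: make Databricks credentials visible to every Ray worker so that
# mlflow / databricks_cli can authenticate outside the driver notebook.
ray.init(
    runtime_env={
        "env_vars": {
            "DATABRICKS_HOST": "https://<your-workspace>.cloud.databricks.com",  # placeholder
            "DATABRICKS_TOKEN": "<personal-access-token>",  # placeholder
            "MLFLOW_TRACKING_URI": "databricks",
        }
    }
)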