
[air] MLFlowLoggerCallback Does not include Artifacts #46569

Open dvanbrug opened 1 month ago

dvanbrug commented 1 month ago

What happened + What you expected to happen

When using the MLflowLoggerCallback with save_artifact=True, I do not see artifacts logged to the MLflow server.

From some initial investigation, I believe this is related to the introduction of the temporary local staging directory (https://docs.ray.io/en/releases-2.10.0/train/user-guides/persistent-storage.html#setting-the-local-staging-directory).

While checkpoints are saved directly to storage_path, other files are first written to this temporary directory. MLflowLoggerCallback appears to use the local_path property, which points to this temporary directory as well:

https://github.com/ray-project/ray/blob/bb1759aaa745a37bc4a5053ce12d2115206ccd11/python/ray/tune/experiment/trial.py#L548-L550

As a result, the checkpoints are never uploaded as artifacts to the MLflow server.

Instead, I think MLflowLoggerCallback should use the path property rather than local_path so that it has access to the persisted checkpoints.
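
For illustration, a minimal sketch of what that change could look like in MLflowLoggerCallback.log_trial_end. The internal names here (self._trial_runs, self.mlflow_util, save_artifacts, end_run) are assumptions based on how the integration appears to be structured, not a tested patch:

def log_trial_end(self, trial, failed: bool = False):
    run_id = self._trial_runs[trial]

    if self.save_artifact:
        # Upload from trial.path (resolved under storage_path) rather than
        # trial.local_path (the temporary staging directory), so the
        # persisted checkpoints are included in the uploaded artifacts.
        self.mlflow_util.save_artifacts(dir=trial.path, run_id=run_id)

    status = "FINISHED" if not failed else "FAILED"
    self.mlflow_util.end_run(run_id=run_id, status=status)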

Versions / Dependencies

This works in Ray 2.9.2 and appears to stop working in 2.10, which is when the local staging directory was introduced.

Reproduction script

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.air.integrations.mlflow import MLflowLoggerCallback

config = PPOConfig().environment("CartPole-v1").training(train_batch_size=4000)

mlflow_experiment_name = "PPO_CartPole"

analysis = tune.run(
    "PPO",
    config=config,
    stop={
        "training_iteration": 2,
    },
    checkpoint_freq=1,
    checkpoint_at_end=True,
    callbacks=[
        MLflowLoggerCallback(
            experiment_name=mlflow_experiment_name,
            tracking_uri="http://localhost:5001",
            save_artifact=True
        )
    ]
)

Issue Severity

Medium: It is a significant difficulty but I can work around it.
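
For reference, my current workaround is to upload the persisted trial directories manually once tuning finishes, using the standard MLflow client API. A minimal sketch, assuming storage_path is a local filesystem path readable from the driver; note that this creates new MLflow runs rather than attaching the artifacts to the runs the callback created:

import mlflow

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment(mlflow_experiment_name)

for trial in analysis.trials:
    with mlflow.start_run(run_name=str(trial)):
        # trial.path points at the persisted trial directory under storage_path,
        # which contains the checkpoints; log_artifacts uploads it recursively.
        mlflow.log_artifacts(trial.path)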

anyscalesam commented 1 month ago

TY for contributing here @dvanbrug

anyscalesam commented 1 month ago

@amogkam @architkulkarni think one of you would be the most familiar with the stack here > can you take a look?

amogkam commented 1 month ago

I think @justinvyu would be most familiar with the more recent storage path logic