[Tune][Air][RLlib] MLflowLoggerCallback passes nested dicts down to MLflow, which can raise an error for large configs

bruno-hoermann commented 1 year ago

What happened + What you expected to happen

When using RLlib together with Tune and the MLflowLoggerCallback, MLflow raises an error if the config passed to Tune contains a large env_config dict.

Why? Sub-dicts of the Tune config are converted to strings by the MLflow backend, so the entire contents of env_config become one very long string, e.g.: "{'env_param1': 1, 'env_param2': 'something', [...]}". MLflow has (as I understand it) a hard-coded maximum length of 500 characters for logged parameter values, which is easily exceeded once env_config holds more than a few parameters.
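To make the length problem concrete, here is a minimal sketch that needs neither Ray nor MLflow. The parameter names and the count of 60 are made up for illustration, and the 500-character limit is the value reported above for mlflow==1.30.0, not something this snippet verifies.

```python
# Hypothetical env_config with many parameters (names and count are made up).
env_config = {f"env_param{i}": i for i in range(60)}

# Assumption: a nested dict ends up as ONE logged parameter whose value is
# the dict's string representation, as described in this report.
value = str(env_config)

# Limit as reported in this issue for mlflow==1.30.0 (not checked here).
MAX_PARAM_LENGTH = 500

print(f"stringified env_config has {len(value)} characters")
print("exceeds reported limit:", len(value) > MAX_PARAM_LENGTH)
```

Even with short keys and integer values, a few dozen entries are enough to exceed the reported limit.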

I propose to recursively flatten the config and all sub-dicts before passing it to mlflow, so that every key in sub-dicts becomes its own mlflow parameter. This could be done in ray.air.callbacks.mlflow:MLflowLoggerCallback.log_trial_start().
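A rough sketch of the flattening step I have in mind. The helper name flatten_config and the "/" separator are my own choices for illustration, not existing Ray or MLflow API, and how it would be wired into log_trial_start() is left open.

```python
from typing import Any, Dict


def flatten_config(
    config: Dict[str, Any], parent_key: str = "", sep: str = "/"
) -> Dict[str, Any]:
    """Recursively flatten nested dicts into a single-level dict.

    Hypothetical helper: keys of sub-dicts are joined to their parent key
    with `sep`, so every leaf value becomes its own entry.
    """
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            # Descend into non-empty sub-dicts.
            flat.update(flatten_config(value, full_key, sep))
        else:
            # Leaf value (or empty dict): keep as-is under the joined key.
            flat[full_key] = value
    return flat


# Made-up example config mirroring the structure from this report.
config = {
    "env": "my_env",
    "env_config": {"env_param1": 1, "nested": {"env_param2": "something"}},
}
flat = flatten_config(config)
for name, val in flat.items():
    print(name, "=", val)
```

Each entry of the flattened dict could then be logged as a separate MLflow parameter. Note that a single leaf value (for example a long list) could still be too long on its own, so flattening reduces the problem but may not remove it entirely.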

Versions / Dependencies

Python 3.10.0, Ubuntu 18.04

Likely relevant packages: mlflow==1.30.0, ray==2.1.0

others: absl-py 1.3.0 aiohttp 3.8.3 aiohttp-cors 0.7.0 aiosignal 1.3.1 ale-py 0.7.5 alembic 1.8.1 astroid 2.12.13 async-timeout 4.0.2 attrs 22.1.0 autopep8 1.6.0 AutoROM 0.4.2 AutoROM.accept-rom-license 0.4.2 black 22.12.0 blessed 1.19.1 CacheControl 0.12.11 cachetools 5.2.0 certifi 2022.12.7 cffi 1.15.1 charset-normalizer 2.1.1 cleo 2.0.1 click 8.0.4 cloudpickle 2.2.0 cmake 3.25.2 colorful 0.5.5 contourpy 1.0.6 crashtest 0.4.1 cryptography 39.0.1 cycler 0.11.0 Cython 0.29.32 databricks-cli 0.17.4 dill 0.3.6 distlib 0.3.6 dm-tree 0.1.7 docker 6.0.1 dulwich 0.20.50 entrypoints 0.4 fasteners 0.18 filelock 3.8.2 Flask 2.2.2 fonttools 4.38.0 frozenlist 1.3.3 gitdb 4.0.10 GitPython 3.1.29 glfw 2.5.5 google-api-core 2.11.0 google-auth 2.15.0 google-auth-oauthlib 0.4.6 googleapis-common-protos 1.57.0 gpustat 1.0.0 greenlet 2.0.1 grpcio 1.51.1 gunicorn 20.1.0 gym 0.25.2 gym-notices 0.0.8 html5lib 1.1 idna 3.4 imageio 2.22.4 importlib-metadata 5.1.0 importlib-resources 5.10.1 isort 5.11.1 itsdangerous 2.1.2 jaraco.classes 3.2.3 jeepney 0.8.0 Jinja2 3.1.2 jsonschema 4.17.3 keyring 23.13.1 kiwisolver 1.4.4 lazy-object-proxy 1.8.0 libtorrent 2.0.7 lockfile 0.12.2 lz4 4.0.2 Mako 1.2.4 Markdown 3.4.1 MarkupSafe 2.1.1 matplotlib 3.6.2 mccabe 0.7.0 mlflow 1.30.0 more-itertools 9.0.0 msgpack 1.0.4 mujoco 2.3.1.post1 mujoco-py 2.1.2.14 multidict 6.0.3 mypy 0.982 mypy-extensions 0.4.3 numpy 1.23.5 nvidia-ml-py 11.495.46 oauthlib 3.2.2 opencensus 0.11.0 opencensus-context 0.1.3 opencv-python 4.6.0.66 packaging 21.3 pandas 1.5.2 pathspec 0.10.3 pexpect 4.8.0 Pillow 9.3.0 pip 22.3.1 pkginfo 1.9.6 platformdirs 2.6.0 poetry 1.3.2 poetry-core 1.4.0 poetry-plugin-export 1.3.0 prometheus-client 0.13.1 prometheus-flask-exporter 0.21.0 protobuf 3.20.3 psutil 5.9.4 ptyprocess 0.7.0 py-spy 0.3.14 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybind11 2.10.3 pycodestyle 2.10.0 pycparser 2.21 pydantic 1.10.2 pygame 2.1.2 PyJWT 2.6.0 pylint 2.15.8 PyOpenGL 3.1.6 pyparsing 3.0.9 pyrsistent 0.19.2 
python-dateutil 2.8.2 pytz 2022.6 PyYAML 6.0 querystring-parser 1.2.4 rapidfuzz 2.13.7 ray 2.1.0 requests 2.28.1 requests-oauthlib 1.3.1 requests-toolbelt 0.10.1 rsa 4.9 scipy 1.9.3 SecretStorage 3.3.3 setuptools 65.6.3 shellingham 1.5.0.post1 six 1.16.0 smart-open 6.3.0 smmap 5.0.0 SQLAlchemy 1.4.45 sqlparse 0.4.3 swig 4.1.1 tabulate 0.9.0 tensorboard 2.11.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.11.6 torch 1.12.1 tqdm 4.64.1 trove-classifiers 2023.2.8 typing_extensions 4.4.0 urllib3 1.26.13 virtualenv 20.17.1 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.4.2 wheel 0.38.4 wrapt 1.14.1 yarl 1.8.2 zipp 3.11.0

Reproduction script


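import ray.tune
from ray.air.callbacks.mlflow import MLflowLoggerCallback

config = {
    # env_name: name of the registered environment (not given in this report)
    "env": env_name,
    # [...] stands for many more parameters in the real config
    "env_config": {"one_of_many_parameters": 1},
}

ray.tune.run(
    "PPO",  # assumption: any RLlib algorithm; the original snippet omits the trainable
    config=config,
    callbacks=[
        MLflowLoggerCallback(
            experiment_name="name",
            save_artifact=True,
        )
    ],
)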

Issue Severity

Medium: It is a significant difficulty but I can work around it.

gjoliver commented 1 year ago

Thanks for the detailed bug report. This is very interesting. We should handle integrations with these experimentation frameworks better.

richardliaw commented 1 year ago

@bruno-hoermann thanks a bunch. We will try to address this in 2.5.
