ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

MLflow integration fails to log metadata #3045

Open dragosmc opened 1 year ago

dragosmc commented 1 year ago

Describe the bug

I'm running ludwig 0.6.4 and mlflow 2.1.1 and I get a warning about ludwig being unable to log metadata due to some mlflow limitation.

To Reproduce

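from ludwig.api import LudwigModel
from ludwig.contribs import MlflowCallback

# MLFLOW_URL (tracking server URI) and data (training dataset) are defined elsewhere.
ludwig_config = {}
model = LudwigModel(config=ludwig_config, callbacks=[MlflowCallback(tracking_uri=MLFLOW_URL)])
model.train(data)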

Expected behavior

Metadata to be logged successfully into MLflow.

Screenshots

mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Tag value '[{"run_id": "1ec3b4d585774bc683804a805da0fa82", "artifact_path": "model", "utc_time_created": "2023-02-03 15:15:43.143908", "flavors": {"python_function": {"env": "conda.yaml", "loader_module": "ludwig.contribs.mlflow.model", "python_version": "3.9.1' had length 6364, which exceeded length limit of 5000
2023/02/03 15:19:16 WARNING mlflow.models.model: Logging model metadata to the tracking server has failed, possibly due older server version. The model artifacts have been logged successfully under production-mlflow-artifacts/7/1ec3b4d585774bc683804a805da0fa82/artifacts. In addition to exporting model artifacts, MLflow clients 1.7.0 and above attempt to record model metadata to the tracking store. If logging to a mlflow server via REST, consider upgrading the server version to MLflow 1.7.0 or above. Set logging level to DEBUG via `logging.getLogger("mlflow").setLevel(logging.DEBUG)` to see the full traceback.
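
The warning above suggests enabling DEBUG logging for the `mlflow` logger to see the full traceback. A minimal sketch using only the standard `logging` module, to be run before training; the `basicConfig` call is an assumption about how log output is displayed in your environment:

```python
import logging

# Make sure log records are actually printed (assumes no handler is configured yet).
logging.basicConfig(level=logging.INFO)

# Raise the verbosity of MLflow's logger, as the warning message recommends.
mlflow_logger = logging.getLogger("mlflow")
mlflow_logger.setLevel(logging.DEBUG)
```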

and traceback

Traceback (most recent call last):
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/models/model.py", line 489, in log
    mlflow.tracking.fluent._record_logged_model(mlflow_model)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 985, in _record_logged_model
    MlflowClient()._record_logged_model(run_id, mlflow_model)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/tracking/client.py", line 1370, in _record_logged_model
    self._tracking_client._record_logged_model(run_id, mlflow_model)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 404, in _record_logged_model
    self.store.record_logged_model(run_id, mlflow_model)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 325, in record_logged_model
    self._call_endpoint(LogModel, req_body)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 281, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/Users/dragos/opt/anaconda3/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 207, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Tag value '[{"run_id": "1ec3b4d585774bc683804a805da0fa82", "artifact_path": "model", "utc_time_created": "2023-02-03 15:15:43.143908", "flavors": {"python_function": {"env": "conda.yaml", "loader_module": "ludwig.contribs.mlflow.model", "python_version": "3.9.1' had length 6364, which exceeded length limit of 5000
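
The exception reports a serialized value of 6364 characters against a limit of 5000. A minimal sketch of a pre-flight length check, assuming the metadata is serialized with `json.dumps` and that the 5000-character limit from the error message applies to the serialized string. `TAG_VALUE_LIMIT`, `exceeds_tag_limit`, and the sample dictionary are hypothetical names and data, not part of Ludwig or MLflow:

```python
import json

# Limit taken from the error message above; assumed to apply to the serialized value.
TAG_VALUE_LIMIT = 5000


def exceeds_tag_limit(metadata, limit=TAG_VALUE_LIMIT):
    """Return (too_long, length) for the JSON-serialized metadata."""
    serialized = json.dumps(metadata)
    return len(serialized) > limit, len(serialized)


# Hypothetical metadata shaped like the truncated value in the error message.
sample = [{
    "run_id": "1ec3b4d585774bc683804a805da0fa82",
    "artifact_path": "model",
    "flavors": {"python_function": {"env": "conda.yaml"}},
}]

too_long, length = exceeds_tag_limit(sample)
print(f"serialized length={length}, exceeds limit={too_long}")
```

A check like this only reports the problem; it does not shrink the metadata or change what the server accepts.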


Additional context

The related issue, which won't be fixed on MLflow's side: https://github.com/mlflow/mlflow/issues/2892

I can give a hand to fixing this also.

arnavgarg1 commented 1 year ago

Hi @dragosmc, thanks for flagging this! This is actually a known issue that we've seen on our side as well, and we're going to work on fixing it in the future.

For now, to unblock yourself, are you able to downgrade MLflow to 1.30.0? This should work; let me know if it does.

arnavgarg1 commented 1 year ago

> I can give a hand to fixing this also.

@dragosmc If you want to take a stab at fixing this in Ludwig, that would be amazing

dragosmc commented 1 year ago

> Hi @dragosmc, thanks for flagging this! This is actually a known issue that we've seen on our side as well, and we're going to work on fixing it in the future.
>
> For now, to unblock yourself, are you able to downgrade MLflow to 1.30.0? This should work - let me know if it does

Unfortunately I can't downgrade to 1.30, but I'd be happy to help fix this. I'll have a look through the code and come up with a PR fairly soon.

Thanks.

tgaddair commented 1 year ago

Thanks @dragosmc! All the relevant code should be contained in https://github.com/ludwig-ai/ludwig/blob/master/ludwig/contribs/mlflow/__init__.py#L38

dragosmc commented 1 year ago

I had a go at this, and after some digging I believe the problem lies with MLflow. From what I can see, Ludwig calls Model.log(), which then splits and handles the data as it sees fit.

Moreover, the error message is misleading, since the exception is raised during the /2.0/mlflow/runs/log-model call and not specifically while creating a tag.

I will have to dig a bit more into MLflow itself to understand where exactly the JSON payload gets split into tags vs. non-tags when logging it, but as it stands now I couldn't get this to work with 2.1.1 or 1.30.0.
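
One way a log-model call could surface a tag error: the value quoted in the exception is a JSON list of model metadata, which would be consistent with the server keeping a run's logged-model history in a single tag and validating that tag after appending to it. The sketch below simulates that behavior with an in-memory dictionary. It is an assumption about the server side, not MLflow's actual code, and `record_logged_model`, `HISTORY_TAG`, and `TagTooLongError` are hypothetical names:

```python
import json

TAG_VALUE_LIMIT = 5000  # limit reported in the error message
HISTORY_TAG = "logged-model-history"  # hypothetical tag key


class TagTooLongError(ValueError):
    """Hypothetical stand-in for the INVALID_PARAMETER_VALUE response."""


def record_logged_model(run_tags, model_metadata):
    """Append model metadata to the run's history tag, then validate its length."""
    history = json.loads(run_tags.get(HISTORY_TAG, "[]"))
    history.append(model_metadata)
    value = json.dumps(history)
    if len(value) > TAG_VALUE_LIMIT:
        raise TagTooLongError(
            f"Tag value had length {len(value)}, "
            f"which exceeded length limit of {TAG_VALUE_LIMIT}"
        )
    run_tags[HISTORY_TAG] = value


run_tags = {}
# A small metadata entry fits under the limit and is stored.
record_logged_model(run_tags, {"artifact_path": "model", "flavors": {}})

try:
    # A large metadata entry pushes the serialized tag over the limit.
    record_logged_model(run_tags, {"artifact_path": "model", "flavors": {"cfg": "x" * 6000}})
except TagTooLongError as err:
    print(err)
```

If the server behaves roughly like this, the client never creates a tag explicitly, yet the failure is still reported as a tag-length violation, which would match the misleading message described above.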