mlflow / mlflow-export-import

Apache License 2.0
132 stars 78 forks source link

import-run fails when there are duplicate metric values with SQL backend store #48

Closed garymm closed 1 year ago

garymm commented 1 year ago

Running import-run with the attached directory succeeds with MLFLOW_TRACKING_URI set to some local directory, but fails when it's set to a tracking server that is backed by SQL.

I get this error:

Options:
  input_dir: /tmp/mlflow-export-9198135f4a4c40ccb76a8c2ae8c61d8a
  experiment_name: instinct
  mlmodel_fix: True
  use_src_user_id: False
  dst_notebook_dir: None
  dst_notebook_dir_add_run_id: None
in_databricks: False
importing_into_databricks: False
Importing run from '/tmp/mlflow-export-9198135f4a4c40ccb76a8c2ae8c61d8a'
Traceback (most recent call last):
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/import_run.py", line 70, in _import_run
    self._import_run_data(src_run_dct, run_id, src_run_dct["info"]["user_id"])
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/import_run.py", line 105, in _import_run_data
    run_data_importer.log_metrics(self.mlflow_client, run_dct, run_id, MAX_METRICS_PER_BATCH)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 38, in log_metrics
    _log_data(run_dct, run_id, batch_size, get_data, log_data)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 19, in _log_data
    log_data(run_id, batch)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 37, in log_data
    client.log_batch(run_id, metrics=metrics)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/tracking/client.py", line 1099, in log_batch
    self._tracking_client.log_batch(run_id, metrics, params, tags)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/tracking/_tracking_service/client.py", line 415, in log_batch
    self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/store/tracking/rest_store.py", line 341, in log_batch
    self._call_endpoint(LogBatch, req_body)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/store/tracking/rest_store.py", line 57, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/utils/rest_utils.py", line 280, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/utils/rest_utils.py", line 206, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: BAD_REQUEST: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(pymysql.err.IntegrityError) (1062, "Duplicate entry 'inside_3-16686312949540-0-557f04b0a45c4486b067a7e4205bb1b9-0-1' for key 'metrics.PRIMARY'")
[SQL: INSERT INTO metrics (`key`, value, timestamp, step, is_nan, run_uuid) VALUES (%(key)s, %(value)s, %(timestamp)s, %(step)s, %(is_nan)s, %(run_uuid)s)]
[parameters: ({'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994150, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994160, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994170, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994180, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994190, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994200, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0166667, 'timestamp': 16686312994200, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.016, 'timestamp': 16686312994210, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}  ... displaying 10 of 891 total bound parameter sets ...  {'key': 'inside_3', 'value': 0, 'timestamp': 16686312949680, 'step': 0, 'is_nan': 1, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_3', 'value': 0, 'timestamp': 16686312949690, 'step': 0, 'is_nan': 1, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'})]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
Traceback (most recent call last):
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/import_run.py", line 70, in _import_run
    self._import_run_data(src_run_dct, run_id, src_run_dct["info"]["user_id"])
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/import_run.py", line 105, in _import_run_data
    run_data_importer.log_metrics(self.mlflow_client, run_dct, run_id, MAX_METRICS_PER_BATCH)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 38, in log_metrics
    _log_data(run_dct, run_id, batch_size, get_data, log_data)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 19, in _log_data
    log_data(run_id, batch)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow_export_import/site-packages/mlflow_export_import/run/run_data_importer.py", line 37, in log_data
    client.log_batch(run_id, metrics=metrics)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/tracking/client.py", line 1099, in log_batch
    self._tracking_client.log_batch(run_id, metrics, params, tags)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/tracking/_tracking_service/client.py", line 415, in log_batch
    self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/store/tracking/rest_store.py", line 341, in log_batch
    self._call_endpoint(LogBatch, req_body)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/store/tracking/rest_store.py", line 57, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/utils/rest_utils.py", line 280, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/Users/garymm/src/Astera-org/obelisk/bazel-bin/external/pip_mlflow_export_import/rules_python_wheel_entry_point_import-run.runfiles/pip_mlflow/site-packages/mlflow/utils/rest_utils.py", line 206, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: BAD_REQUEST: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(pymysql.err.IntegrityError) (1062, "Duplicate entry 'inside_3-16686312949540-0-557f04b0a45c4486b067a7e4205bb1b9-0-1' for key 'metrics.PRIMARY'")
[SQL: INSERT INTO metrics (`key`, value, timestamp, step, is_nan, run_uuid) VALUES (%(key)s, %(value)s, %(timestamp)s, %(step)s, %(is_nan)s, %(run_uuid)s)]
[parameters: ({'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994150, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994160, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994170, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994180, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994190, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0, 'timestamp': 16686312994200, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.0166667, 'timestamp': 16686312994200, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_2', 'value': 0.016, 'timestamp': 16686312994210, 'step': 0, 'is_nan': 0, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}  ... displaying 10 of 891 total bound parameter sets ...  {'key': 'inside_3', 'value': 0, 'timestamp': 16686312949680, 'step': 0, 'is_nan': 1, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'}, {'key': 'inside_3', 'value': 0, 'timestamp': 16686312949690, 'step': 0, 'is_nan': 1, 'run_uuid': '557f04b0a45c4486b067a7e4205bb1b9'})]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

mlflow-export-9198135f4a4c40ccb76a8c2ae8c61d8a.tar.gz

garymm commented 1 year ago

I guess this is an issue with MLFlow behaving differently with different backend stores and not the import. In theory the import could try to work around this by de-duplicating metric values when exporting and / or importing. I'll try to do that on my own before exporting.

amesar commented 1 year ago

I have managed to reproduce the error. Most probably an underlying MLflow issue since the export-import tool simply calls public APIs. Can you send the source code that creates the run?

Here's the MLflow tracking server schema: https://github.com/amesar/mlflow-resources/blob/master/MLflow_FAQ.md#mlflow-database-schema-mysql

amesar commented 1 year ago
CREATE TABLE `metrics` (
  `key` varchar(250) NOT NULL,
  `value` double NOT NULL,
  `timestamp` bigint(20) NOT NULL,
  `run_uuid` varchar(32) NOT NULL,
  `step` bigint(20) NOT NULL DEFAULT '0',
  `is_nan` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`key`,`timestamp`,`step`,`run_uuid`,`value`,`is_nan`),
  KEY `index_metrics_run_uuid` (`run_uuid`),
  CONSTRAINT `metrics_ibfk_1` FOREIGN KEY (`run_uuid`) REFERENCES `runs` (`run_uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
amesar commented 1 year ago

The primary key is PRIMARY KEY (`key`,`timestamp`,`step`,`run_uuid`,`value`,`is_nan`).

There is a primary key constraint violation. Not sure how that could happen since the tool simply reads in the previous values that were correctly stored into the previous database.

BTW, before you bomb:

select count(*) from metrics;
+----------+
| count(*) |
+----------+
|    70074 |
+----------+
amesar commented 1 year ago

So I dropped the primary key on the metrics table in the "imported database" and loaded all your metrics. There are 4630 instances of clashing primary keys (see attachment). How this got into the original database is weird. If you want to debug it, I'd suggest you set up a scratch database and drop the primary key with alter table metrics DROP PRIMARY KEY;. Then train your model and execute this query to see if there any duplicates:

SELECT `key`, `timestamp`, `step`, `run_uuid`, `value`, `is_nan`, count(*)
FROM metrics
GROUP BY `key`, `timestamp`, `step`, `run_uuid`, `value`, `is_nan`
HAVING COUNT(*) > 1

query_mlflow_garymm.log

amesar commented 1 year ago

OK, I found a problem. The timestamps in your imported metrics table are not legitimate "milliseconds since the Unix epoch". There seems to be extra zero at the end. For example, 16686312949530 instead of legitimate 1668631294953. If I convert the timestamp to datetime I get: Tuesday, October 7, 2498 10:55:49.570 AM [GMT-04:00]. Without the trailing 0 I get Wednesday, November 16, 2022 3:41:34.957 PM [GMT-05:00]. How that happened beats me. I would check the original timestamps in your source database: select timestamp from metrics limit 10.

garymm commented 1 year ago

Thanks for looking into this! The source is not a database, it's a file store. The file store doesn't have any constraint on duplicate entries. I fixed the bug that was causing the invalid timestamps but even then the database rejects duplicate entries. I think the root cause here is different behavior between the file store and SQL store, which is an MLFlow issue, not an import-export issue, but the import could in theory sanitize the inputs by removing duplicates. Anyways I've changed my code to do this sanitization, so I'm not blocked. You can feel free to keep this open if you want to do the sanitization, or close it if you don't. Thanks again!

amesar commented 1 year ago

I would never use the file store - it certainly isn't meant to be production quality. Besides, it doesn't even support the model registry. I never understood why MLflow has such a weird creature. The simplest and safe way is to just use en embedded sqlite database.