Closed by garymm 1 year ago
I guess this is an issue with MLflow behaving differently with different backend stores, not with the import tool. In theory the import could work around this by de-duplicating metric values when exporting and/or importing. I'll try to do that on my own before exporting.
I have managed to reproduce the error. Most probably an underlying MLflow issue since the export-import tool simply calls public APIs. Can you send the source code that creates the run?
Here's the MLflow tracking server schema: https://github.com/amesar/mlflow-resources/blob/master/MLflow_FAQ.md#mlflow-database-schema-mysql
CREATE TABLE `metrics` (
`key` varchar(250) NOT NULL,
`value` double NOT NULL,
`timestamp` bigint(20) NOT NULL,
`run_uuid` varchar(32) NOT NULL,
`step` bigint(20) NOT NULL DEFAULT '0',
`is_nan` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`key`,`timestamp`,`step`,`run_uuid`,`value`,`is_nan`),
KEY `index_metrics_run_uuid` (`run_uuid`),
CONSTRAINT `metrics_ibfk_1` FOREIGN KEY (`run_uuid`) REFERENCES `runs` (`run_uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The primary key is PRIMARY KEY (`key`,`timestamp`,`step`,`run_uuid`,`value`,`is_nan`).
There is a primary key constraint violation. Not sure how that could happen since the tool simply reads in the previous values that were correctly stored into the previous database.
BTW, before you bomb:
select count(*) from metrics;
+----------+
| count(*) |
+----------+
| 70074 |
+----------+
So I dropped the primary key on the metrics table in the "imported database" and loaded all your metrics. There are 4630 instances of clashing primary keys (see attachment). How this got into the original database is weird. If you want to debug it, I'd suggest you set up a scratch database and drop the primary key with alter table metrics DROP PRIMARY KEY; then train your model and execute this query to see if there are any duplicates:
SELECT `key`, `timestamp`, `step`, `run_uuid`, `value`, `is_nan`, count(*)
FROM metrics
GROUP BY `key`, `timestamp`, `step`, `run_uuid`, `value`, `is_nan`
HAVING COUNT(*) > 1
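As a rough illustration of that duplicate check, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The in-memory scratch table, the sample rows, and the run id "run1" are made up for the example; only the column names and the GROUP BY query come from the schema above.

```python
import sqlite3

# Scratch in-memory database; deliberately no primary key on metrics,
# mirroring the "drop the primary key" debugging setup.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metrics (
        "key" TEXT NOT NULL,
        "value" REAL NOT NULL,
        "timestamp" INTEGER NOT NULL,
        run_uuid TEXT NOT NULL,
        step INTEGER NOT NULL DEFAULT 0,
        is_nan INTEGER NOT NULL DEFAULT 0
    )
    """
)

# Hypothetical sample data: the first two rows are exact duplicates.
rows = [
    ("loss", 0.5, 1668631294953, "run1", 0, 0),
    ("loss", 0.5, 1668631294953, "run1", 0, 0),
    ("loss", 0.4, 1668631295953, "run1", 1, 0),
]
conn.executemany(
    'INSERT INTO metrics ("key", "value", "timestamp", run_uuid, step, is_nan) '
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Same shape as the query above: group on every primary-key column
# and keep only the groups that occur more than once.
duplicates = conn.execute(
    """
    SELECT "key", "timestamp", step, run_uuid, "value", is_nan, COUNT(*)
    FROM metrics
    GROUP BY "key", "timestamp", step, run_uuid, "value", is_nan
    HAVING COUNT(*) > 1
    """
).fetchall()

for row in duplicates:
    print(row)
```

Each row that comes back is a key combination that would clash once the primary key is restored; the last column is how many copies exist.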
OK, I found a problem. The timestamps in your imported metrics table are not legitimate "milliseconds since the Unix epoch". There seems to be an extra zero at the end. For example, 16686312949530 instead of the legitimate 1668631294953. If I convert the timestamp to a datetime I get: Tuesday, October 7, 2498 10:55:49.570 AM [GMT-04:00]. Without the trailing 0 I get Wednesday, November 16, 2022 3:41:34.957 PM [GMT-05:00]. How that happened beats me. I would check the original timestamps in your source database: select timestamp from metrics limit 10.
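For anyone who wants to run the same sanity check on their own timestamps, here is a small Python sketch. The helper names and the 2000 to 2100 "plausible year" window are assumptions picked for illustration, not anything MLflow defines.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms: int) -> datetime:
    """Interpret ms as milliseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(milliseconds=ms)


def looks_like_epoch_ms(ms: int, min_year: int = 2000, max_year: int = 2100) -> bool:
    """Heuristic only: does the timestamp land in a plausible range of years?"""
    try:
        year = ms_to_datetime(ms).year
    except OverflowError:
        # Too large to represent as a datetime at all.
        return False
    return min_year <= year <= max_year


# The two example values from this thread.
for ts in (1668631294953, 16686312949530):
    print(ts, ms_to_datetime(ts).isoformat(), looks_like_epoch_ms(ts))
```

The value with the extra trailing zero lands in the year 2498 and fails the check, while the value without it lands in November 2022.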
Thanks for looking into this! The source is not a database, it's a file store. The file store doesn't have any constraint on duplicate entries. I fixed the bug that was causing the invalid timestamps, but even then the database rejects duplicate entries. I think the root cause here is different behavior between the file store and the SQL store, which is an MLflow issue, not an import-export issue, but the import could in theory sanitize the inputs by removing duplicates. Anyway, I've changed my code to do this sanitization, so I'm not blocked. Feel free to keep this open if you want to do the sanitization, or close it if you don't. Thanks again!
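The sanitization code itself isn't shown in this thread, so the following is only a sketch of what such de-duplication might look like. The Metric record type and the dedupe_metrics helper are hypothetical, and the sketch assumes all metrics belong to a single run, so run_uuid is left out of the key. It compares on the same columns as the SQL primary key (key, timestamp, step, value, is_nan).

```python
import math
from typing import Iterable, List, NamedTuple, Tuple


class Metric(NamedTuple):
    # Hypothetical record type mirroring the metrics table columns for one run.
    key: str
    value: float
    timestamp: int
    step: int


def _primary_key(m: Metric) -> Tuple[str, int, int, float, bool]:
    # NaN never compares equal to itself, so fold it into an is_nan flag,
    # similar to the is_nan column in the SQL schema.
    is_nan = math.isnan(m.value)
    return (m.key, m.timestamp, m.step, 0.0 if is_nan else m.value, is_nan)


def dedupe_metrics(metrics: Iterable[Metric]) -> List[Metric]:
    """Drop exact duplicates, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for m in metrics:
        pk = _primary_key(m)
        if pk not in seen:
            seen.add(pk)
            unique.append(m)
    return unique


raw = [
    Metric("loss", 0.5, 1668631294953, 0),
    Metric("loss", 0.5, 1668631294953, 0),  # exact duplicate
    Metric("loss", 0.4, 1668631295953, 1),
    Metric("acc", float("nan"), 1668631295953, 1),
    Metric("acc", float("nan"), 1668631295953, 1),  # duplicate NaN
]
clean = dedupe_metrics(raw)
print(len(raw), "->", len(clean))
```

Running something like this over the metrics read from the file store, before handing them to the SQL-backed server, should avoid the primary key clash for exact duplicates.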
I would never use the file store - it certainly isn't meant to be production quality. Besides, it doesn't even support the model registry. I never understood why MLflow has such a weird creature. The simplest and safest way is to just use an embedded SQLite database.
Running import-run with the attached directory succeeds with MLFLOW_TRACKING_URI set to some local directory, but fails when it's set to a tracking server that is backed by SQL. I get this error:
mlflow-export-9198135f4a4c40ccb76a8c2ae8c61d8a.tar.gz