mlflow / mlflow-export-import


Add start_date and end_date in importer for runs #72

Open · almajo opened this issue 1 year ago

almajo commented 1 year ago

Currently the start_date and end_date of the exported runs are not used. Instead, the run is created with the current timestamp, which messes up the whole imported experiment history.

The fix is simple, as we only need to pass the kwargs through to the client methods that are already being used.

amesar commented 1 year ago

end_time is a strictly read-only value set by the system. start_time is also set by the system, with an option in create_run() to pass in a time, though without being able to set end_time as well, setting start_time alone doesn't make sense. What kwargs and what methods are you referring to?

You can still access the original start_time and end_time values if you use the import option --import-source-tags. It will save the values as tags in the destination run such as "mlflow_exim.run_info.start_time". See https://github.com/mlflow/mlflow-export-import/blob/master/README_governance.md.
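
For anyone who just needs to read those values back, a minimal sketch (the run id is a placeholder, and the end_time tag key is assumed to follow the same naming pattern as the key mentioned above):

from mlflow.tracking import MlflowClient

client = MlflowClient()
run = client.get_run("<imported-run-id>")  # placeholder run id
# --import-source-tags preserves the original values as tags on the destination run
orig_start = run.data.tags.get("mlflow_exim.run_info.start_time")
orig_end = run.data.tags.get("mlflow_exim.run_info.end_time")  # assumed analogous key
print(orig_start, orig_end)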

almajo commented 1 year ago

def _import_run(self, dst_exp_name, input_dir, dst_notebook_dir):
        exp_id = mlflow_utils.set_experiment(self.mlflow_client, self.dbx_client, dst_exp_name)
        exp = self.mlflow_client.get_experiment(exp_id)
        src_run_path = os.path.join(input_dir,"run.json")
        src_run_dct = io_utils.read_file_mlflow(src_run_path)

        run = self.mlflow_client.create_run(exp.experiment_id)  # no start_time passed, so the run gets the current time
        run_id = run.info.run_id
        try:
            self._import_run_data(src_run_dct, run_id, src_run_dct["info"]["user_id"])
            path = os.path.join(input_dir, "artifacts")
            if os.path.exists(_filesystem.mk_local_path(path)):
                self.mlflow_client.log_artifacts(run_id, mk_local_path(path))
            if self.mlmodel_fix:
                self._update_mlmodel_run_id(run_id)
            self.mlflow_client.set_terminated(run_id, RunStatus.to_string(RunStatus.FINISHED))  # end_time defaults to the current time
            run = self.mlflow_client.get_run(run_id)
        except Exception as e:
            self.mlflow_client.set_terminated(run_id, RunStatus.to_string(RunStatus.FAILED))
            import traceback
            traceback.print_exc()
            raise MlflowExportImportException from e
        if utils.importing_into_databricks() and dst_notebook_dir:
            ndir = os.path.join(dst_notebook_dir, run_id) if self.dst_notebook_dir_add_run_id else dst_notebook_dir
            self._upload_databricks_notebook(input_dir, src_run_dct, ndir)

        return (run, src_run_dct["tags"].get(MLFLOW_PARENT_RUN_ID,None))

I'm referring to passing src_run_dct["info"]["start_time"] to the create_run method for the start time, and passing src_run_dct["info"]["end_time"] to the end_time kwarg of set_terminated(..., end_time=XX).
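
A minimal sketch of what that could look like at the two call sites in _import_run above (not the project's actual implementation; it assumes run.json stores the times as epoch milliseconds, which is what the MLflow client expects):

src_info = src_run_dct["info"]

run = self.mlflow_client.create_run(
    exp.experiment_id,
    start_time=src_info.get("start_time"),  # original start time instead of "now"
)

self.mlflow_client.set_terminated(
    run_id,
    RunStatus.to_string(RunStatus.FINISHED),
    end_time=src_info.get("end_time"),  # original end time instead of the import timestamp
)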

I'm aware that I can transfer those as extra tags with the CLI option; however, I prefer having an ordered list in my experiment overview (ordered by start_date).

jrakotobe commented 8 months ago

I believe what @almajo requested would be the expected behavior in many cases, such as transferring all of my data from a filesystem backend to a SQL database. Since the functions create_run and set_terminated already accept start_time and end_time as arguments, why not add an option to use the imported times (which are already stored at export time) instead of the system times?
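
As a small self-contained check that the public client API does accept explicit times (the experiment name is made up; both timestamps are epoch milliseconds):

import time
from mlflow.tracking import MlflowClient
from mlflow.entities import RunStatus

client = MlflowClient()
exp_id = client.create_experiment("timestamp-check")  # hypothetical experiment name
one_hour_ago = int(time.time() * 1000) - 3600 * 1000

run = client.create_run(exp_id, start_time=one_hour_ago)
client.set_terminated(
    run.info.run_id,
    RunStatus.to_string(RunStatus.FINISHED),
    end_time=one_hour_ago + 60 * 1000,  # pretend the run lasted one minute
)
print(client.get_run(run.info.run_id).info.start_time)  # should print the backdated start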

slnc commented 1 month ago

I'm with @jrakotobe and @almajo. This bug corrupts your data, which seems like a major issue for the official import/export tool, and it locks people into whatever backend they choose at installation time.

@amesar is there anything that would prevent a fix PR like @almajo's from being merged? I'm happy to send one if no one else wants to take it on.