iterative / gto

🏷️ Git Tag Ops. Turn your Git repository into Artifact Registry or Model Registry.
https://dvc.org/doc/gto
Apache License 2.0

How to migrate the artifacts registered with GTO `v0.2.*` #366

Closed rnoxy closed 11 months ago

rnoxy commented 1 year ago

Assume we have a repo with artifacts registered with GTO v0.2.x (using the file artifacts.yaml). After upgrading GTO to v0.3.x, one has to switch from artifacts.yaml to dvc.yaml and start using dvc.api (in Python) to get the functionality of gto describe or gto annotate.

The question is how to use the new dvc.api with old artifacts registered with GTO v0.2.x. The problem is that dvc.api expects the file dvc.yaml in the repo.Index.

How can we migrate all registered artifacts? Should we rebase all commits and re-register the artifacts?

Any script for this process would be helpful.

aguschin commented 1 year ago

Hi @rnoxy. Do you use your artifacts from the CLI or from Studio? We made Studio support both formats (old and new), but for the CLI I was assuming you would use either the old GTO CLI or the new DVC API, not both at once.

There is this script that moves annotations from artifacts.yaml to dvc.yaml, but to move annotations for existing commits you'll need to rewrite the Git history of the repo, which is a complex task.
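
For the current working tree only (not past commits), the move could look roughly like the sketch below. This is not the script mentioned above, just a minimal illustration. It assumes the old file keeps artifacts at the top level and the new one keeps them under an "artifacts" key; wrap_old_annotations is a hypothetical helper name, it works on already-parsed dictionaries, and it does not rename any fields that may differ between the two formats.

```python
from copy import deepcopy


def wrap_old_annotations(old_artifacts, dvc_yaml=None):
    """Return a dvc.yaml-style dict with the old annotations merged in.

    old_artifacts: parsed content of artifacts.yaml (name -> annotation).
    dvc_yaml: parsed content of an existing dvc.yaml, or None.
    Entries already present in dvc.yaml win over the old ones.
    """
    merged = deepcopy(dvc_yaml) if dvc_yaml else {}
    # The "artifacts" key may be present but empty (None after parsing)
    new_section = merged.get("artifacts") or {}
    for name, annotation in (old_artifacts or {}).items():
        new_section.setdefault(name, deepcopy(annotation))
    merged["artifacts"] = new_section
    return merged


# Made-up sample data, for illustration only
old = {"model": {"path": "models/model.pkl", "type": "model"}}
existing = {"stages": {}, "artifacts": {"model": {"path": "models/new.pkl"}}}

print(wrap_old_annotations(old))
print(wrap_old_annotations(old, existing))
```

Letting existing dvc.yaml entries win is a design choice made for this sketch; you may prefer the opposite if the old annotations are the source of truth.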

If you need to work with both old and new formats, I guess the easiest option would be some kind of try...except construction: try the new format first, and if the annotation doesn't exist there, fall back to the old one.

Hope this is helpful. You can also ask other GTO users who participated in https://github.com/iterative/gto/issues/337; they may have better ideas on how to handle this.
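
Something along these lines, as a rough sketch. Here read_new and read_old are hypothetical callables standing in for whatever you use to read dvc.yaml and artifacts.yaml at a given revision, and the sketch assumes they raise KeyError or FileNotFoundError when the annotation is missing.

```python
def describe_with_fallback(name, read_new, read_old):
    """Try the new-format reader first, then fall back to the old one."""
    try:
        return read_new(name)
    except (KeyError, FileNotFoundError):
        pass  # no annotation in the new format, try the old one
    try:
        return read_old(name)
    except (KeyError, FileNotFoundError):
        return None


# Stand-in readers, for illustration only
def read_new(name):
    # Simulates a commit that has no dvc.yaml
    raise FileNotFoundError("dvc.yaml")


old_annotations = {"model": {"path": "models/model.pkl"}}


def read_old(name):
    return old_annotations[name]


print(describe_with_fallback("model", read_new, read_old))
print(describe_with_fallback("missing", read_new, read_old))
```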

rnoxy commented 1 year ago

Hi @aguschin, I am using plain GTO with DVC, so far only from the CLI and, recently, from the Python API. We use DVC only for data version control, not for experiments, pipelines, etc. We do not use Studio either.

I really liked the approach with gto describe and gto annotate. I do not understand why these commands were removed.

I think I can implement them on my own with Python git.Repo. For example, here is a first approach, which searches for the artifact in artifacts.yaml and dvc.yaml:

from typing import Optional

import yaml
from dvc.api import DVCFileSystem
from dvc.scm import RevError


def _get_dvc_artifact_path(fs: DVCFileSystem, artifact_name: str) -> Optional[str]:
    """
    Load the artifacts YAML file from the DVC repository and return the path to the artifact.

    This method is compatible with GTO v0.2.x (artifacts.yaml) and v0.3.x (dvc.yaml)
    format of the artifacts YAML file.

    In case of any error (e.g. the file is not found, the artifact is not found, etc.)
    None is returned.

    Args:
        fs: The DVCFileSystem object
        artifact_name: The name of the artifact to load.
    """
    for artifacts_filename in ["artifacts.yaml", "dvc.yaml"]:
        try:
            with fs.open(artifacts_filename) as f:
                # An empty YAML file loads as None, so fall back to an empty dict
                artifacts = yaml.safe_load(f) or {}
            # In `artifacts.yaml` the artifacts are at the root level
            # In `dvc.yaml` the artifacts are under the "artifacts" key
            if artifacts_filename == "dvc.yaml":
                artifacts = artifacts["artifacts"]
            return artifacts[artifact_name].get("path")
        except (KeyError, TypeError, AttributeError, FileNotFoundError):
            # File or artifact missing, or unexpected YAML shape: try the next file
            pass
        except RevError:
            # The revision could not be resolved, so the other file will fail too
            break
    return None