iterative / gto

🏷️ Git Tag Ops. Turn your Git repository into Artifact Registry or Model Registry.
https://dvc.org/doc/gto
Apache License 2.0
140 stars 16 forks source link

`gto add` and `artifacts.yaml` arguments/fields #122

Closed dberenbaum closed 2 years ago

dberenbaum commented 2 years ago

Some questions/feedback on gto add and artifacts.yaml:

My biggest question is whether the gto add TYPE NAME PATH really needs all 3 required arguments. It's a lot to remember and type to add an artifact, and it's not clear why all 3 are required and how each is useful. Related questions:

  1. Why not make a --type option instead?
  2. Should artifacts be defined by path or name? It's unclear if names are intended to be aliases for paths, or if names are the primary artifact identifiers, in which case it's unclear why paths are required. If paths aren't required, --virtual is probably not even necessary since external artifacts could be added without paths.
  3. Why is the name listed both as a top-level key in artifacts.yaml and also under each key in the name field? Are both needed?

One smaller unrelated question:

  1. Could tags be renamed to something else like labels to avoid confusion with git tags?
aguschin commented 2 years ago

Thanks for your feedback, @dberenbaum!

  1. I think we can do that. One question: let's suppose you have some artifacts: one of type "model", another of type "dataset" and several without any type. You could expect to list models with gto show --type model or even gto show model, but how would you expect listing artifacts without type?

  2. My take is that they should be defined by names. It makes git tags look beautiful. Besides that, I never seen git tags like path/to/something@v1.0.0. Finally, from technical perspective, git tags that contain slashes / may cause some issues (I would add such a restriction on artifact names, but it's not in place right now). The downside of using names is that you cannot take existing commit and add a path to the artifact, if you didn't commit artifacts.yaml in advance. The question is do you need this path at all? You could hardcode it in your CI as a workaround.

2a. Path could not required for virtual artifact, I agree. WDYT about these examples?

$ gto add rf --type model  # (no path) adds enrichment for virtual artifact
$ gto add rf --path path/to/it  # (with path) adds enrichment for virtual artifact
$ gto add rf --path path/to/it  --must-exist  # (with path) checks that path is committed, then adds enrichment for NON-virtual artifact

^ by default we assume artifact is virtual, until you say it's not. Then only we verify existence.

other option is to just check if path is supplied, check that path is committed, and put virtual flag automatically:

$ gto add rf --path path/to/it  # if path exists, then write it's not a virtual artifact. If don't exists, write it's a virtual.
  1. It's a technical issue, I'll resolve that, #84

  2. Sounds reasonable, let's call them labels.

cc @dmpetrov

dberenbaum commented 2 years ago
1. I think we can do that. One question: let's suppose you have some artifacts: one of type "model", another of type "dataset" and several without any type. You could expect to list models with `gto show --type model` or even `gto show model`, but how would you expect listing artifacts without type?

I think gto show should include those artifacts but gto show --type should always exclude them.

2. My take is that they should be defined by names. It makes git tags look beautiful. Besides that, I never seen git tags like `path/to/something@v1.0.0`. Finally, from technical perspective, git tags that contain slashes `/` may cause some issues (I would add such a restriction on artifact names, but it's not in place right now).
   The downside of using names is that you cannot take existing commit and add a path to the artifact, if you didn't commit `artifacts.yaml` in advance. The question is do you need this path at all? You could hardcode it in your CI as a workaround.

I agree that the workflow of having to make an extra commit to start tracking an artifact path feels awkward. It reminds me of the mlflow experiment tracking problem where you need to log the commit as part of your run and then save the results of the run in a separate commit.

Should the commit history of artifacts.yaml matter? What if you always use artifacts.yaml from the current workspace to describe the model? This makes it easy to update paths, descriptions, etc. without having to change the commit history. This is similar to how dvc plots works, where dvc uses the git history to collect the data but renders the plots based on the current plot configuration. The only downside to this approach has been that it can make things slightly confusing in Studio, where there's no current workspace (but often the head of the default branch is sufficient).

2a. Path could not required for virtual artifact, I agree. WDYT about these examples?


$ gto add rf --type model  # (no path) adds enrichment for virtual artifact

I think allowing this is probably enough for now. The other ideas about how to handle --path for virtual artifacts seem like they don't change enough to be worth worrying about for now.

aguschin commented 2 years ago

Ok, first part of this issue was implemented in #130.

@dberenbaum Now, about using artifacts.yaml from HEAD. While it makes sense when you want to display something, as with dvc plots, or in Studio / GTO CLI (to show path/description of the HEAD/latest version first), I'm not sure about how broadly we should extend this idea.

I guess when you call gto show --type model, we can TAL at the latest registered version or HEAD. But that doesn't mean when you want to, let's say, deploy your model from HEAD^10, you can take path from HEAD - it may changed already.

One more question. If you do gto history --type model and you have an artifact that was of type cnn in some commits and model in others, do you expect to see part of the history when it was model? Or do you expect to see all its history? Do your expectations change if it's model in the latest version (or HEAD), or if it's a cnn?

dberenbaum commented 2 years ago

Thanks, @aguschin. Let's forget about using artifacts.yaml from HEAD for now. I was thinking it could be useful for certain ephemeral info like descriptions, but I don't think it's necessary and probably adds more confusion for now.

dberenbaum commented 2 years ago

@dmpetrov mentioned the ability to register a default model/dataset without a name. Is that implemented, and how does/will that work? Is it gto register model?

I'm not clear on whether that causes any issues with the suggested changes here to make names the unique artifact keys and make types optional.

aguschin commented 2 years ago

Almost, but it's not artifact "type=model without name", instead it's "name=model without type". Which will be treated specially. Basically, Studio will join output of gto show --type model and gto show --name model to show in Model Registry.

(Just want to note here that we have related discussion going on in #127)

dberenbaum commented 2 years ago

Okay, let's close and discuss over there

jorgeorpinel commented 2 years ago

Just found this. Interesting discussion

I never seen git tags like path/to/something@v1.0.0. Finally, from technical perspective, git tags that contain slashes / may cause some issues

You can use hierarchical tags. In fact I think that that at least GTO tags should at least be grouped into gto/alias@version format. This could be useful for other systems to avoid clutter when displaying Git tags (hypothetically).

Then we can recommend that the entire refs/gto/ dir. gets protected in Github, Gitlab, etc.