iterative / gto

🏷️ Git Tag Ops. Turn your Git repository into Artifact Registry or Model Registry.
https://dvc.org/doc/gto
Apache License 2.0

Lightweight tags and tag namespaces #127

Open aguschin opened 2 years ago

aguschin commented 2 years ago

Lightweight tags are created with a plain git tag TAGNAME and don't contain information about the creator, date, etc. GTO doesn't create them, but users can. They can register versions or promote something.

Right now GTO ignores them, although they can still trigger CI and look pretty similar to annotated tags to an inexperienced user. We need to support them, I guess.

I would suggest reading and interpreting them by default, but supporting a flag --annotated to filter them out. The downside is that with lightweight tags you don't know the date of promotion, author, etc., so gto history and gto show myartifact will have some fields blank.
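To make the distinction concrete, here is a minimal sketch (not GTO code) of telling the two kinds of tags apart. It relies on the fact that git for-each-ref reports the object type "tag" for annotated tags and usually "commit" for lightweight ones. The function names classify_tags and list_tag_kinds are hypothetical.

```python
import subprocess


def classify_tags(for_each_ref_output: str) -> dict[str, str]:
    """Map tag name -> 'annotated' or 'lightweight'.

    Expects lines of the form '<objecttype> <tagname>', as printed by:
    git for-each-ref --format='%(objecttype) %(refname:short)' refs/tags
    """
    kinds = {}
    for line in for_each_ref_output.splitlines():
        if not line.strip():
            continue
        objecttype, name = line.split(" ", 1)
        # An annotated tag is its own git object of type "tag";
        # a lightweight tag is a ref pointing straight at another object.
        kinds[name] = "annotated" if objecttype == "tag" else "lightweight"
    return kinds


def list_tag_kinds(repo_path: str = ".") -> dict[str, str]:
    """Run git in repo_path and classify every tag found there."""
    out = subprocess.run(
        ["git", "for-each-ref",
         "--format=%(objecttype) %(refname:short)", "refs/tags"],
        cwd=repo_path, check=True, capture_output=True, text=True,
    ).stdout
    return classify_tags(out)
```

A filter such as the proposed --annotated flag could then simply drop every entry classified as lightweight.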

omesser commented 2 years ago

@aguschin - Can I ask why this is needed? Is there a use case? They lack mandatory information by definition, which either prevents us from having mandatory fields or creates inconsistency.

Also, they are usually meant for local usage and not for collaboration, and it's bad practice to push them to remotes. Local inconsistencies don't sound like a desirable property for a registry, which is usually all about a consistent, shared source of truth to rely on in production ops.

Partially unrelated - Maybe we should even prefix gto's tags with gto: so as to avoid accidental mixup between gto's tags and other annotated tags?

aguschin commented 2 years ago

@omesser thanks for your feedback!

"@aguschin - Can I ask why is this needed? Is there a use case? They lack mandatory information by definition - either limiting us to not have mandatory field, or, create inconsistency."

No clear use case, I just think it can be confusing for users why they have some tag, but GTO doesn't take it into account. E.g. you see an rf@v0.1.2 LW tag, but don't see a v0.1.2 version for rf. Or you trigger CI with a manually created LW tag rf#prod-5, successfully deploy your model with CI, and then find out that GTO doesn't know about the promotion.
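For illustration, a minimal sketch of how the two tag shapes mentioned above (rf@v0.1.2 for a version, rf#prod-5 for a promotion) could be recognized. The regular expressions and the parse_tag helper are assumptions made for this example, not GTO's actual parser.

```python
import re
from typing import Optional

# Assumed shapes, taken from the examples in this thread:
#   name@vX.Y.Z   registers a version
#   name#stage-N  promotes to a stage, N being a counter
VERSION_TAG = re.compile(r"^(?P<name>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
PROMOTION_TAG = re.compile(r"^(?P<name>[^@#]+)#(?P<stage>[^#-]+)-(?P<n>\d+)$")


def parse_tag(tag: str) -> Optional[dict]:
    """Return a description of the tag, or None if it matches neither shape."""
    m = VERSION_TAG.match(tag)
    if m:
        return {"action": "register", "name": m["name"], "version": m["version"]}
    m = PROMOTION_TAG.match(tag)
    if m:
        return {"action": "promote", "name": m["name"],
                "stage": m["stage"], "counter": int(m["n"])}
    return None
```

Note that nothing in the tag name says whether the tag is lightweight or annotated, which is why a manually created LW tag can look valid to CI while being ignored by GTO.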

My take was that we should encourage users to create annotated tags rather than LW ones, but still offer some way to support LW tags. Do you argue that we don't need to support them at all? Or that we shouldn't support them by default and, instead of --annotated, should have a --lightweight flag that takes LW tags into account?

"Partially unrelated - Maybe we should even prefix gto's tags with gto: so as to avoid accidental mixup between gto's tags and other annotated tags?"

🤔 This would remove the simplicity @dmpetrov wants to have, but we can make it optional, I guess. On the other hand, maybe it's OK to parse other tags too: e.g. if a user has a convention of registering versions as name@version, they'll also see their things in GTO.

As a side note, I saw some tools that use this structure, but tag conventions are different (related to #114). E.g.

  1. rushstack (it's a tool). Tag example: @rushstack/heft-jest-plugin_v0.2.14
  2. Lerna (it's a tool), example repo. Tag example: react-scripts@5.0.1. Another example repo: jupyterlab. Tag example: @jupyterlab/statusbar@3.3.4
  3. Changesets (tool), example repo. Tag example: @emotion/babel-plugin@11.9.2

We can't support all of these, I believe (maybe some cmd flags can help?). Right now we are compatible with 2, 3.
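As a rough illustration of why these conventions differ, here is a sketch that splits the example tags above into a name and a version. It handles the name@version shape (Lerna, Changesets) and the name_vVERSION shape (rushstack). The helper split_name_version is hypothetical and not part of GTO.

```python
from typing import Optional, Tuple


def _looks_like_version(text: str) -> bool:
    # Accept an optional leading "v", then require a digit.
    if text.startswith("v"):
        text = text[1:]
    return text[:1].isdigit()


def split_name_version(tag: str) -> Optional[Tuple[str, str]]:
    """Split a tag into (name, version), or return None if no shape fits."""
    # name@version: split on the LAST "@" so a leading npm scope "@"
    # (as in @jupyterlab/statusbar@3.3.4) stays part of the name.
    name, sep, version = tag.rpartition("@")
    if sep and name and _looks_like_version(version):
        return name, version
    # name_vVERSION, as in @rushstack/heft-jest-plugin_v0.2.14
    name, sep, version = tag.rpartition("_v")
    if sep and name and _looks_like_version(version):
        return name, version
    return None
```

Supporting another convention would mean adding another separator rule, which is where command-line flags might help.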

omesser commented 2 years ago

Thanks for the quick answer @aguschin !

LW tags

"No clear use case, I just think it can be confusing for users why the have some tag, but GTO doesn't take it into account. E.g. you see rf@v0.1.2 LW tag, but don't see a v0.1.2 version for rf. Or you trigger CI with manually created LW tag rf#prod-5, successfully deploy your model with CI, and then find out that GTO don't know about promotion."

Why would users create tags manually and not with gto in the first place here? And how are they supposed to know about the restrictions and format of gto "managed" tags? It sounds like advanced usage to me personally - using gto would be easier, safer and simpler than creating tags manually for most users.

And of course, there's the other direction - What if the users want to create tags to trigger CI/CD or other reasons but don't want / don't foresee the effects in gto / the "registry"? How will they "opt out" if gto takes over all tags in the repo?

My biggest concern with LW tags is their tendency to be local. My local gto run would show me things that won't show on anyone else's checkout (or in Studio) because they weren't pushed to the upstream repo.

About namespacing tags

"this will remove simplicity @dmpetrov wants to have"

I'm not sure picking up all tags makes things simpler rather than more complicated. It might make things worse for a team collaborating, if people start mixing formats and doing things manually by directly manipulating tags. A very real and likely possibility is that they will also manually remove tags to deregister (which is natural) and modify them instead of adhering to the prod-N convention, and that's not how we want it to work in gto. It will make us lose history, which is most of the value gto gives with those conventions over using git tags directly.

Suggestions

So I would suggest a logical way to take this is:

Would love to get more opinions on this (CC @aguschin @dmpetrov @shcheklein @casperdcl @mike0sv ) - Are "clean looking" tags the priority here for a default behavior? I personally tend to favor healthy behavior in advanced/production use cases / large teams when things get "hairy" over the "demo mode" scenarios, and don't think a gto prefix takes away from that significantly. A missing proper namespace / oversimplifying can cause frustration when users start integrating tools in their real environments and start hitting walls / edge cases only after scaling their usage, when it's too late or too expensive to modify their modus operandi.

aguschin commented 2 years ago

Two thoughts in continuation of this discussion

Namespaces

After some thinking, I would suggest what came up in the Product Sync call - using namespaces as types. Using model and dataset namespaces makes sense to me as a way to easily (and without artifacts.yaml) identify artifact types, although namespaces may mean something different in Lerna/Changesets (namespaces there are namespaces on NPM).
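A minimal sketch of the namespace-as-type idea, using tag shapes such as model/nn@v1.2.3 and dataset/train@v2.3.4. The ArtifactRef type and parse_namespaced_tag function are hypothetical names used only for this example.

```python
from typing import NamedTuple, Optional


class ArtifactRef(NamedTuple):
    artifact_type: Optional[str]  # the namespace, e.g. "model", or None
    name: str
    version: str


def parse_namespaced_tag(tag: str) -> ArtifactRef:
    """Read type/name@version; the part before the first '/' is the type."""
    ref, sep, version = tag.rpartition("@")
    if not sep or not ref or not version:
        raise ValueError(f"not a name@version tag: {tag!r}")
    namespace, slash, name = ref.partition("/")
    if slash and namespace and name:
        return ArtifactRef(namespace, name, version)
    # No namespace: the type cannot be inferred from the tag alone.
    return ArtifactRef(None, ref, version)
```

With this shape the artifact type can be read from the tag itself, without consulting artifacts.yaml.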

If we use model/nn@v1.2.3 + dataset/train@v2.3.4 instead of model@v1.2.3 + dataset@v1.2.3, it will make this type approach thing more general. I see two downsides:

Special names

On the other hand, I don't think hardcoding model and dataset as special names makes sense in GTO to avoid messing with Studio BE. Studio should query the GTO API to get all artifacts anyway, and will then filter them by "type". When filtering, BE can show special aliases like "model" together with type=model.

One downside of this is having different things shown in GTO and Studio. But I have trouble thinking of a case where this would break something when you have a single artifact of type "model" in the repo. And when you have artifacts.yaml, it is not hard to add the name "model" there and give it the type "model", so everything works similarly in Studio and in the CLI.

jellebouwman commented 2 years ago

Thanks for this discussion @aguschin & @omesser - this thread came up during yesterday's Studio Front-end Sync when we were discussing the last steps we need to perform to get Models into the Studio View table. Studio might also benefit from what is being proposed here. I don't think Studio's implementation needs to be prioritised, but it might be nice to consider it.

Studio context

As you can see in this screenshot ([Link to Figma design](https://www.figma.com/file/vH5PkxRqQwYIFJxwK9AjG7/Model-Registry?node-id=52%3A8625)),

for each commit row in the table, in the first cell we show a commit name. The commit name consists of any tags associated with the commit. In addition to this existing behaviour, we want to show the GTO model version inside the model's column cell. To be able to reliably filter out the GTO tag that describes the model version, having these namespaced tags would make this job easier. cc @Suor

aguschin commented 2 years ago

Another thought: right now our convention for versions is model@v1.0.0, with v. Lerna/Changesets don't use v, and have tags like model@0.0.1. If we want to be compatible with them for some reason, we need to remove v. If, on the other hand, we want to be different from them, we may keep that v as one differentiator.

omesser commented 2 years ago

Please see this discussion about v prefix. It's not officially part of the version/semver but it is a super common way to "tell" that the trailing string denotes a version. And it is common practice in git tags: https://git-scm.com/book/en/v2/Git-Basics-Tagging

I suggest we keep it. It looks like we're still discussing the prefixes and namespaces - and it would be weird if we end up supporting 1.0.0 tag and not v1.0.0 tag. I also don't think it takes away value - model@v1.0.0 looks better than model@1.0.0 to me, as I'm used to seeing v prefixes, and it clearly denotes the trailing number is a version, which it is
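To show what supporting both spellings could look like, here is a small sketch of a version parser that accepts an optional v and can optionally require it. The parse_version function and its require_v parameter are assumptions for illustration, not an existing GTO API.

```python
import re
from typing import Tuple

# MAJOR.MINOR.PATCH with an optional leading "v".
VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str, require_v: bool = False) -> Tuple[int, ...]:
    """Return (major, minor, patch) as integers, or raise ValueError."""
    m = VERSION.match(text)
    if not m:
        raise ValueError(f"not a version: {text!r}")
    if require_v and not text.startswith("v"):
        raise ValueError(f"missing 'v' prefix: {text!r}")
    return tuple(int(part) for part in m.groups())
```

Keeping the v as the convention would correspond to calling this with require_v=True.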

casperdcl commented 2 years ago

I prefer v too - some version-parsing tools expect it

aguschin commented 1 year ago

For the record, how we may handle namespaces is now affected by current decision to support monorepos: tag models=mymodel references artifact mymodel annotated in models/dvc.yaml.
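A minimal sketch of the mapping described above, where the part before "=" names the directory holding dvc.yaml. The resolve_monorepo_ref helper is hypothetical, and the fallback to a root-level dvc.yaml for names without "=" is an assumption made for this example.

```python
from pathlib import PurePosixPath
from typing import Tuple


def resolve_monorepo_ref(ref: str) -> Tuple[str, str]:
    """Map 'models=mymodel' to ('models/dvc.yaml', 'mymodel')."""
    directory, sep, name = ref.rpartition("=")
    if not sep:
        # Assumption: no "=" means the artifact lives in the root dvc.yaml.
        return "dvc.yaml", ref
    return str(PurePosixPath(directory) / "dvc.yaml"), name
```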