DVC+GTO integration - Githubissues

aguschin commented 1 year ago

Just had a discussion with @dberenbaum yesterday about this. Overall, feels like there are too many issues blocked by missing GTO/DVC integration. At the same time, I feel like GTO is not useful without DVC and is more like a plugin to DVC.

So, two options are possible: build an integration in GTO, or just merge GTO into DVC.

There are not many decent ways to actually store binaries in repo, which makes it hard to imagine using GTO without DVC in the real-world scenarios. I can recall these options:

Commit binaries to the repo - not a real option thou
Use git LFS
Use DVC
Commit them manually somewhere (could be s3, artifactory or something)

Supporting option 4 in GTO to have upload/download functionality as in https://github.com/iterative/gto/issues/307 will require integrations with each place (for s3 it would be fsspec probably, for artifactory some python client of theirs, etc). It makes me think these integrations could be also part of DVC as well (dvc import-url, etc).

Now if we can't imagine a good use-case for GTO without DVC, or all those use-cases would require some machinery that could also be part of DVC, why not just merge GTO into DVC?

Asking your opinion @francesco086 @shortcipher3 @bgalvao since you're the most active GTO users I know about :)

(this is not something we need to decide right away, since we can just build this integration inside of GTO, and then merge - but I think merging could make it more straightforward. I'm just collecting opinions for now to have more detailed picture).

bgalvao commented 1 year ago

My perspective is that of someone who never used GTO without DVC, and also that of Unix philosophy:

Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".

Thus, I would say, let DVC do the data management and let GTO stay dedicated to managing git tags (for models). This is not to say that GTO cannot become merged into DVC as a subcommand - I imagine as dvc gto or something similar. Nonetheless, the separation of concerns would be there.

One point in favour of this separation of concerns is that DVC has a lot of stability and contribution behind artifact management - artifact as in binaries, data, etc. From a development point of view, I would rather leverage the foundation of DVC rather than risk reinventing the wheel DVC has built - but I am not a code contributor to any of these projects so take that with a grain of salt.

Second, would be visibility of the gto functionality: DVC has a bigger name. I mention this because I really want to see Git Tag Ops become a standard practice.

For me, only 3. is a real use-case.

As for 4., I don't know how DVC would handle potentially different remotes dissociated from git repositories, so I have no real take on that.

Cheers ✌🏼

francesco086 commented 1 year ago

Hi @aguschin thanks for asking my opinion, I am happy to share my thoughts on this.

I feel very much in line with @bgalvao concerning the Unix philosophy. However, the conclusions I come to are probably a bit different.

What isgto goal? Artifact registry or just manage git tags? If it is the second one then it has no right to stand on its own, integrate it with dvc (but allow me to fork first :P). I think its purpose should be the first one. But then it misses some functionalities, e.g. put and pick as in https://github.com/iterative/gto/issues/307.

What I think would be the right way to follow the Unix philosophy is to create an interface in gto to integrate other tools, like dvc. Allow users to write their own gto-plugins. And provide dvc as an example plugin.

If we assume new gto commands like in https://github.com/iterative/gto/issues/307, I can imagine an interface along these lines:

class GTOIntegration(ABC):

    def pre_put():
        pass

    def after_put():
        pass

    def pre_pick():
        pass

    def after_pick():
        pass

Btw, git commands could also be integrated in this way.

Then, one can call gto and specify the plugin with an option, e.g. gto put --plugin dvc (--plugin default value will be git, which would mean commit/push the file as-it-is).

I should also mention that I don't like how many things does dvc do right now, it's way too much, and breaks the idea of "make each program do onw thing well". All experiment tracking, repro, etc. is out of scope imho, and should have been a separate program.

Hope my thoughts can be of some help :)

shortcipher3 commented 1 year ago

You mentioned merging the dvc.yaml with the artifacts.yaml, I would already like to break up my dvc.yaml. I like having them separate. Would like to better understand the value proposition of merging those if you could provide some context.

Like the others, I like the separation of concerns. I think using dvc as a component for gto makes a lot of sense. Perhaps I don't have the full context, but it seems like dvc shouldn't require gto, but gto could have a lot of additional functionality and features if it required dvc.

I can think of use cases where one might use dvc without gto, but not compelling cases for using gto without dvc.

I think treating gto as a dvc subcommand would be okay and if there are good enough reasons to merge the project then it would be worth considering. It would probably help with version compatibility while the projects continue to push toward stability.

Since it is open-source maybe at some point another party would want to add an option to use git lfs as an alternative to dvc, this would be easier to do if it were kept separate.

aguschin commented 1 year ago

Thanks @bgalvao @francesco086 @shortcipher3! That helps me a lot.

You mentioned merging the dvc.yaml with the artifacts.yaml, I would already like to break up my dvc.yaml. I like having them separate. Would like to better understand the value proposition of merging those if you could provide some context.

Sure! Mainly, two reasons I'm aware of:

we see that for some customers having too many files makes it hard to understand what happens. So the idea was to simplify onboarding experience (@dmpetrov mentioned this multiple times, so maybe he can give more details)
we have sections like stages, plots and metrics in dvc.yaml, so it makes sense to unify: have models, datasets and other artifact types there as well. And effectively, treat all those as artifact types you can register semantic versions for, or maybe assign stages to, if you need that.

maybe @shcheklein can also add something to the table.

As I see, there are 3 options now:

Keep GTO and DVC separate, as is now
Merge “artifacts.yaml” part of GTO in DVC. In this case GTO stays like a separate product, but it only manages Git tags, no artifacts.yaml management, no put/pick commands here
Merge “artifacts.yaml” and Git tags into DVC (i.e. merge GTO into DVC completely)

Last option would probably overcomplicate DVC too much... WDYT?

Also, we have many issues collected over time blocked by uncertainty of GTO/DVC integration.https://github.com/iterative/gto/issues/254. Sharing some thoughts about them below (feel free to add something if you have that in mind).

Some of them should be easier to implement in options 2&3, probably. Some will still require an integration between tags and artifacts.yaml (which means in option 2 we need to make external integration, in option 3 - internal integration). Listing most important here:

https://github.com/iterative/gto/issues/34 is implemented naturally in case of options 2&3
https://github.com/iterative/gto/issues/259 - instead of gto show --type model you'll run dvc models list as some replacement (although there is still question on how to get gto show --type model. Related to https://github.com/iterative/gto/issues/119 cc @tibor-mach)
https://github.com/iterative/gto/issues/198 - supporting monorepo naturally poke us into re-using DVC approach to that, justifying artifacts.yaml and dvc.yaml merge somewhat
https://github.com/iterative/gto/issues/158 - will require integration between artifacts.yaml and "Git tags" parts anyway (external in option 2, internal in option 3)
https://github.com/iterative/gto/issues/336 - maybe it's easier to implement if artifacts.yaml is inside of DVC

shortcipher3 commented 1 year ago

Regarding merging dvc.yaml and the artifacts.yaml file to reduce the number of required files to get onboarded, that makes sense to me. I can imagine teams with complicated setups wanting to break out their artifacts, plots, metrics, and/or stages into separate files. What about adding something like an include artifacts.yaml in the dvc.yaml spec?

I'm not seeing any killer features that are supported by 3 and not 2. I prefer 2 unless it blocks a great feature.

aguschin commented 1 year ago

Hi folks! We're drafting how this integration may look like - happy to hear your thoughts

https://github.com/iterative/mlem.ai/pull/323, inside there's a link to docs page

@shortcipher3 @francesco086 @bgalvao

aguschin commented 1 year ago

Hi @shortcipher3 @francesco086 @bgalvao! Happy to finally share that we finished with the update - I hope that'll be a smother experience now.

I recorded a brief video explaining the update, will make something better in a few days now. I also released https://github.com/iterative/gto/releases/tag/0.3.0 that deprecated reading/writing to artifacts.yaml and gto-action@v2 that now reads dvc.yaml for annotations instead of artifacts.yaml.

We'll be working on implementing dvc get . mymodel@latest and dvc get . mymodel#prod now to allow using models without involving GTO explicitly - that should make it easier.

Sorry for deprecating artifacts.yaml - to work with new MR, you'll need to move your annotations to dvc.yaml. I tried to implement it, but considering the big scope of the task overall, I had to eventually drop the support :( If you feel like you need it, please create a ticket in DVC repo and backlink to this discussion / share the importance of this so we can prioritize it right.

Overall, thanks for participating here, your feedback definitely made it way easier for me to understand the direction of changes!

aguschin commented 1 year ago

If you'll need to migrate from artifacts.yaml to dvc.yaml, I wrote a little helper script, you can find it in this section of docs, which explains new format: https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#artifacts

bgalvao commented 1 year ago

@aguschin great work! Love seeing the value-proposition of GTO making it into DVC :)

I have a question: if you declare models in the artifacts section of dvc.yaml, do you still declare them as outputs of a pipeline stage? For example:

artifacts:
  cv-classification: # artifact ID (name)
    path: models/resnet.pt  # same
    type: model
    desc: 'CV classification model, ResNet50'
    labels:
      - resnet50
      - classification
    meta:
      framework: pytorch

stages:
  train_resnet50:
    cmd: python -m src.model.train.resnet50
    outs:
      - models/resnet.pt  # same

aguschin commented 1 year ago

Thanks @bgalvao! Yes - adding artifacts: doesn't change what you have in stages or in other parts of dvc.yaml. Does it make sense?

bgalvao commented 1 year ago

it does :D

francesco086 commented 1 year ago

Awesome work as always @aguschin !

Thank you 🙏

francesco086 commented 1 year ago

We'll be working on implementing dvc get . mymodel@latest and dvc get . mymodel#prod now to allow using models without involving GTO explicitly - that should make it easier.

@aguschin which github issue can we follow to know about the progress on this?

aguschin commented 1 year ago

@francesco086, it's https://github.com/iterative/dvc/issues/9100#issuecomment-1559148246

Linking the comment that have a summary about all the issue, so you can skip reading other parts of it.

iterative / gto

DVC+GTO integration #337