My perspective is that of someone who never used GTO without DVC, and also that of Unix philosophy:
> Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
Thus, I would say, let DVC do the data management and let GTO stay dedicated to managing git tags (for models). This is not to say that GTO cannot become merged into DVC as a subcommand - I imagine as `dvc gto` or something similar.
Nonetheless, the separation of concerns would be there.
One point in favour of this separation of concerns is that DVC has a lot of stability and contribution behind artifact management - artifact as in binaries, data, etc. From a development point of view, I would rather leverage the foundation of DVC rather than risk reinventing the wheel DVC has built - but I am not a code contributor to any of these projects so take that with a grain of salt.
Second would be visibility of the gto functionality: DVC has a bigger name. I mention this because I really want to see Git Tag Ops become a standard practice.
For me, only option 3 is a real use-case.
As for option 4, I don't know how DVC would handle potentially different remotes dissociated from git repositories, so I have no real take on that.
Cheers ✌🏼
Hi @aguschin thanks for asking my opinion, I am happy to share my thoughts on this.
I feel very much in line with @bgalvao concerning the Unix philosophy. However, the conclusions I come to are probably a bit different.
What is `gto`'s goal? Artifact registry, or just managing git tags?
If it is the second one, then it has no right to stand on its own; integrate it with `dvc` (but allow me to fork first :P). I think its purpose should be the first one. But then it misses some functionalities, e.g. `put` and `pick` as in https://github.com/iterative/gto/issues/307.
What I think would be the right way to follow the Unix philosophy is to create an interface in gto to integrate other tools, like dvc. Allow users to write their own gto plugins, and provide `dvc` as an example plugin.
If we assume new gto commands like in https://github.com/iterative/gto/issues/307, I can imagine an interface along these lines:
```python
from abc import ABC

class GTOIntegration(ABC):
    # Hooks a plugin would implement around the proposed gto put/pick commands.
    def pre_put(self):
        pass

    def after_put(self):
        pass

    def pre_pick(self):
        pass

    def after_pick(self):
        pass
```
Btw, `git` commands could also be integrated in this way.
Then, one can call `gto` and specify the plugin with an option, e.g. `gto put --plugin dvc` (the `--plugin` default value will be `git`, which would mean commit/push the file as it is).
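For illustration, a `dvc` plugin built on that interface might look like the sketch below. This is purely hypothetical: the hook names come from the interface above, while the import path, the constructor, and the choice of DVC commands are assumptions of mine, not an existing gto API.

```python
import subprocess

# Hypothetical import path: it assumes the GTOIntegration interface sketched above
# ships somewhere importable; gto has no such module today.
from gto.plugins import GTOIntegration


class DvcPlugin(GTOIntegration):
    """Illustrative plugin: let DVC version the binary instead of committing it to Git."""

    def __init__(self, path):
        self.path = path  # path of the artifact being put/picked

    def pre_put(self):
        # Track the artifact with DVC so only the small .dvc file lands in Git.
        subprocess.run(["dvc", "add", self.path], check=True)

    def after_put(self):
        # Upload the tracked artifact to the configured DVC remote.
        subprocess.run(["dvc", "push", self.path], check=True)

    def pre_pick(self):
        # Fetch the artifact from the DVC remote before it is handed to the user.
        subprocess.run(["dvc", "pull", self.path], check=True)

    def after_pick(self):
        pass
```

With something like this, `gto put --plugin dvc` could dispatch to `DvcPlugin`, while the default `git` plugin would just commit the file as it is.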
I should also mention that I don't like how many things `dvc` does right now; it's way too much, and it breaks the idea of "make each program do one thing well". All the experiment tracking, repro, etc. is out of scope imho, and should have been a separate program.
Hope my thoughts can be of some help :)
You mentioned merging the `dvc.yaml` with the `artifacts.yaml`; I would already like to break up my `dvc.yaml`. I like having them separate. Would like to better understand the value proposition of merging those if you could provide some context.
Like the others, I like the separation of concerns. I think using dvc as a component for `gto` makes a lot of sense. Perhaps I don't have the full context, but it seems like `dvc` shouldn't require `gto`, but `gto` could have a lot of additional functionality and features if it required `dvc`.
I can think of use cases where one might use `dvc` without `gto`, but not compelling cases for using `gto` without `dvc`.
I think treating `gto` as a `dvc` subcommand would be okay, and if there are good enough reasons to merge the projects then it would be worth considering. It would probably help with version compatibility while the projects continue to push toward stability.
Since it is open-source, maybe at some point another party would want to add an option to use git lfs as an alternative to dvc; this would be easier to do if it were kept separate.
Thanks @bgalvao @francesco086 @shortcipher3! That helps me a lot.
> You mentioned merging the `dvc.yaml` with the `artifacts.yaml`, I would already like to break up my `dvc.yaml`. I like having them separate. Would like to better understand the value proposition of merging those if you could provide some context.
Sure! Mainly, two reasons I'm aware of:
1. `stages`, `plots` and `metrics` are already in `dvc.yaml`, so it makes sense to unify: have `models`, `datasets` and other artifact types there as well.
2. Effectively, treat all those as artifact types you can register semantic versions for, or maybe assign stages to, if you need that.
Maybe @shcheklein can also add something to the table.
As I see, there are 3 options now:
1. …
2. …
3. `artifacts.yaml` management, no `put`/`pick` commands here

The last option would probably overcomplicate DVC too much... WDYT?
Also, we have many issues collected over time blocked by the uncertainty of GTO/DVC integration: https://github.com/iterative/gto/issues/254. Sharing some thoughts about them below (feel free to add something if you have that in mind).
Some of them should be easier to implement in options 2 & 3, probably. Some will still require an integration between tags and `artifacts.yaml` (which means in option 2 we need to make an external integration, in option 3 an internal one). Listing the most important ones here:
- Instead of `gto show --type model`, you'll run `dvc models list` as some replacement (although there is still a question of how to get `gto show --type model`; related to https://github.com/iterative/gto/issues/119, cc @tibor-mach)
- Some need `artifacts.yaml` and `dvc.yaml` to merge somewhat
- Some will require an integration between the `artifacts.yaml` and "Git tags" parts anyway (external in option 2, internal in option 3)
- Some get easier if `artifacts.yaml` is inside of DVC

Regarding merging `dvc.yaml` and the `artifacts.yaml` file to reduce the number of required files to get onboarded, that makes sense to me. I can imagine teams with complicated setups wanting to break out their artifacts, plots, metrics, and/or stages into separate files. What about adding something like an include `artifacts.yaml` in the `dvc.yaml` spec?
I'm not seeing any killer features that are supported by option 3 and not option 2. I prefer option 2 unless it blocks a great feature.
Hi folks! We're drafting how this integration may look - happy to hear your thoughts: https://github.com/iterative/mlem.ai/pull/323 (inside there's a link to the docs page).
@shortcipher3 @francesco086 @bgalvao
Hi @shortcipher3 @francesco086 @bgalvao! Happy to finally share that we finished the update - I hope that'll be a smoother experience now.
I recorded a brief video explaining the update, and will make something better in a few days.
I also released https://github.com/iterative/gto/releases/tag/0.3.0, which deprecates reading/writing `artifacts.yaml`, and `gto-action@v2`, which now reads `dvc.yaml` for annotations instead of `artifacts.yaml`.
We'll be working on implementing `dvc get . mymodel@latest` and `dvc get . mymodel#prod` now to allow using models without involving GTO explicitly - that should make it easier.
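For context, a GTO version or stage is just a Git tag, so DVC's Python API can already read an artifact at a given tag today; a minimal sketch (the artifact path and tag name are made up):

```python
import dvc.api

# GTO versions and stages are plain Git tags, so they can be passed as a DVC revision.
# The artifact path and tag name below are placeholders for illustration.
with dvc.api.open("models/resnet.pt", repo=".", rev="cv-classification@v1.0.0", mode="rb") as f:
    model_bytes = f.read()
```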
Sorry for deprecating `artifacts.yaml` - to work with the new MR, you'll need to move your annotations to `dvc.yaml`. I tried to implement it, but considering the big scope of the task overall, I had to eventually drop the support :( If you feel like you need it, please create a ticket in the DVC repo and backlink to this discussion / share the importance of this so we can prioritize it right.
Overall, thanks for participating here, your feedback definitely made it way easier for me to understand the direction of changes!
If you need to migrate from `artifacts.yaml` to `dvc.yaml`, I wrote a little helper script; you can find it in this section of the docs, which explains the new format: https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#artifacts
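For a rough idea of what such a migration involves, here is a minimal sketch (not the actual helper from the docs; it assumes `artifacts.yaml` is a flat mapping of artifact names to their annotations and that PyYAML is installed):

```python
import yaml  # PyYAML

# Assumption: artifacts.yaml is a flat "name: annotation" mapping, as GTO wrote it.
with open("artifacts.yaml") as f:
    artifacts = yaml.safe_load(f) or {}

# Load the existing dvc.yaml, or start from scratch if there is none yet.
try:
    with open("dvc.yaml") as f:
        dvc_yaml = yaml.safe_load(f) or {}
except FileNotFoundError:
    dvc_yaml = {}

# Move the annotations under the top-level "artifacts" key, keeping anything already there.
# (Field names may still need small manual adjustments to match the dvc.yaml artifacts schema.)
dvc_yaml.setdefault("artifacts", {}).update(artifacts)

with open("dvc.yaml", "w") as f:
    yaml.safe_dump(dvc_yaml, f, sort_keys=False)
```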
@aguschin great work! Love seeing the value-proposition of GTO making it into DVC :)
I have a question: if you declare models in the artifacts section of dvc.yaml, do you still declare them as outputs of a pipeline stage? For example:
```yaml
artifacts:
  cv-classification:          # artifact ID (name)
    path: models/resnet.pt    # same
    type: model
    desc: 'CV classification model, ResNet50'
    labels:
      - resnet50
      - classification
    meta:
      framework: pytorch

stages:
  train_resnet50:
    cmd: python -m src.model.train.resnet50
    outs:
      - models/resnet.pt      # same
```
Thanks @bgalvao! Yes - adding `artifacts:` doesn't change what you have in stages or in other parts of `dvc.yaml`. Does it make sense?
it does :D
Awesome work as always @aguschin !
Thank you 🙏
> We'll be working on implementing `dvc get . mymodel@latest` and `dvc get . mymodel#prod` now to allow using models without involving GTO explicitly - that should make it easier.

@aguschin which GitHub issue can we follow to know about the progress on this?
@francesco086, it's https://github.com/iterative/dvc/issues/9100#issuecomment-1559148246
Linking the comment that has a summary of the whole issue, so you can skip reading the other parts of it.
Just had a discussion with @dberenbaum yesterday about this. Overall, feels like there are too many issues blocked by missing GTO/DVC integration. At the same time, I feel like GTO is not useful without DVC and is more like a plugin to DVC.
So, two options are possible: build an integration in GTO, or just merge GTO into DVC.
There are not many decent ways to actually store binaries in a repo, which makes it hard to imagine using GTO without DVC in real-world scenarios. I can recall these options:
Supporting option 4 in GTO, to have `upload`/`download` functionality as in https://github.com/iterative/gto/issues/307, will require integrations with each place (for S3 it would be fsspec probably, for Artifactory some Python client of theirs, etc). It makes me think these integrations could also be part of DVC as well (`dvc import-url`, etc).
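To give a sense of what even the simplest of those per-backend integrations entails, here is a minimal upload/download sketch over fsspec for S3 (bucket and paths are placeholders; the `s3` protocol needs the s3fs package, and Artifactory or other backends would each need their own client code):

```python
import fsspec

# Placeholder bucket and paths; the "s3" protocol needs the s3fs package installed.
fs = fsspec.filesystem("s3")

# "upload": copy the local artifact to the remote location
fs.put("models/resnet.pt", "my-bucket/registry/resnet.pt")

# "download": fetch it back when someone consumes that artifact version
fs.get("my-bucket/registry/resnet.pt", "models/resnet.pt")
```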
Now, if we can't imagine a good use-case for GTO without DVC, or all those use-cases would require some machinery that could also be part of DVC, why not just merge GTO into DVC?
Asking your opinion @francesco086 @shortcipher3 @bgalvao since you're the most active GTO users I know about :)
(This is not something we need to decide right away, since we can just build this integration inside of GTO, and then merge - but I think merging could make it more straightforward. I'm just collecting opinions for now to have a more detailed picture.)