iterative / gto

🏷️ Git Tag Ops. Turn your Git repository into Artifact Registry or Model Registry.
https://dvc.org/doc/gto
Apache License 2.0
140 stars 16 forks

Make `show` and other commands work with remote repos #25

Closed aguschin closed 1 year ago

aguschin commented 2 years ago

Right now they work on local repos only, that's a limitation of gitpython.

aguschin commented 2 years ago

Asked a question in scmrepo about this - maybe we can use it.

francesco086 commented 2 years ago

Hi, I'd like to contribute to this one. Any suggestion on where to start from?

aguschin commented 2 years ago

Thanks for asking!

There are a few parts to this:

  1. Read-only operations (show, describe, check-ref, history, stages). Specifically:
     - 1a. read Git tags (check-ref; history and show without commits)
     - 1b. read commits (describe)
     - 1c. read both (show --all-commits, history, stages)
  2. Write operations (assign, register, deprecate, annotate, remove). Specifically:
     - 2a. create/remove git tag(s) and push (assign, register, deprecate)
     - 2b. create commit and push (annotate, remove)

AFAIK, the only way to support this in full is to clone the repo locally to a temp dir first. In that case, the ideal way would be to contribute to scmrepo in the issue above, I suppose, considering DVC already has something like this implemented.

There is also an option to support 1a partially. The thing is, you can read Git tags from a remote repo with the git CLI like this: git ls-remote --tags https://github.com/iterative/example-gto. I didn't check whether this is supported in GitPython. The justification for this approach is this PR - i.e. all artifacts that don't have Git tags are considered unregistered and are optional to show in gto show.

This partial option covers the gto show command without the --all-commits and --all-branches flags, but won't enable gto describe, gto history, etc. Since it should be much quicker than fully cloning the repo, eventually we'll need to support both, I suppose.
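A minimal sketch of that partial 1a approach, shelling out to the git CLI (the function name here is ours, not GTO's):

```python
import subprocess

def ls_remote_tags(repo_url: str) -> dict:
    """Read tags of a remote repo without cloning it, via `git ls-remote --tags`."""
    out = subprocess.run(
        ["git", "ls-remote", "--tags", repo_url],
        check=True, capture_output=True, text=True,
    ).stdout
    tags = {}
    for line in out.splitlines():
        sha, ref = line.split("\t")
        # drop the "refs/tags/" prefix and the "^{}" suffix of peeled annotated tags
        tags[ref.removeprefix("refs/tags/").removesuffix("^{}")] = sha
    return tags
```

Conveniently, `git ls-remote` also accepts local paths, so this is easy to test offline.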

aguschin commented 2 years ago

Found a similar explanation I've posted in another issue earlier https://github.com/iterative/gto/issues/220#issuecomment-1198236090

francesco086 commented 2 years ago

Thanks @aguschin ! I would surely start from the read-only operation.

Reading through what you sent me, I found this especially illuminating: https://stackoverflow.com/a/1179728/19782654 Git is distributed, but often (always?) there is a reference remote repo. One could choose to rely on those and use the GitLab/GitHub APIs. Otherwise there is no way around it: one needs to do a git clone. For tags only, one could also rely on git ls-remote as you suggested.

I think going for a git clone into a tmp folder is probably the most general (it is a necessary step for the write functionality too) and robust solution. Do you agree? Shall we go for that straight away? I am new to scmrepo: should the git clone functionality live in that repo, or in GitPython?

francesco086 commented 2 years ago

Related issue on scmrepo: https://github.com/iterative/scmrepo/issues/41

aguschin commented 2 years ago

I think going for a git clone into a tmp folder is probably the most general (it is a necessary step for the write functionality too) and robust solution. Do you agree? Shall we go for that straight away?

Sure, we can start with this, and keep git ls-remote as a future optimization.

I think @pmrowla summarized it well in this comment. GitPython is a library that calls the git CLI and provides an API over it (the alternatives are dulwich and pygit2). scmrepo in turn wraps GitPython, Dulwich and pygit2, providing a single interface to them, IIUC.

Initially I used GitPython in GTO because scmrepo didn't have the needed functionality at that time, IIRC (i.e. iterating over tags and reading their info). Now we need to make a temporary repo clone, and though we could implement this in GTO again, DVC already has this implemented, so we could take it and adapt it. A better approach would be to move that functionality from DVC to scmrepo, and then reuse it in both GTO and DVC.

Maybe @pmrowla could give us more guidance on this (in the scmrepo issue)?

francesco086 commented 2 years ago

Two questions:

  1. where should we store the cloned Git repo? I initially thought of .gto/cache, but .gto is used for the configuration. Would it make sense to move the current gto config to .gto/config? Then we could have a cache folder inside .gto/.

  2. I would start with the gto show command, if that's ok, as I think it's the most useful. But perhaps there are reasons to start with another one?

aguschin commented 2 years ago
  1. Let's start simple by cloning to a temp dir. For handling the temp dirs (creating and deleting them) you can use tempfile.TemporaryDirectory() from https://docs.python.org/3/library/tempfile.html#examples This is a simple approach that will do for now (even DVC uses it, @pmrowla please correct me if I'm wrong). It may not allow us to cache things between CLI calls (e.g. if you call GTO multiple times on the same repo), but once the simple approach works, we can consider ways to optimize it. On the bright side, it doesn't require us to move .gto to .gto/config, which keeps things easier for now in terms of compatibility.
  2. Yeah, please start with whichever you like more. gto show sounds like the most useful to me as well. In any case, once you implement this for a single command, repeating it for others should be a trivial task.
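Sketched out, point 1 might look something like this (names like `is_remote` and `local_copy` are hypothetical, and this shells out to git rather than going through GitPython):

```python
import subprocess
from contextlib import contextmanager
from tempfile import TemporaryDirectory

def is_remote(repo: str) -> bool:
    # crude heuristic for "this is a URL, not a local path"
    # (an assumption, not GTO's actual check)
    return repo.startswith(("http://", "https://", "ssh://", "git@", "file://"))

@contextmanager
def local_copy(repo: str):
    """Yield a local path for `repo`, cloning remote URLs into a temp dir
    that is deleted when the context exits."""
    if is_remote(repo):
        with TemporaryDirectory() as tmpdir:
            subprocess.run(["git", "clone", repo, tmpdir],
                           check=True, capture_output=True)
            yield tmpdir
    else:
        yield repo
```

Local paths pass through untouched, so callers don't need to know whether the repo was remote.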
francesco086 commented 2 years ago

Ok, I thought of tmpfiles; it's just that from direct experience I know each call will be slow and therefore a little cumbersome. But you are right, let's start like this; optimization comes later 😄

Thanks for your input!


pmrowla commented 2 years ago

Temp directories (without any persistent caching between CLI calls) is what DVC does: https://github.com/iterative/dvc/blob/main/dvc/external_repo.py

francesco086 commented 2 years ago

I have been working a bit on this (https://github.com/francesco086/gto/pull/1).

I have one design decision to make: who is responsible for cleaning up the cloned Git directory? What should the mechanism be? Currently I am using a TemporaryDirectory, and in the tests I call TemporaryDirectory.cleanup(). How should cleanup be triggered after the repo has been "initialized", as in e.g. https://github.com/iterative/gto/blob/main/gto/registry.py#L62 ? Any suggestion?

aguschin commented 2 years ago

One option that may work (looks like a hacky one to me) is to use a with ... yield construction. Something like (raw draft):

_TMP_REPOS = []  # keep generators alive so temp dirs survive until process exit

def return_repo(path):
    if remote_url(path):
        with TemporaryDirectory() as tmpdir:
            clone_repo(path, tmpdir)
            yield tmpdir
    else:
        yield path

class GitRegistry(BaseModel):
    ...
    @classmethod
    def from_repo(cls, repo: Union[str, Repo], config: RegistryConfig = None):
        if isinstance(repo, str):
            try:
                gen = return_repo(repo)
                _TMP_REPOS.append(gen)  # hold the reference so cleanup happens at exit
                repo = git.Repo(next(gen), search_parent_directories=True)
            except (InvalidGitRepositoryError, NoSuchPathError) as e:
                raise NoRepo(repo) from e
        if config is None:
            config = read_registry_config(
                os.path.join(repo.working_dir, CONFIG_FILE_NAME)
            )

As per the Stack Overflow answer, the temp dir gets cleaned up when the generator is garbage collected, which happens at process exit at the latest. That is fine for us, since a GitRegistry can be used for as long as the process exists.

I'll search for a more elegant solution, but this may suffice :)

francesco086 commented 2 years ago

Thanks for the suggestion; it indeed looks very hacky... Reading the Stack Overflow discussion you mentioned in more detail, following this approach sounds like quite a bad idea to me. The directory could be deleted at any time (it's nondeterministic). I don't even want to think about what kind of headaches one could get later on... I strongly advise going another way.

From my side I see 2 options:

  1. go back to the idea of a cache folder (for now we could simply name it ".gto_remote_repos", to avoid clashes with the .gto config file)
  2. refactor quite a lot in gto and enclose all command execution within a with statement, to ensure that the tmp directory is deleted at the end

Of course I am open to discuss this :)

aguschin commented 2 years ago

go back to the idea of a cache folder (for now we could simply name it ".gto_remote_repos", to avoid clashes with the .gto config file)

Sure, we can try it if it looks easier. Two questions:

  1. Could we get it updated each time, removing everything that no longer exists in the remote? (I assume yes, but we need to check: if, say, someone deleted a few tags/commits in the remote, we do the same on the local copy once you instantiate GitRegistry, etc.)
  2. I assume there should be a user-friendly way to clean up the cache, e.g. $ gto gc (or maybe there is something better).

Re refactoring and the 2nd option, I think we could do something like this (thanks @mike0sv for the discussion and the code snippet)

class GitRegistry:
    ...
    @classmethod
    @contextmanager
    def from_path(cls, path):
        repo = cls(path)
        try:
            yield repo
        finally:
            # clean up the temp clone even if the caller raised
            if repo.needs_cleanup():
                repo.cleanup()

    @needs_clone
    def ls(self):
        ...

    def needs_cleanup(self):
        return self.is_remote and self._local_path is not None

def needs_clone(f):
    def inner(self, *args, **kwargs):
        # lazily clone the remote repo the first time a method needs it
        if self._local_path is None:
            self._local_path = self._clone_repo()
        return f(self, *args, **kwargs)
    return inner

Then all API calls would have to look like:

with GitRegistry.from_path(path) as repo:
   gto.api.register(repo, ...)

It will require changing a dozen lines in gto/api.py and a few lines in tests.

I'll think about this tomorrow morning, but for now the 1st option looks like it has a big pro (the cache) and is simpler (although it gives you less control over your cache).

Btw, this also reminded me that the @iterative/studio-backend team wanted a way to cache self.get_state() calls in the GitRegistry functions to parse Git repos faster, but that's a different story IMO. (We could get that done by having some "lazy" flag that would make GitRegistry not update the state and reuse the existing one unless asked directly.)

UPD: the needs_clone decorator above clones the repo only if it's needed for a certain method to run. In some cases we may just run git ls-remote --tags, get the tags, and not clone the repo at all.

francesco086 commented 2 years ago

Option 1. means we are building a stateful solution. Option 2. is stateless.

  1. can lead to higher performance, but can get complex to handle (e.g., as you mentioned, how do you clean the cache? Also, do we want to do a git pull if the repo is already in the cache, and how and when?)

  2. is simpler from a user perspective, but performance-wise it could get annoying (a git clone can be substantially slower than a git pull). But perhaps with some tricks we could get good performance, e.g. with shallow clones.

I leave the decision to you @aguschin; this is pretty much a fundamental design choice for gto :) Then we can work out the details.
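For reference, the "shallow clone" trick mentioned above is a one-flag change; a sketch, with the caveat that shallow history can hide tags, which matters for GTO's read commands:

```python
import subprocess

def shallow_clone(url: str, dest: str) -> None:
    """Clone only the most recent commit (`--depth 1`). Much faster on big
    repos, but tags pointing at commits outside the shallow history are not
    fetched, so it may be too aggressive for commands that read all tags."""
    subprocess.run(["git", "clone", "--depth", "1", url, dest],
                   check=True, capture_output=True)
```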

aguschin commented 2 years ago

I can easily imagine someone who is trying to run many commands on a remote repo, so I definitely would prefer an approach with cache. I would assume, even if you want to register/assign, you would first run gto show to recall the exact artifact name. Then you could run gto show modelname to find out the specific version you need to assign a stage to, etc. Cloning a repo each time can make things 2-3x slower in this case.

do we want to do a git pull if the repo is already in the cache, how and when

I would assume we need to do that each time GitRegistry is initialized. Or even each time we re-calculate self.get_state().

The other problem here could be someone running several commands in parallel on the same remote repo. E.g. I run two annotate commands, one for the main branch and the other for the dev branch: they may clash, because you create a commit in both cases and (I assume) git will need to check out a branch to create it. For now we can use some kind of locking mechanism, so one process waits until the other has completed. We need to check which commands run in parallel are prone to this (I assume this doesn't affect read operations like show, history, describe; register and assign should also be unaffected). And since we are considering moving artifacts.yaml to dvc.yaml, annotate may not be a problem in the future at all.

aguschin commented 2 years ago

For the sake of simplicity of the first implementation, we could even use an external context manager plus a decorator to clone the repo:

# gto/api.py
def clone_repo_if_needed(f):
    def inner(*args, **kwargs):
        repo = args[0]
        if is_remote(repo):
            with TemporaryDirectory() as tmpdir:
                clone_repo(repo, tmpdir)
                return f(tmpdir, *args[1:], **kwargs)
        return f(repo, *args[1:], **kwargs)
    return inner

@clone_repo_if_needed
def method(repo, ...):
    ...

That won't require any changes in GitRegistry, and the cloned repo exists only for the duration of the API call... WDYT? If that works, we can implement it first, and then improve it with a cache on top.

UPD: updated the snippet

mike0sv commented 2 years ago

I'd say stateless approach (option 2) is preferable if we can optimize the performance. Let's see what we actually need from temporary cloned repo.

There is a way to only pull certain files (or maybe even none of them), see this

francesco086 commented 2 years ago

:) just to throw more ideas (and confusion) in: in the past, when I had a hard time choosing between stateless and stateful, joblib Memory came to the rescue. It lets you design the application as stateless while bringing the performance advantages of stateful applications. However, I am not sure it can be helpful here.

@aguschin I like the decorator idea very much, a good way to close the with statement without being intrusive in the code :)

francesco086 commented 2 years ago

@mike0sv you are right, one can fetch only what is strictly necessary. However, from my limited knowledge of the code, it seems this could require quite a few changes to the pre-existing code.

Shall I try to go this way? stateless, using a tmp directory as suggested by @aguschin?

aguschin commented 2 years ago

@mike0sv thanks for the suggestion. Yes, I believe we need to get (1) the list of tags, (2) artifacts.yaml from all the commits. I tried the answer from the Stack Overflow link you posted and it looks like it works, partially (it still cloned README and requirements.txt for me, but at least didn't clone the models/ and .github/ folders):

git clone \
  --filter=blob:none \
  --sparse \
  https://github.com/iterative/example-gto
cd example-gto
git sparse-checkout set artifacts.yaml --skip-checks

@francesco086, let's try the decorator idea first then) We can also keep @mike0sv's suggestion as a future improvement, in case it can't be easily implemented right away for some reason (like you can't use gitpython and need to call subprocess for it).

francesco086 commented 2 years ago

For today I managed to implement the decorator suggested by @aguschin with a few touches: https://github.com/francesco086/gto/blob/03c5622ed5b5039b95b4e626030932a033a3e9ba/gto/git_utils.py#L12

Tomorrow I will try to apply it to the show command (first I need to figure out how to write the tests for it).

Perhaps you can help me with a couple of things:

  1. do you know a more professional way to check whether a string is a valid URL for a git repo? I would not bet anything on my solution in https://github.com/francesco086/gto/blob/03c5622ed5b5039b95b4e626030932a033a3e9ba/gto/git_utils.py#L43
  2. For whatever reason the PyCharm debugger does not seem to work with the gto project. Have you experienced the same? Any hint?
aguschin commented 2 years ago
  1. Please take a look at this https://stackoverflow.com/a/22312124/4890419 , I assume it should work, maybe with slight modifications (like when there is no .git at the end of the URL).
  2. Hmm, I don't use PyCharm or its debugger specifically. I know @mike0sv does. @mike0sv, maybe you can give some hints?
  3. Re testing: I think you can test by wrapping some dummy function that returns the path to the dir and checking that the dir was indeed created (you can clone example-gto or the gto repo itself, I think).
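For point 1, a loose pattern along those lines might look like this (a sketch only; it deliberately accepts URLs without the `.git` suffix and is certainly not exhaustive, unlike the regex in the linked answer):

```python
import re

# Rough shape of git remote URLs: scheme://[user@]host/path, or scp-like user@host:path.
GIT_URL_RE = re.compile(
    r"^(?:(?:https?|git|ssh)://(?:[\w.-]+@)?|[\w.-]+@)"  # scheme or scp-style user@
    r"[\w.-]+"                                           # host
    r"[:/][\w./~-]+?(?:\.git)?/?$"                       # path, optional .git suffix
)

def looks_like_remote_git_url(s: str) -> bool:
    """Heuristic check that a string is a remote git URL rather than a local path."""
    return bool(GIT_URL_RE.match(s))
```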
francesco086 commented 2 years ago
  1. Please take a look at this https://stackoverflow.com/a/22312124/4890419 , I assume it should work with slight modifications maybe (like if there is no .git in the end of the URL).

Thanks! I noticed this, but I suppose it is not a major problem, right?

aguschin commented 2 years ago

it is not a major problem

I assume yes, for some time at least)

francesco086 commented 2 years ago

PR ready :)

I tried it on my production artifact registry for fun, it works like a charm :)

aguschin commented 2 years ago

Cool! Checking it out!

francesco086 commented 2 years ago

Now I think I can create a PR for check-ref, history and stages :) on it!

francesco086 commented 2 years ago

I have just noticed a problem: I changed the help text for the --repo option (what you get from gto show -h, for example) to "Repository to use (remote repos accepted)".

I did not notice that the change would propagate to all other commands XD So, as for the code in master, all gto commands claim to accept remote repos :D

I will change it back with the next PR, ok? Perhaps we can first make the changes on all commands and then update the help message.

francesco086 commented 2 years ago

Here the new PR: https://github.com/iterative/gto/pull/273

This time should be totally smooth :)

aguschin commented 2 years ago

Good stuff! I guess now we could make a release with this! I assume implementing the write operations is heavier stuff that first needs some helper options to push the tags/commits from locally cloned repos. Then adding this for remote repos should be trivial.

francesco086 commented 2 years ago

Yup! But I think it won't take too long now that I'm starting to get the hang of it...

francesco086 commented 2 years ago

Forgot one last read command: https://github.com/iterative/gto/pull/275

aguschin commented 2 years ago

@francesco086, for the upcoming PRs, when (and if) you do them: it would be good to add a couple of sentences in the README about --repo now accepting remote repositories, and post an example. That should increase the visibility of this feature.

francesco086 commented 2 years ago

Hi @aguschin ! I was planning to do it together with the change in the CLI help, after the write functionality is also in place.

I am planning to go fast on the development 😁 It's just that today and tomorrow I have some deadlines to meet…

Actually, perhaps I'd better already ask about a design choice. For the write functionality there is the little problem of the time gap between git clone and commit push: in this gap someone/something else may have pushed, and then our push will fail. At this point in the development you may find this concern far-fetched (what are the chances?), but I have actually already encountered these errors in my production pipelines using my own scripts. I am planning to ignore this concurrency issue on the first round, if that's alright with you. Ignoring it, I am quite confident that next week I will have the write functionality in place ;) Then we can change the help and readme.

And then we can address the concurrency issue and optimize the git cloning.


aguschin commented 2 years ago

Good point about concurrency! Yes, let's ignore it on the first round I think.

francesco086 commented 2 years ago

I started with register. I just noticed that (as it should) it currently only creates a tag. Does this make sense for a remote repo? I mean, I suppose that in that case I would also like to commit and push some artifacts, right...?

Should we, for now, leave gto behaving like this for remote repos too? Perhaps in the future one could introduce new functionality...?

francesco086 commented 2 years ago

I have a first version that works (I checked it on my model registry), however:

  1. When you run gto register --repo URL NAME the command outputs:

    Created git tag 'NAME@vX.Y.Z' that registers version
    To push the changes upstream, run:
    git push NAME@vX.Y.Z

    I tried to amend this by making the code echo below it:

    As the repo is remote, we executed the command `push --tags` for you

    This is sub-optimal, but for now I thought it was ok. I don't know how to stop it from echoing the first message without making a mess in the code...

  2. I run push --tags. Is it ok, or should I try to run git push NAME@vX.Y.Z?

  3. I currently implemented it as part of the usual git_clone_remote_repo decorator, where one can now specify push_tags=True. On the one hand this is cool, because I barely modified the api code. On the other hand, I was thinking that perhaps all write functionality, even when working on a local repo that has a remote, could benefit from it. One could add the option --auto-push to all of them, and then push the created tags. What do you think?

  4. I registered for the HacktoberFest, is it ok if I create the PR/PRs next week? :P In case you want to have a look at the code already, it's here

aguschin commented 2 years ago

I don't know how to stop it from echoing the first message without making a mess in the code...

I'll take a look sometime next week. Please ping me if you need help with this any time sooner)

I run push --tags. Is it ok, or should I try to run git push NAME@vX.Y.Z?

No, you shouldn't. In fact, that message is wrong: it should be git push origin NAME@vX.Y.Z.

The only problem I see here: if you clone a repo, and meanwhile someone deletes an unrelated tag, then git push --tags will not only push your new tag but also re-push the deleted one. To handle this case it should rather be git push origin $TAGNAME.
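Pushing only the newly created tag then amounts to naming the ref explicitly; a sketch with the git CLI (the helper name is ours):

```python
import subprocess

def push_single_tag(repo_dir: str, tag_name: str) -> None:
    """Push exactly one tag, avoiding the deleted-tag resurrection problem
    that `git push --tags` has."""
    subprocess.run(
        ["git", "-C", repo_dir, "push", "origin", f"refs/tags/{tag_name}"],
        check=True, capture_output=True,
    )
```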

One could add the option --auto-push to all of them, and then one could push the created tags

Yes, that would be a good option to handle this IMO. Citing @omesser feedback here.

I registered for the HacktoberFest, is it ok if I create the PR/PRs next week? :P

Sure, np! TBH, I wanted to mark this issue as hacktoberfest, but then decided to skip it, since you're already working on it. Let me add that label now then :)

francesco086 commented 2 years ago

Yes, that would be a good option to handle this IMO

That's good to hear! Then I will modify what I have done in favor of this.

francesco086 commented 1 year ago

Here the PR: https://github.com/iterative/gto/pull/281

If that is ok, then adding support for remote repos is trivial :)

francesco086 commented 1 year ago

Here the next PR to add support for remote repos: https://github.com/iterative/gto/pull/286

francesco086 commented 1 year ago

Mmm... strange, why didn't https://github.com/iterative/gto/pull/286 appear here like #281 did?

I'm asking for my Hacktoberfest goal of 4 successful PRs :P

Any idea @aguschin ? Can you help? Could it be because the Project was not set? That's the only difference I can see

francesco086 commented 1 year ago

Besides that, the next targets on the radar are annotate and remove. In this case it will probably mean updating the yaml file, git commit, and git push, right?

I would again take the approach of first creating the option --auto-push, is that ok?

francesco086 commented 1 year ago

I would again take the approach of first creating the option --auto-push, is that ok?

Mmm... on second thought, one may want to choose between --auto-commit and --auto-push ...?

aguschin commented 1 year ago

Any idea

Better now? I mentioned this issue in the PR and it appeared just above ^

Besides that, the next targets on the radar are annotate and remove. In this case it will probably mean updating the yaml file, git commit, and git push, right?

Yes! But please consider that the artifacts.yaml functionality may move to dvc (this won't be decided sooner than in a couple of months, though). I would like this to be implemented anyway, since there is a chance it will stay in GTO, but I didn't want you to proceed without being aware of this.

Mmm... on a second thought, one may want to choose between --auto-commit and --auto-push ...?

Interesting idea! Indeed, one may want to commit the change but wait some time before pushing it. It may be handy to implement both :)

aguschin commented 1 year ago

Having --auto-push only is totally OK for me too. We could add --auto-commit once anyone requests it.

francesco086 commented 1 year ago

Thanks for the help and feedback @aguschin

To push, one first needs to commit, and this is the part that will require more work, I'm afraid. So I would first target adding the --auto-commit option for the annotate and remove commands.

I think it is tricky because one wants to commit only the changed file (currently artifacts.yaml) and otherwise leave the git status unchanged. If other files have been staged, one would have to unstage them, add artifacts.yaml, commit, and then re-stage the files that were unstaged. Or do you have a suggestion on how to achieve this more easily?
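One possible way to avoid the unstage/re-stage dance entirely: `git commit` with an explicit pathspec (plus `--only`, which is implied when paths are given) records just that file's working-tree content and leaves the rest of the index alone. A sketch, assuming the file is already tracked (a brand-new artifacts.yaml would still need a `git add` first):

```python
import subprocess

def commit_single_file(repo_dir: str, path: str, message: str) -> None:
    """Commit only `path`, leaving other staged/unstaged changes untouched.
    `--only` makes git take the working-tree content of `path` regardless
    of what is currently staged in the index."""
    subprocess.run(
        ["git", "-C", repo_dir, "commit", "--only", "-m", message, "--", path],
        check=True, capture_output=True,
    )
```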

aguschin commented 1 year ago

@francesco086

Got it. I think git stash + git stash pop should work. That's what we do in dvc exp, if I'm not mistaken (@skshetry, could you please clarify?)


The question I have: what if there were already some uncommitted changes in artifacts.yaml? Let's suppose you run

$ gto annotate mymodel --type model
$ gto annotate mymodel --path X --auto-commit

now you probably want to get both path and type in artifacts.yaml. But with git stash ... git stash pop you'll lose it. 🤔 Should we just error out on the 2nd command then? ("You have uncommitted changes in artifacts.yaml, please commit them before running --auto-commit"?)