iterative / scmrepo

SCM wrapper and fsspec filesystem for Git for use in DVC.
https://dvc.org
Apache License 2.0
21 stars 14 forks source link

support scp-style shorthand urls with users other than git@ #346

Closed skshetry closed 8 months ago

skshetry commented 8 months ago

See discussion: https://discuss.dvc.org/t/dvc-import-fails-with-git-failed-to-fetch-ref-from/1997

scp shorthand URL format is as follows:

[user@]host.domain.com:path/to/resource

So, user can be anything, and is optional.

[!NOTE] Does not work with absolute paths in scp-style URLs, eg: git@github.com:/path/to/resource.

codecov-commenter commented 8 months ago

Codecov Report

Attention: Patch coverage is 55.88235% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 77.95%. Comparing base (1096899) to head (d7c917a).

Files Patch % Lines
src/scmrepo/git/lfs/client.py 18.75% 12 Missing and 1 partial :warning:
src/scmrepo/git/backend/pygit2/__init__.py 66.66% 1 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #346 +/- ## ========================================== - Coverage 78.78% 77.95% -0.83% ========================================== Files 39 41 +2 Lines 5156 5176 +20 Branches 931 934 +3 ========================================== - Hits 4062 4035 -27 - Misses 930 965 +35 - Partials 164 176 +12 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

shcheklein commented 8 months ago

thanks @skshetry !

sisp commented 8 months ago

I am not sure how lfs uses the user, but it would be nice to remove the assumption about the user from scp-style shorthand URLs.

@skshetry LFS uses the user to run

ssh <user>@<host> git-lfs-authenticate <project_path>.git download|upload

to retrieve credentials for making requests to the batch API. I don't see any reason why the user would have to be git, I've never seen another one though, but it seems it does happen. I think we can relax the SSH URL regex to support arbitrary users.

sisp commented 8 months ago

The current SSH URL regex doesn't support a leading / in the URL path when the host and path are separated by a :. I'll need to check whether GitLab and GitHub allow for the leading / in their SSH URLs.

skshetry commented 8 months ago

@skshetry LFS uses the user to run

ssh <user>@<host> git-lfs-authenticate <project_path>.git download|upload

to retrieve credentials for making requests to the batch API. I don't see any reason why the user would have to be git, I've never seen another one though, but it seems it does happen. I think we can relax the LFS SSH URL regex to support arbitrary users.

We convert it to http url, right? So, what is user going to be used for? Also note that the user is optional.

The current LFS SSH URL regex doesn't support a leading / in the URL path when the host and path are separated by a :. I'll need to check whether GitLab and GitHub allow for the leading / in their SSH URLs.

The resource path could be anything. If it's a leading slash, it's an absolute path, otherwise it's a relative path to the user's directory.

sisp commented 8 months ago

We convert it to http url, right? So, what is user going to be used for?

Well, we infer the LFS API URL from the Git URL, so there the user is irrelevant. But we need the user for the

ssh <user>@<host> git-lfs-authenticate <project_path>.git download|upload

command. But there what matters is the right user is used, and it should be same user used for cloning via SSH. So I think it's fine/correct to use any (i.e. the provided) user and not only git.

I've tested absolute and relative paths on GitHub and GitLab, both work for cloning via SSH URLs and for running git-lfs-authenticate. Would you like (me?) to refactor the _SSHLFSClient implementation to use the new SCP regex?

skshetry commented 8 months ago

Thank you. I think I understand it a bit more.

I think the check like follows are enough then, right?

if url.startswith("ssh://") or is_scp_style_url(url):
  ...

From what I understood, we can get rid of _URL_PATTERN, right? If that's enough, let me know and I can do the changes. Otherwise, feel free to push in the pull request.

sisp commented 8 months ago

The check and removal of _URL_PATTERN look good. But there's more – roughly this:

  1. Extracting information from the SSH URL must be adapted to use SCP_REGEX:

    https://github.com/iterative/scmrepo/blob/1096899158e9d195b4f6fcae9d320ebdbaf0912e/src/scmrepo/git/lfs/client.py#L236-L239

    SCP_REGEX doesn't capture a port, so port is not matched anymore; user is matched now; SCP_REGEX doesn't use named capture groups, so capture group matches must be extracted differently (watch out for the "user" capture group including a trailing @, which needs to be stripped – or we adapt SCP_REGEX).

  2. _SSHLFSClient's constructor must accept an optional user argument (should default to None), and port must be removed (class instance attributes must be adapted accordingly as well as their uses in _get_auth_header and _git_lfs_authenticate).

  3. _git_lfs_authenticate must accept a user argument, and port must be removed. The user keyword argument in the self._ssh.run_command(...) call must receive the user argument.

sisp commented 8 months ago

If you'd prefer I refactor the LFS code, feel free to assign the PR to me, @skshetry.

skshetry commented 8 months ago

If you'd prefer I refactor the LFS code, feel free to assign the PR to me, @skshetry.

Hi, I pushed aa1ce61. Let me know what you think. :)

skshetry commented 8 months ago

Thank you @sisp for the reviews. 🙂

sisp commented 8 months ago

No problem, @skshetry. 🙂