iterative / scmrepo

SCM wrapper and fsspec filesystem for Git for use in DVC.
https://dvc.org
Apache License 2.0
21 stars 13 forks source link

support `http.extraheader` #262

Open pmrowla opened 1 year ago

pmrowla commented 1 year ago

https://github.com/jelmer/dulwich/issues/882

in libgit2/pygit2 this requires exposing git_fetch_options.extra_headers in the pygit remote callbacks

dberenbaum commented 1 year ago

@pmrowla What is the level of effort to support http.extraheader? Thinking about this some more, it would be nice to have that support because I think we increasingly have valid CI use cases that don't require CML at all, and avoiding both cml ci and iterative/setup-cml would be preferable.

pmrowla commented 1 year ago

@dberenbaum I think it would probably take me about a full sprint. One thing to note here is that the git http. config options also support a lot of ways to do url substitution which also get used in a lot of the CI platforms and I'm not sure if any of them are actually supported in either dulwich or libgit2

dberenbaum commented 1 year ago

Discussed that this isn't a priority for now given the level of effort and ability to handle this in iterative/setup-cml and/or iterative/setup-dvc (and in cml ci)

shcheklein commented 1 year ago

For the record we hit the same issue with gto here: https://github.com/iterative/gto/issues/277

Also, AFAIU cml ci requires REPO_TOKEN to be set which is an extra items - security issues, and overall requires some setup.

dberenbaum commented 1 year ago

@shcheklein Can we recommend to include setup-dvc and/or setup-cml when using our tools inside GHA?

dberenbaum commented 1 year ago

@dacbd Do you know what exactly are the requirements to include in a GHA workflow in order to push tags via gto?

shcheklein commented 1 year ago

@shcheklein Can we recommend to include setup-dvc and/or setup-cml when using our tools inside GHA?

we probably can. But it also means we'll need to do the same with gto for example which becomes a bit heavy and unexpected (at least by its name - setup-cml, setup-dvc).

Also, by biggest concern that it requires and extra token to be set - REPO_TOKEN AFAIU.

dberenbaum commented 1 year ago

we probably can. But it also means we'll need to do the same with gto for example which becomes a bit heavy and unexpected (at least by its name - setup-cml, setup-dvc).

I can't find it ATM, but we had some discussion about whether we should consolidate to use a single action everywhere like setup-dvc even when using cml, gto, etc., or to have a setup-gto that would take care of this.

Also, by biggest concern that it requires and extra token to be set - REPO_TOKEN AFAIU.

@dacbd To confirm, is it required to include this?

        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}

@shcheklein Is your main concern about including this or managing the secret in GH? I think it only uses the auto-generated secrets.GITHUB_TOKEN, so it does not require setting up a secret manually AFAIK.

shcheklein commented 1 year ago

I think it only uses the auto-generated secrets.GITHUB_TOKEN, so it does not require setting up a secret manually AFAIK.

If that's the case, it less important to address then. I was not sure (and I'm still not) about this. I tried to use it here https://github.com/shcheklein/PR-Merge-Register and the action failed with:

{"level":"error","message":"token not found","stack":"Error: token not found\n    at new Github (/usr/local/lib/node_modules/@dvcorg/cml/src/drivers/github.js:91:23)\n    at CML.getDriver (/usr/local/lib/node_modules/@dvcorg/cml/src/cml.js:161:35)\n    at CML.ci (/usr/local/lib/node_modules/@dvcorg/cml/src/cml.js:487:25)\n    at exports.handler (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/repo/prepare.js:13:13)\n    at /usr/local/lib/node_modules/@dvcorg/cml/node_modules/yargs/build/index.cjs:1:8993\n    at /usr/local/lib/node_modules/@dvcorg/cml/node_modules/yargs/build/index.cjs:1:4949"}

I haven't had to time dig further tbh so see if it's about token being required or something else.

dacbd commented 1 year ago

Can we just add the token rewriting patch to scmrepo if it detects its running in github actions? Until there is proper support for http.extraheader

I try and do a draft PR so you can see what it might look like.

@dacbd Do you know what exactly are the requirements to include in a GHA workflow in order to push tags via gto?

Just need to include a custom auth header when pushing git refs over https AUTHORIZATION: basic ***

shcheklein commented 1 year ago

@dacbd

Can we just add the token rewriting patch to scmrepo if it detects its running in github actions?

will it be rewriting the token that GH provides?

dacbd commented 1 year ago

@shcheklein no, we just need to pull token out of the git config entry and add it in the git origin as a username/password.

approx: https://token:ghs_***@github.com/org/repo.git'

shcheklein commented 1 year ago

@dacbd I guess makes sense to me. I would rely on @pmrowla judgement on this. Practically it's a patching it upstream vs implementing a GH-specific path on scmrepo (probably a temporary solution). My 2cs - if it's simple and quick it's fine to have it.

dacbd commented 1 year ago

sounds good, @pmrowla to save you a click on how did this.

https://github.com/iterative/setup-dvc/blob/d549770196ed130ba41697c4713281c349917d5e/src/utils.js#L43-L63

pmrowla commented 1 year ago

I don't think a GHA specific patch belongs in scmrepo, iirc some other users encountered a similar issue with aws codecommit and whatever their CI equivalent is.

If this is a priority we should just fix it properly and add support for http.extraheader so it works everywhere.

shcheklein commented 1 year ago

I don't think a GH specific patch belongs in scmrepo, iirc some other users encountered a similar issue with aws codecommit and whatever their CI equivalent is.

can it be generalized on the scmrepo level (e.g. if it's always about the same stuff as on GH)?

would it be faster compared to changing it upstream? Clearly it's the right way to do this, if it takes let's say 10x more time then it might make sense to consider alternatives.

pmrowla commented 1 year ago

It can't be generalized at the scmrepo level since the headers have to be set in the actual HTTP requests (that are crafted entirely in the upstream git backends).

The GHA example would work purely in scmrepo since the way GHA uses it ends up just setting the HTTP basic Authorization header and you can accomplish the same thing by rewriting the entire git remote URL to fill in http://username:password@github.com/... (with the gha token instead as password)

But the extraheader really lets you set any HTTP header you want, so in the event a git forge uses it to set something other than Authorization, we can't accomplish it via rewriting the remote URL (and have to set the raw header in the HTTP request)

shcheklein commented 1 year ago

Got it. Have you seen other (more or less meaningful, production) use case for the extraheader besides Auth?

pmrowla commented 1 year ago

Azure pipelines CI uses it to set the Authorization header to a bearer token (so it sets Authorization: Bearer <token string>. (the URL username:password rewrite workaround only works for setting Authorization: Basic <base64 encoded username:password>)

dberenbaum commented 1 year ago

Can GTO fallback to using CLI Git?

pmrowla commented 1 year ago

We could make GTO init the scmrepo instances with gitpython as the highest priority backend, but that's still not ideal since it means DVC will end up using gitpython when it calls into GTO.

dberenbaum commented 1 year ago

What about trying to prioritize gitpython in scmrepo if we detect we are in GitHub? Fixing in dulwich sounds best but the full sprint estimate makes it hard to prioritize.