sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks source link

Research: Simplify access to repository tokens for gitserver #14910

Closed asdine closed 3 years ago

asdine commented 4 years ago

For Gitserver to be able to clone and fetch repos it needs to get tokens from repo-udpater and frontend. The way it is currently implemented poses several problems:

Proposal

If this proposal is accepted, the bullet points above would be broken down into multiple issues

github-actions[bot] commented 4 years ago

Heads up @tsenart - the "team/cloud" label was applied to this issue.

tsenart commented 4 years ago

Make Gitserver use the new type/package instead of calling the other services to get tokens, and remove/deprecate unecessary endpoints on repo-updater and frontend if any.

Can you elaborate on this, it'd be useful to list all places that need updating.

tsenart commented 4 years ago

Make Gitserver use the new type/package instead of calling the other services to get tokens

If we're creating a separate custom git credential helper, do we need to modify git-server to read tokens from Postgres?

tsenart commented 4 years ago

Write a type/package that provides an unified way of getting a token/clone url for a given repo, and make all services that need to access the token use that type.

What are all the places that'd need changing?

asdine commented 4 years ago

Can you elaborate on this, it'd be useful to list all places that need updating.

What are all the places that'd need changing?

I'll update the issue with more details on that aspect

If we're creating a separate custom git credential helper, do we need to modify git-server to read tokens from Postgres?

The custom git credential helper is a binary/script executed by git. I don't know how far we can go with this program but we were thinking that the easiest would be making it simply call a gitserver endpoint and let gitserver fetch the token for it.

asdine commented 4 years ago

cc @sourcegraph/cloud @keegancsmith

tsenart commented 4 years ago

In general, SGTM. Hope we can also get +1 from @keegancsmith, but it's not a blocker.

keegancsmith commented 4 years ago

In general, SGTM. Hope we can also get +1 from @keegancsmith, but it's not a blocker.

Will review tomorrow morning.

keegancsmith commented 4 years ago

LGTM! Great stuff.

Current

I'm writing out the current state to ensure my understanding is current and to align. We do git fetch/clone in two main ways:

HTTP load

  1. Web-browser loads sourcegraph.example.com/$repo
  2. sourcegraph-frontend tells repo-updater to update $repo
  3. repo-updater's queue tells gitserver to update/clone $repo with $remote-url.
  4. gitserver fetches/clones $repo with $remote-url. Sets git config remote.origin.url as $remote-url.

EnsureRevision

  1. Browser extension does API request for $repo@$commit.
  2. sourcegraph-frontend sends request to gitserver with the option to ensure $repo@$commit exists.
  3. If gitserver doesn't have $commit. It does a git fetch using the config remote.origin.url.

There is also some extra logic around sourcegraph-frontend hydrating in $remote-url with a retry, but I'm not sure if this logic even works to be honest. I'm very excited about removing that, and instead having gitserver able to hydrate it.

Proposal

The proposal is removing any mention of $remote-url above. Instead, gitserver will be asking repo-updater for $remote-url. This sounds good to me. I have a slight concern that gitserver will now have a dependency on repo-updater which is a SPoF. But that can be solved in practice, and any downtime in repo-updater will be temporary and just result in a delay in updates via EnsureRevision.

Comments

  • Write a type/package that provides an unified way of getting a token/clone url for a given repo, and make all services that need to access the token use that type.

The proposal doesn't make it clear where the clone_url will be stored and who can read it. I am assuming it will be repo-updater which implements the "get me a token", and it will fetch it from the external services table?

Or will this type/package directly speak to the DB?

This type would choose a random external service associated with the repo and read the token from that external service config. (2d)

A random token will make it hard to debug when things go wrong. However, it does give us "eventual" updates assuming a single token works. I'm unsure about how to do this better, since its better than never updating a repository.

  • Deprecate/Drop the external_service_repos.clone_url column (1d)

If the order of the list is the order you intend to do this work, I would do this last. Otherwise the order looks great.

  • Make Gitserver use the new type/package instead of calling the other services to get tokens, and remove/deprecate unecessary endpoints on repo-updater and frontend if any. (2d)

Gitserver doesn't call any other services. It gets told the clone URL. See my description above. So this bit of work may slip and take the most time of all the points.

  • (Spike) Write a custom git credential helper to let git ask for a token whenever it's needed. This helper would call a gitserver endpoint to get a token on demand. This way, no token would be stored on disk. (1d)

This sounds like a great! Some random thoughts, answering them may take a day, but once answered another day to implement :)

  1. What URL will you store? IE you still need to set remote.origin.url. Your credential helper then receives that.
  2. How will you tell git to use the credential helper? envvar, repo config, global config?
  3. How does it interact with SSH?
  4. We have admins sometimes setting the git config / mounting secrets / etc. Does this interfere with that? If so, how do we still allow that use case?
  5. What will the credential helper be? IE will you ship a new binary in all our gitserver containers? Or maybe you can write a shell script to a temporary path and execute that?
unknwon commented 3 years ago

@keegancsmith:

Thanks for a very thoughtful response!

Proposal

The proposal is removing any mention of $remote-url above. Instead, gitserver will be asking repo-updater for $remote-url. This sounds good to me. I have a slight concern that gitserver will now have a dependency on repo-updater which is a SPoF. But that can be solved in practice, and any downtime in repo-updater will be temporary and just result in a delay in updates via EnsureRevision.

To be precise, we will remove the user credential (i.e. access token) from the $remote-url, but the network location of the repository is still living in each individual .git/config.

Comments

  • Write a type/package that provides an unified way of getting a token/clone url for a given repo, and make all services that need to access the token use that type.

The proposal doesn't make it clear where the clone_url will be stored and who can read it. I am assuming it will be repo-updater which implements the "get me a token", and it will fetch it from the external services table?

Or will this type/package directly speak to the DB?

"this type/package" directly speak to the credential storage (based on config, could be DB, Vault, GSM, etc.). On top of that, we serve endpoints (in frontend, or standalone service) to whoever (i.e. gitserver) needs a token for a particular repository on-demand.

This type would choose a random external service associated with the repo and read the token from that external service config. (2d)

A random token will make it hard to debug when things go wrong. However, it does give us "eventual" updates assuming a single token works. I'm unsure about how to do this better, since its better than never updating a repository.

  • Deprecate/Drop the external_service_repos.clone_url column (1d)

If the order of the list is the order you intend to do this work, I would do this last. Otherwise the order looks great.

@asdine: I feel I might be missing something here, I thought it is enough that we only drop the credentials from the external_service_repos.clone_url not the entire column? Otherwise seems unintuitive how to quickly identify what's the clone URL of a given repo? (It's just my concern, will have to verify)

  • (Spike) Write a custom git credential helper to let git ask for a token whenever it's needed. This helper would call a gitserver endpoint to get a token on demand. This way, no token would be stored on disk. (1d)

This sounds like a great! Some random thoughts, answering them may take a day, but once answered another day to implement :)

  1. What URL will you store? IE you still need to set remote.origin.url. Your credential helper then receives that.

Yes we still need to set remote.origin.url, mentioned above.

  1. How will you tell git to use the credential helper? envvar, repo config, global config?

We can achieve that via the global .gitconfig:

git config --global credential.helper <path to our credential helper>
  1. How does it interact with SSH?

From what I read, git shells out to ssh so git credential helper won't be part of it, we need a complementary solution to support that (e.g. maintains a .ssh/config file per gitserver), that is out of scope unfortunately.

  1. We have admins sometimes setting the git config / mounting secrets / etc. Does this interfere with that? If so, how do we still allow that use case?

Git credential helper is involved only when the git CLI couldn't determine what's the credential (in an interactive shell, you get prompted to enter username and password). So if admins configured credentials in other ways that is readable by git CLI, all of them will continue to work. Think of Git credential helper as the catch-all.

  1. What will the credential helper be? IE will you ship a new binary in all our gitserver containers? Or maybe you can write a shell script to a temporary path and execute that?

A shell script is enough to get started but @asdine and I worried the maintainability and the expertise we have in bash 😅 So most likely will be a Go binary shipped in all our gitserver containers.

keegancsmith commented 3 years ago

The reason I mentioned a shell script is it's just a bit of glue so you avoid needing to ship an extra binary. IE your shell script can simply speak to a local unix or tcp socket that speaks to gitserver. So won't be more than a few lines and potentially static.

keegancsmith commented 3 years ago

@tsenart and I are hacking on removing some tech debt around CachedGitRepo, GitRepo and CommandRetryer. We have come up with a plan, which essentially is implementing the "get remote URL" part of this issue. We may have actually solved all of this. So if someone plans to tackle this (if we haven't closed it yet), reach out to us.

asdine commented 3 years ago

@keegancsmith sounds good! I'm not planning on working on this issue this cycle