sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Packages: gracefully handle 403 errors when fetching packages #39297

Open · olafurpg opened this issue 2 years ago

olafurpg commented 2 years ago

Many Rust crates are currently failing to "clone" due to 403 errors: https://sourcegraph.com/site-admin/repositories?status=failed-fetch

[Screenshot: CleanShot 2022-07-22 at 14 17 58]

What these crates have in common is that they don't exist on crates.io; for example, https://crates.io/crates/alloc

[Screenshot: CleanShot 2022-07-22 at 14 18 23]

We should gracefully handle errors like this so that the Sourcegraph repos don't have a "failed fetch" status. Candidate solutions include:

mrnugget commented 2 years ago

(Collecting my thoughts from Slack here in a comment)

I don't think the first two candidates are acceptable solutions:

- silently ignore 403 errors
- create the repos with a readme.md that includes the error message (this would make it easier for site admins to troubleshoot the issue)

Both would leave us with invalid state in the database and (in the second case) invalid state in the repository: the repo would no longer reflect the content of the repository it's supposed to mirror, the clone status would be wrong, etc.

I think the only acceptable solution is to either

delete the crate versions from lsif_dependency_repos so that we stop trying to fetch the crate versions that error

Or


Taking a step back, I think the larger problem is that package hosts don't work like code hosts in the framework of our external services.

Roughly speaking, repositories from code hosts end up like this in Sourcegraph:

  1. External service is added to the external_services table, including configuration that says which repositories to sync from the code host
  2. A Source that wraps a client for the external service is instantiated and its ListRepos method is called.
  3. ListRepos talks to the code host to get the list of repositories specified in the external service configuration (often in batch requests, or using the code host's search API, etc.)
  4. The Syncer in repo-updater then compares the result of the ListRepos method with what's in the repo table: new repos are added, repos that are in repo but not in the result of ListRepos are deleted, repos that exist in both are updated if necessary.
  5. gitserver clones the newly added repositories in repo
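
To make the flow concrete, here's a rough Go sketch of its shape; the types and the diffing are heavily simplified and the names are illustrative, not the actual Sourcegraph code:

```go
// A rough sketch of the code-host flow above. Types and diffing are heavily
// simplified; names are illustrative, not the actual Sourcegraph code.
package repos

import "context"

// Repo is a simplified stand-in for a row in the repo table.
type Repo struct {
	Name string
}

// Source wraps a client for one external service.
type Source interface {
	// ListRepos asks the code host which repositories the external
	// service configuration selects.
	ListRepos(ctx context.Context) ([]Repo, error)
}

// diff compares the code host's answer with what's currently stored, which
// is roughly what the Syncer in repo-updater does on every sync.
func diff(remote, stored []Repo) (added, updated, deleted []Repo) {
	remoteSet := make(map[string]bool, len(remote))
	for _, r := range remote {
		remoteSet[r.Name] = true
	}
	storedSet := make(map[string]bool, len(stored))
	for _, r := range stored {
		storedSet[r.Name] = true
	}
	for _, r := range remote {
		if storedSet[r.Name] {
			updated = append(updated, r) // exists in both: update if necessary
		} else {
			added = append(added, r) // new: inserted into repo, then cloned by gitserver
		}
	}
	for _, r := range stored {
		if !remoteSet[r.Name] {
			deleted = append(deleted, r) // gone on the code host: removed from repo
		}
	}
	return added, updated, deleted
}
```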

With package hosts the situation is a bit different:

  1. External service is added to the external_services table, 99% of the time not containing a list of repositories to sync
  2. Same as above: a Source that wraps a client for the package host is instantiated and its ListRepos method is called
  3. ListRepos doesn't talk to the package host to determine which packages to add to the DB. Instead it lists the entries in lsif_dependency_repos (which was populated by auto-indexing, dependency search, the Rust crates syncer, or auto-add of package versions): one entry per commit per package. Exception: for NPM packages a request is made to check whether a package exists.
  4. Same as above
  5. Same as above

The difference: with code hosts, the API of each code host is asked on every sync "does this repo exist? is it updated?"; with package hosts, the source of truth is lsif_dependency_repos, and that table is never verified against the package host.
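
For contrast, here's a simplified sketch of what a package-host Source's ListRepos effectively does (DependencyRepo and the store interface are made up for illustration; the real code differs):

```go
// Contrast: a package-host Source. ListRepos reads stored references
// instead of asking the package host anything, so unresolvable packages
// are still yielded. DependencyRepo and the store are hypothetical.
package packages

import "context"

type Repo struct {
	Name string
}

type DependencyRepo struct {
	Scheme  string // e.g. "rust-crates", "npm"
	Name    string
	Version string
}

type dependencyReposStore interface {
	// ListDependencyRepos would return the rows of lsif_dependency_repos.
	ListDependencyRepos(ctx context.Context) ([]DependencyRepo, error)
}

type packagesSource struct {
	store dependencyReposStore
}

// ListRepos never contacts the package host (NPM existence checks aside),
// so a non-existent package still becomes a repo and later fails to clone.
func (s *packagesSource) ListRepos(ctx context.Context) ([]Repo, error) {
	deps, err := s.store.ListDependencyRepos(ctx)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var repos []Repo
	for _, d := range deps {
		name := d.Scheme + "/" + d.Name // one repo per package, regardless of versions
		if !seen[name] {
			seen[name] = true
			repos = append(repos, Repo{Name: name})
		}
	}
	return repos, nil
}
```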

That means, in practical terms and in the context of this ticket:

  1. When a user configures a GitHub external service to clone the repo does-not-exist/does-not-exist, no repository is inserted into the database and no attempt to clone it is made. If a repo is deleted on GitHub, it gets removed from the repo table on the next sync and is later cleaned up on gitserver.
  2. When a user configures a package host external service and does-not-exist/does-not-exist is added to lsif_dependency_repos, that is turned into a repo row and gitserver then fails to clone it. It will stay in repo for as long as it exists in lsif_dependency_repos, because ListRepos doesn't confirm whether the package exists on the package host.

Ideally, I think, we would

  1. not enter invalid entries into lsif_dependency_repos (we could syntactically check the names and/or check with the package host)
  2. remove entries from lsif_dependency_repos when we notice the package doesn't exist
  3. not mix abstraction layers: I really think gitserver shouldn't know about lsif_dependency_repos and neither read from nor write to it.

(Edit: I know that's not a description of a concrete solution, but rather solution constraints. Once we agree on the constraints of what a solution should look like, I can help make it more concrete.)

olafurpg commented 2 years ago

not mix abstraction layers: I really think gitserver shouldn't know about lsif_dependency_repos and neither read from nor write to it.

Can you elaborate on this goal? The easiest solution from my perspective would be to address this issue in vcsPackagesSyncer by doing some kind of write to the database when a version fails to download.

https://sourcegraph.com/github.com/sourcegraph/sourcegraph@42a9107ed81c92cb5f234d952e8eee0f4836242b/-/blob/cmd/gitserver/server/vcs_packages_syncer.go?L279

For example, we create a new table to track download failures and implement exponential backoff to gracefully skip re-downloading versions that have failed in the past. We could then add a site admin UI to manually trigger a failed package to re-download.

mrnugget commented 2 years ago

Can you elaborate on this goal?

I think what goes into lsif_dependency_repos should be of no concern to gitserver. It should only care about its own table and what repo-updater has in repos. We already have a lot of writers into that table and to keep complexity and possible unintended interactions between subsystems to a minimum, I'd try to decouple gitserver and the cloning from what's in that table.

And that being said: even if gitserver would delete entries from that table, the crates.io indexer or auto-indexing would just add them back, right?

That's where we come back to the difference between package host and code host: if a repo disappears on the code host, it will automatically disappear from the Sourcegraph instance if the external service is configured correctly. I think whatever we build should tap into that.

We could then add a site admin UI to manually trigger a failed package to re-download.

We already have that, kinda! This UI here

[Screenshot: screenshot_2022-07-25_11 07 46@2x]

and this site-admin page here

[Screenshot: screenshot_2022-07-25_11 08 27@2x]

allow you to see errors and trigger a re-clone. I think we should reuse that UI and mechanisms as much as possible.


For example, we create a new table to track download failures and implement exponential backoff to gracefully skip re-downloading versions that have failed in the past.

That's kinda what I'm thinking too!

For example: we could have a worker job that periodically scans the repo table for package host repositories that have failed to clone (by looking at gitserver_repos too?) and then denylists these entries in lsif_dependency_repos and/or marks them as invalid. The PackagesSource would then need to not yield these entries anymore, which would cause them to be removed from the instance.
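
Very roughly, and purely as a sketch of the idea (none of these store methods exist today), such a worker could look like this:

```go
// Worker sketch: the store methods here are placeholders standing in for
// "read clone failures from repo/gitserver_repos" and "flag the entry in
// lsif_dependency_repos"; they are not existing Sourcegraph APIs.
package repoupdater

import (
	"context"
	"time"
)

type failedPackageRepo struct {
	RepoName  string
	LastError string
}

type deadPackageStore interface {
	// ListFailedPackageRepoClones would return package-host repos whose
	// clone failed, based on gitserver's clone status / last error.
	ListFailedPackageRepoClones(ctx context.Context) ([]failedPackageRepo, error)
	// MarkDependencyRepoInvalid would denylist/flag the corresponding
	// lsif_dependency_repos entry so PackagesSource stops yielding it.
	MarkDependencyRepoInvalid(ctx context.Context, repoName, reason string) error
}

// runDeadPackageWorker periodically flags package repos that keep failing
// to clone; on the next sync they disappear from the instance.
func runDeadPackageWorker(ctx context.Context, store deadPackageStore, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			failed, err := store.ListFailedPackageRepoClones(ctx)
			if err != nil {
				continue // log and try again on the next tick
			}
			for _, f := range failed {
				_ = store.MarkDependencyRepoInvalid(ctx, f.RepoName, f.LastError)
			}
		}
	}
}
```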

Or the worker could try to talk to the registry and check whether the package should exist etc.


(The larger question that comes to my mind is: maybe we should have a separate UI for packages in site-admin area that essentially lists the lsif_dependency_repos table - ideally with more metadata, such as origin of package name, when it was added, etc. - and hide packages from the repo view...)

olafurpg commented 2 years ago

And that being said: even if gitserver would delete entries from that table, the crates.io indexer or auto-indexing would just add them back, right?

Correct, and that's why exponential backoff is a better solution than deleting the version.

(The larger question that comes to my mind is: maybe we should have a separate UI for packages in site-admin area that essentially lists the lsif_dependency_repos table - ideally with more metadata, such as origin of package name, when it was added, etc. - and hide packages from the repo view...)

+1 this would very much be in the spirit of RFC 698 - make packages a first-class entity https://docs.google.com/document/d/1aGTzOKXJkf29FXUaDb3uevmIXmX09hYS11UKSp6j2SA/edit#heading=h.eklszdrbn53c

One difference between packages and repos that complicates reusing architecture/UI is that packages are version-based (every version is a unique HTTP request that can independently fail) while repos are, well, repo-based (git fetch fetches all revisions). For example, we should avoid solutions that involve "re-cloning" an entire package (all versions) just because a single version is missing.

The PackagesSource would then need to not yield these entries anymore, which would cause them to be removed from the instance.

What do you do in the case when only a single version fails among other successful versions? PackagesSource is an unsuitable place IMO to implement error handling because it works at the granularity of a repo instead of versions. We already tried to implement the sync for all versions in PackagesSource and saw the incredibly large volume of requests that we sent to the package host, depleting our rate limits quickly.

Proposal

We ignore separation of concerns for packages and allow gitserver to write to lsif_dependency_repos (it's already reading from the table). We add two new columns to the table: timestamps for the previous and next request. These columns are empty by default, and are only populated when a request errors. If a request errors then we write timestamps into these columns to document when the next request should be attempted. We could optionally store the error in a separate column to display in a dedicated "Packages" admin UI (where we could have buttons to reset the exponential backoff).
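
To illustrate the backoff part of this proposal (the column and field names are placeholders, not a final schema), the per-version check in gitserver could look roughly like this:

```go
// Backoff bookkeeping sketch. Field names (LastFailedAt, NextAttemptAt)
// are placeholders for the proposed columns; only the shape of the check
// and the doubling-on-failure logic matter.
package packages

import "time"

type versionSyncState struct {
	LastFailedAt  *time.Time
	NextAttemptAt *time.Time
	FailureCount  int
}

// shouldAttempt reports whether gitserver should try downloading this
// version now, or skip it because it's still inside the backoff window.
func shouldAttempt(s versionSyncState, now time.Time) bool {
	return s.NextAttemptAt == nil || !now.Before(*s.NextAttemptAt)
}

// recordFailure doubles the wait after each consecutive failure, capped at max.
func recordFailure(s versionSyncState, now time.Time, base, max time.Duration) versionSyncState {
	s.FailureCount++
	shift := s.FailureCount - 1
	if shift > 16 { // avoid overflow; the cap below kicks in anyway
		shift = 16
	}
	backoff := base << uint(shift)
	if backoff > max {
		backoff = max
	}
	next := now.Add(backoff)
	s.LastFailedAt = &now
	s.NextAttemptAt = &next
	return s
}
```

An admin UI "retry now" button would then just clear NextAttemptAt (or whatever the column ends up being) for that version.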

mrnugget commented 2 years ago

We ignore separation of concerns for packages and allow gitserver to write to lsif_dependency_repos (it's already reading from the table).

I don't think "it reads, so it's okay to write" is a good argument. Can we have another way to provide feedback about clone failures back to lsif_dependency_repos? Separate table?

If a request errors then we write timestamps into these columns to document when the next request should be attempted.

Who reads these timestamps and uses them?

olafurpg commented 2 years ago

I don't think "it reads, so it's okay to write" is a good argument. Can we have another way to provide feedback about clone failures back to lsif_dependency_repos? Separate table?

I agree it's a bad argument. What do you think about writing the error into the repo itself? We would effectively replace the db with the git repo itself to store metadata about request failures. The timestamps would then be read by gitserver to determine whether to skip a request or not.

mrnugget commented 2 years ago

What do you think about writing the error into the repo itself? We would effectively replace the db with the git repo itself to store metadata about request failures. The timestamps would then be read by gitserver to determine whether to skip a request or not.

That feels like a hack to me. Repository contents are "data plane" and not "control plane" and mixing one with the other, I suspect, will lead to all sorts of pain. (Kinda like how you don't put an image into a photo album that says "error: connection lost" when the user's upload didn't work.)

Instead of adding a new mechanism, I'd rather try to expand the existing ones. For example: we already have retries in repo-updater if gitserver fails to clone: https://github.com/sourcegraph/sourcegraph/blob/b523c351f0fe6e075855ac45236bc2bfa1872afe/internal/repos/scheduler.go#L211-L234

If we now add another mechanism to gitserver itself to do retries, those could/would clash. Or you'd need to turn this existing mechanism off in certain cases and have the other one take over, etc., which also doesn't sound too easy to pull off correctly.

So could we maybe extend that UpdateScheduler to, say, do something different if a specific error is returned from gitserver (the vcsPackagesSyncer could return a special error, for example)? Maybe it can put these repos in a dead queue or push them to a channel, and then we have another goroutine in repo-updater that listens to that channel and deletes/marks these repos? We have a goroutine that goes in the other direction and puts repos into the UpdateScheduler once they're synced from the code/package host.
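
As a sketch of what that could look like (the error type and handler below are invented for illustration, not existing code):

```go
// Sketch only: the error type and handler are hypothetical.
package repoupdater

import (
	"errors"
	"fmt"
)

// PackageVersionsNotFoundError is what the packages syncer in gitserver
// could return when specific versions 403/404 on the package host.
type PackageVersionsNotFoundError struct {
	RepoName string
	Versions []string
}

func (e *PackageVersionsNotFoundError) Error() string {
	return fmt.Sprintf("packages: versions %v of %s could not be fetched", e.Versions, e.RepoName)
}

// handleCloneError shows how the scheduler side could branch on the error:
// permanent "this doesn't exist" failures go to a dead queue instead of
// being retried like transient clone errors.
func handleCloneError(repoName string, err error, deadQueue chan<- string) bool {
	var notFound *PackageVersionsNotFoundError
	if errors.As(err, &notFound) {
		deadQueue <- repoName // picked up by a goroutine that flags/deletes the versions
		return true
	}
	return false // ordinary error: let the existing retry mechanism handle it
}
```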

olafurpg commented 2 years ago

So could we maybe extend that UpdateScheduler to, say, do something different if a specific error is returned from gitserver (the vcsPackageSyncer could return a special error, for example)?

How would gitserver find out what versions to skip from lsif_dependency_repos?

It should be relatively easy to update gitserver to return structured errors about what versions failed to download. One concern I have with this approach is that it should ideally be easy for site admins to understand why a given version is failing to download.

mrnugget commented 2 years ago

How would gitserver find out what versions to skip from lsif_dependency_repos?

Hehe, good point, I implicitly assumed we'd combine the idea in my last comment with the other ideas from above, namely: once we have a goroutine that monitors dead repositories, it should probably persist that information somewhere. We could, for example, have a structured error returned from gitserver that contains what tag/version couldn't be cloned. Then the dead-repo-monitoring goroutine could delete/flag those versions in lsif_dependency_repos.

(The problem we have right now is that we don't have a database link between repo and lsif_dependency_repos, which we should probably have at some point and which might also be a requirement for this to work)

mrnugget commented 2 years ago

Also, ping @sourcegraph/repo-management since they've probably thought about all of these ideas here before (retries, exponential backoff, recording clone status, etc.)

olafurpg commented 2 years ago

A lot of our existing infrastructure is centered around repos while packages need infrastructure around versions. Is there precedent for gitserver writing into the db? If it's OK for gitserver to write into the db, then I'd vote for that solution since it seems like the simplest one that can be implemented with a minor diff compared to our existing solutions today.

mrnugget commented 2 years ago

I mean, I've given you a few of my reasons for why I don't think it's OK for gitserver to write into the DB, but I'd say it's your decision ultimately.

the simplest solution that can be implemented with a minor diff compared our existing solutions today.

I think we've reached a point where a smallest-possible-diff strategy might not get us away from this local maximum where packages are kinda like repos but also not, and everywhere we try to make them behave like repos there's friction. But again: if you're owning packages and maintain them long term, then it's your decision.

sashaostrikov commented 2 years ago

@mrnugget @olafurpg 👋

I guess the most valid example of such stuff in the code we own is external service syncing (here).

It has exponential backoff logic, saving the timestamp of the next sync in the next_sync_at column of the external_services table and tracking how long it has been since the last successful sync in the last_sync_at column.

Another place to take a look is repo cloning and the gitserver_repos table, featuring the clone_status, last_error, and last_fetched columns, which can be useful when anything goes wrong. The clone logic and timestamp/error updates are located here.

jplahn commented 2 years ago

But again: if you're owning packages and maintain them long term, then it's your decision.

This is the question that jumps to the front of my mind reading through this discussion. @olafurpg - what ownership model are you proposing for packages in the short and long term?

Edit: we can also take this to Slack 🙂. I don't want to derail this discussion.

olafurpg commented 2 years ago

@jplahn I opened a separate issue to discuss ownership of packages https://github.com/sourcegraph/sourcegraph/issues/39635

Strum355 commented 1 year ago

I thought I had a write-up on the solution :horse::keycap_ten: and I came up with a month ago, but I either had an "I'll write it up tomorrow" moment or I genuinely can't find it. Either way, I spent the past few hours re-reading Slack threads and GitHub issues on this particular issue, and here's what I've ultimately come up with. I think it strikes a good balance between both sides of the table here, so I'll make a valiant attempt at writing up the beast now.

Pre-context:

The war of wants

There's a delicate relationship between gitserver and repo-updater and their respective tables. Repo-updater discovers repos by querying the external services and inserting these into the repo table (which populates gitserver_repos by trigger). If a repo is not discoverable/visible, it won't be entered into repo.

Package repos break away from this behaviour slightly. Package repos are discovered in 2-3 ways (I'll only focus on #1 and #2 as the third is dotcom and crates specific), either by being referenced in a codeintel index (inserted into lsif_dependency_repos) or by being statically referenced in the config for a package repo external service. Neither of these sources give any guarantee that the package repo is actually resolvable by the sourcegraph instance.

The original (non-package repo) behaviour is the one Thorsten would prefer to maintain, as it keeps the repo and gitserver_repos tables free of a potentially large mass of repos that will never become resolvable, and avoids breaking the contract of what each component is actually meant to do. Given the aforementioned lack of guarantees from package repo references, we may end up with a varying number of entries in repo etc. that are not and may never be resolvable. Bad housekeeping, you're fired.

At the same time, at the time(s) of discussion, there were reasons why this less elegant approach was (unfortunately) on the better side of things:

1) Surfacing sync errors in the UI for package repo references was most easily implemented through the existing infrastructure for surfacing failed git clones/fetches. This necessitated having the entries in repo.
2) Reduced request spam/borderline DoS to the package repo external services. On every sync of a package repo external service, we would have done a request for every single package reference we have stored to check for existence and/or fetch metadata like description (the latter being a whole other thing that may be addressed by what's to follow, but not important here). With over 104k packages on crates.io, and over a million on npm, the upper bound on the number of requests is insane. And this would happen on every sync, and then individual requests when requesting an individual repo update, and also at least once for every version to sync that version... you get the picture.

Hence, repo-updater doesn't actually check for the existence of a package repo; it just creates a repo object as if everything was swell, gitserver attempts to clone the potentially non-existent package repo, and we surface the error in the UI. Works:tm:, but we can do better to make everything slot into place more naturally and less restrictively.

The lay of the land since then

Talk is well n good (we love talking :mega:), but engineering time is super constrained, and lack of buy-in and other priorities compound the problem, so no one actually did anything (well, I attempted something at the time but in hindsight it really was terrible and thankfully I thought so too when I pushed the branch).

Packages has since become a job-fair project with one full-time backend engineer (for now), so we can finally throw me at the problem and see if I stick on the wall or leave for Google (/s).

On the technical side of things:

1) Package repo references and their versions are split out into separate tables (lsif_package_repos and package_repo_versions) for a 1:N relationship (eventually; another migration is needed for backcompat reasons). This'll be somewhat important/"nice-to-have" later (little did I know :clueless:).
2) A new site-admin page is being introduced at /packages or wherever to show entries from lsif_package_repos (table rename pls) and some site-admin buttons n stuff for managing them. This is slightly more important later.

The Proposal

Now that the required reading is out of the way, let's actually talk SOLUTIONS. Let's see if I can reconstruct it through the late-night beer.

1) lsif_dependency_repos will remain a source of unguaranteed-to-be-resolvable package repo references. That word here I think is key, or maybe I've just inflated its meaning. Onto the next point.
2) Add some sync columns to lsif_dependency_repos so we can track the last package metadata sync error, consecutive sync failure count (for exponential backoff + limit), last sync time, and you get the idea. This way we can have resolvable package repos only be resynced on a fixed schedule (once a week/once a day/:banana: depending on what kind of metadata we want to sync and keep updated in the future). We can use this logic in combination with existing sync infrastructure such as this should-delete checking to never create an entry in repo for packages that are not resolvable from the get-go, and/or to delete package repos that existed but are no longer resolvable after a certain retry count (some fiddly bits around unscheduled vs scheduled syncs and the tracking of those results), while also not having to perform a h*ck load of requests for a known-good (within a certain timeframe in the past) package repo on every individual or ecosystem-wide sync. (A rough sketch of this gating follows at the end of this proposal.)
3) Add similar columns to package_repo_versions to track the syncing of individual versions as opposed to an entire package. This doesn't give us as much, outside of potentially finer-tuned timing of deletion of tags that don't exist anymore, as well as surfacing sync errors of individual versions in the UI.

In combination with the new separate /packages site-admin page, this also lets us surface the error in the UI all before we ever hit the repo table, while minimizing the number of requests we have to make over a period of time. Cash money :LGTM: The largest issue I potentially see with this is that when new packages are introduced, we still have the >=2 requests (sync metadata in repo-updater and sync code in gitserver). Ultimately I don't think this is avoidable, but reducing the frequency of follow-up requests from repo-updater means we can do a request for a package in repo-updater as infrequently as once a week (configurable). This strikes a pretty good balance between staleness and request load, and should of course be accompanied by tools (see: UI buttons n gizmos) to manually wrangle the underlying beast safely and easily for one-offs. This does mean that ultimately gitserver will need to write to the package-repo related tables, but this is relatively minor. Needs must.
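
Here's the promised minimal sketch of the gating/drop decisions from point 2 above, assuming hypothetical sync-state columns (names are placeholders, not a real schema):

```go
// Gating/drop sketch for the proposal above. The sync-state fields stand
// in for the columns described in point 2; they are not a final schema.
package packages

import "time"

type packageRepoSyncState struct {
	LastCheckedAt time.Time // last time repo-updater asked the package host about this reference
	LastError     string    // last metadata sync error, if any
	FailureCount  int       // consecutive failed scheduled syncs
}

const (
	resyncInterval = 7 * 24 * time.Hour // "as infrequently as once a week (configurable)"
	maxFailures    = 5                  // after this, stop yielding the reference / delete it
)

// needsResync reports whether we should contact the package host again at
// all, keeping known-good references cheap on ecosystem-wide syncs.
func needsResync(s packageRepoSyncState, now time.Time) bool {
	return now.Sub(s.LastCheckedAt) >= resyncInterval
}

// shouldDrop reports whether a reference has failed enough consecutive
// scheduled syncs that it should no longer produce a repo row.
func shouldDrop(s packageRepoSyncState) bool {
	return s.FailureCount >= maxFailures
}
```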

Fine print and further considerations

There's some further nuance around cases such as the following:

Gitserver sync error specifics

1) What do we do in the case of a specific version failing to sync in gitserver while other versions have already been synced?

UI error surfacing

With this proposal, we have >=3 places where an error may be recorded (is there one for repo? Didn't see one):

1) lsif_dependency_repos: the package in its entirety doesn't seem to be resolvable, we couldn't fetch any metadata. Set in repo-updater.
2) package_repo_versions: a specific version could not be resolved. Set in gitserver.
3) gitserver_repos: see above for uncertainty around when/whether errors should be inserted here.

Today, /repositories shows the error from gitserver_repos. Makes sense :+1: well n cool. The proposal thus far leads to showing lsif_dependency_repos.<last_error_or_whatever> at one level of /packages, and package_repo_versions.<last_error_or_whatever> in, e.g., a collapsible list for each individual version. But what about gitserver_repos.last_error? :thinking: This mostly hinges on the answers to the previous questions, but worth bringing up for further context.


There are absolutely further concerns at varying depths of specificity (abstract to code line-level), but at this stage there's enough to read and consider here anyway, so I'll cut the rambling deep-thought-proposal there. Inb4 someone comes along with a one-liner question that throws the whole thing under the bus. Let's get the ball rolling on this so we can finally solve it, so keep discussion pragmatic, constructive, and with forward momentum :sunglasses: