openSUSE / zypper

World's most powerful command line package manager
http://en.opensuse.org/Portal:Zypper
Other
392 stars 107 forks source link

Instability when updating repo and using repo_gpgcheck #549

Open sdherr opened 2 months ago

sdherr commented 2 months ago

Description

There currently exists a timing window that can lead to client errors if you are using signed repo metadata with repo_gpgcheck=1. When you are using that option of course the server provides both the normal repomd.xml file and also a detached signature of it in repomd.xml.asc. Even if the server swaps out both files atomically on a repo update, an unlucky client can have received the old version of repomd.xml and the new version of repomd.xml.asc, if its requests happened to come on either side the update instant. The signature will then fail to verify and the client will return an error. Furthermore if the server is using any type of caching system or CDN, this mis-match can get persisted in the cache up to the TTL of the files if you are unlucky, causing many client errors and making the repo essentially unavailable for a period of time.

Problem

Unlike the other metadata files, there is no hash information in the name of repomd.xml.asc, or any other indication of what version of the file you want. This is likely because there is a chicken-and-egg problem of writing this info into repomd.xml: the file must exist first before you can sign it or calculate its hash, and the act of adding that information back in to repomd.xml invalidates the signature or changes the hash.

I furthermore assume that this is a detached signature in the first place instead of simply clearsigning repomd.xml to keep it one atomic file because there were worries about backwards-incompatibility with old clients.

Solution

Any change which would indicate which version of repomd.xml.asc that you need to the server is a potential fix, but I think there are three main possibilities:

  1. Changing the name of the file like you do with other metadata files, with the name calculated client-side somehow.
  2. Adding a header to the request.
  3. Adding a query parameter to the request.

In particular for backwards compatibility and ease-of-update, I suggest we do number 3, simply take the Last-Modified header value on the repomd.xml response, and add it as a query parameter on the repomd.xml.asc request: .../repomd.xml.asc?repomd_last_modified=<header value>.

  1. Almost every webserver simply ignores extra unexpected query params, so this should not break anything. On the off chance that it does and a server responds with a 4xx HTTP code, zypper could simply revert back to today's behavior and request the file without the query param.
  2. By contrast, caches / CDNs typically do include the query params as part of the cache key by default.

By including the information on which repomd.xml file you have in the cache key for repomd.xml.asc we are creating different cached versions of the file, and the client will automatically receive the correct one and proceed successfully if these cache keys are already populated. If there is no cache involved and a server is responding to every request directly, at least they are in no worse position than they were before, and they can teach their server to respond appropriately to the query param if they wish.

There is still a possible timing window here where bad pairs can get cached if both cache TTLs time out and a client re-fills the cache with an old version of repomd.xml and a new version of repomd.xml.asc?repomd_last_modified=123 from a server that is ignoring the query string. However this gives servers the ability to close this gap permanently and respond correctly to every request, by doing either:

  1. Teach the webserver to respond correctly to the query param, or
  2. Increase the cache TTL on */repomd.xml.asc?repomd_last_modified=* to more than double the TTL of repomd.xml, eliminating the overlapping timing window where it's possible to cache incompatible versions of the files.

I have filed corresponding bugs against both dnf and tdnf. https://github.com/rpm-software-management/libdnf/issues/1664 https://github.com/vmware/tdnf/issues/475

bmwiedemann commented 1 month ago

Note, that in our mirrorcache redirector, we set the expiry to be around the same second for both files, so that the window of error is only 1s in 256s with a caching proxy or CDN. I like your proposal as it could help reduce problems from cached repo-metadata.

> curl -s -I download.opensuse.org/tumbleweed/repo/oss/repodata/repomd.xml{,.asc} | grep age
cache-control: public, max-age=254
cache-control: public, max-age=253
> curl -s -I download.opensuse.org/tumbleweed/repo/oss/repodata/repomd.xml{,.asc} | grep age
cache-control: public, max-age=251
cache-control: public, max-age=251

I also noticed that we don't send a Last-Modified header, but an etag header that could serve the same purpose.

The repomd.xml has a revision value, e.g. the current one for Tumbleweed is 1716631215. We could patch the signer and zypper to use this one (e.g. repomd-1716631215.xml.asc) and keep a symlink or copy on the old name. This has the advantage that it will work with any of the 100 mirrors, because it is a plain file that is versioned.

Additionally, zypper could auto-retry once in case of a 404 or version/signature-mismatch.

sdherr commented 1 month ago

Just a note about the window of error / auto-retries: your window of error is not really 1 second. If both files age out of the cache, and then a client requests them and happens to hit the unlucky during-a-repo-update boundary and the cache is refilled with the old version of repomd.xml and the new version of repomd.xml.asc, then that broken state will persist for the entire cache TTL (~ 4 minutes in your case it looks like) and all client requests will error until they both age out again. Making the client auto-retry doesn't really help.

sdherr commented 1 month ago

I'm perfectly happy with the revision suggestion. You might want to coordinate that with the other yum-ish clients.

bmwiedemann commented 1 month ago

I'm perfectly happy with the revision suggestion. You might want to coordinate that with the other yum-ish clients.

OK, let's use https://github.com/rpm-software-management/librepo/issues/237#issuecomment-2141173042 to have the discussion in one place.

andrii-suse commented 1 month ago

Just a note about the window of error / auto-retries: your window of error is not really 1 second.

(note that 1 second difference is just because second request was later - if they would be somehow at the same moment - the number will be the same)

But that problem is a bit more complicated, actually there are two issues:

  1. CDN did have incorrect caching instructions for repomd.xml* , so the files could be way out of sync for those who used cdn.opensuse.org . It was fixed on May, 21. So after around May, 22 the cdn nodes should have correct caching instructions ( https://progress.opensuse.org/issues/159108#note-12 )
  2. It is still theoretically possible that the pair of repomd.xml{,.asc} files is changed while being downloaded by zypper (two generals problem). The biggest issue here is that the old files have gone completely from the server, and that is not easy to fix on server side. Moreover - old files are not needed once the new ones are available.

I can think about the only possible way to fix it - it is the client side. So, zypper should keep in mind such situation and retry retrieving the xml and signature pair (and probably some similar operations related to metadata, but that is more broad topic).