Open sdherr opened 6 months ago
Note, that in our mirrorcache redirector, we set the expiry to be around the same second for both files, so that the window of error is only 1s in 256s with a caching proxy or CDN. I like your proposal as it could help reduce problems from cached repo-metadata.
> curl -s -I download.opensuse.org/tumbleweed/repo/oss/repodata/repomd.xml{,.asc} | grep age
cache-control: public, max-age=254
cache-control: public, max-age=253
> curl -s -I download.opensuse.org/tumbleweed/repo/oss/repodata/repomd.xml{,.asc} | grep age
cache-control: public, max-age=251
cache-control: public, max-age=251
I also noticed that we don't send a Last-Modified
header, but an etag
header that could serve the same purpose.
The repomd.xml has a revision
value, e.g. the current one for Tumbleweed is repomd-1716631215.xml.asc
) and keep a symlink or copy on the old name. This has the advantage that it will work with any of the 100 mirrors, because it is a plain file that is versioned.
Additionally, zypper could auto-retry once in case of a 404 or version/signature-mismatch.
Just a note about the window of error / auto-retries: your window of error is not really 1 second. If both files age out of the cache, and then a client requests them and happens to hit the unlucky during-a-repo-update boundary and the cache is refilled with the old version of repomd.xml
and the new version of repomd.xml.asc
, then that broken state will persist for the entire cache TTL (~ 4 minutes in your case it looks like) and all client requests will error until they both age out again. Making the client auto-retry doesn't really help.
I'm perfectly happy with the revision
suggestion. You might want to coordinate that with the other yum-ish clients.
I'm perfectly happy with the
revision
suggestion. You might want to coordinate that with the other yum-ish clients.
OK, let's use https://github.com/rpm-software-management/rpm-metadata/issues/1 to have the discussion in one place.
Just a note about the window of error / auto-retries: your window of error is not really 1 second.
(note that 1 second difference is just because second request was later - if they would be somehow at the same moment - the number will be the same)
But that problem is a bit more complicated, actually there are two issues:
I can think about the only possible way to fix it - it is the client side. So, zypper should keep in mind such situation and retry retrieving the xml and signature pair (and probably some similar operations related to metadata, but that is more broad topic).
Description
There currently exists a timing window that can lead to client errors if you are using signed repo metadata with
repo_gpgcheck=1
. When you are using that option of course the server provides both the normalrepomd.xml
file and also a detached signature of it inrepomd.xml.asc
. Even if the server swaps out both files atomically on a repo update, an unlucky client can have received the old version ofrepomd.xml
and the new version ofrepomd.xml.asc
, if its requests happened to come on either side the update instant. The signature will then fail to verify and the client will return an error. Furthermore if the server is using any type of caching system or CDN, this mis-match can get persisted in the cache up to the TTL of the files if you are unlucky, causing many client errors and making the repo essentially unavailable for a period of time.Problem
Unlike the other metadata files, there is no hash information in the name of
repomd.xml.asc
, or any other indication of what version of the file you want. This is likely because there is a chicken-and-egg problem of writing this info intorepomd.xml
: the file must exist first before you can sign it or calculate its hash, and the act of adding that information back in torepomd.xml
invalidates the signature or changes the hash.I furthermore assume that this is a detached signature in the first place instead of simply clearsigning
repomd.xml
to keep it one atomic file because there were worries about backwards-incompatibility with old clients.Solution
Any change which would indicate which version of
repomd.xml.asc
that you need to the server is a potential fix, but I think there are three main possibilities:In particular for backwards compatibility and ease-of-update, I suggest we do number 3, simply take the
Last-Modified
header value on therepomd.xml
response, and add it as a query parameter on therepomd.xml.asc
request:.../repomd.xml.asc?repomd_last_modified=<header value>
.4xx
HTTP code, zypper could simply revert back to today's behavior and request the file without the query param.By including the information on which repomd.xml file you have in the cache key for
repomd.xml.asc
we are creating different cached versions of the file, and the client will automatically receive the correct one and proceed successfully if these cache keys are already populated. If there is no cache involved and a server is responding to every request directly, at least they are in no worse position than they were before, and they can teach their server to respond appropriately to the query param if they wish.There is still a possible timing window here where bad pairs can get cached if both cache TTLs time out and a client re-fills the cache with an old version of
repomd.xml
and a new version ofrepomd.xml.asc?repomd_last_modified=123
from a server that is ignoring the query string. However this gives servers the ability to close this gap permanently and respond correctly to every request, by doing either:*/repomd.xml.asc?repomd_last_modified=*
to more than double the TTL ofrepomd.xml
, eliminating the overlapping timing window where it's possible to cache incompatible versions of the files.I have filed corresponding bugs against both
dnf
andtdnf
. https://github.com/rpm-software-management/libdnf/issues/1664 https://github.com/vmware/tdnf/issues/475