ostreedev / ostree

Operating system and container binary deployment and upgrades
https://ostreedev.github.io/ostree/

delta-only repositories #729

Open cgwalters opened 7 years ago

cgwalters commented 7 years ago

In the Fedora/CentOS case where by default we rely on e.g. university-owned mirrors that might be some random ext4 server and not a proper object store, we can hit performance issues with the archive format.

It should be quite possible to make it easier for server operators to manage a "delta-only" repository. See also: https://github.com/ostreedev/ostree/pull/701

So it's delta-only + single "from empty" delta for the latest.

I think it'd be possible to cobble this together today via `ostree static-delta generate --min-fallback-size 100000` for each delta you want, then `ostree summary -u`, then sync the summary and `deltas/` content to the "delta repo".
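The workflow described above could be sketched roughly as follows. This is only an illustration that builds the command lines without running them; the function name, the repo paths, and the use of `rsync` for the sync step are assumptions, and it does not address the commit-object concern raised later in the thread.

```python
from typing import List


def plan_delta_repo(src_repo: str, dest_repo: str, latest: str,
                    previous: List[str],
                    min_fallback_size: int = 100000) -> List[List[str]]:
    """Return the commands (as argv lists) for the workflow; nothing is executed."""
    base = ["ostree", f"--repo={src_repo}", "static-delta", "generate",
            f"--min-fallback-size={min_fallback_size}"]
    cmds = []
    # The single "from empty" delta for the latest commit.
    cmds.append(base + ["--empty", f"--to={latest}"])
    # One delta per older commit that clients should be able to upgrade from.
    for rev in previous:
        cmds.append(base + [f"--from={rev}", f"--to={latest}"])
    # Regenerate the summary so that it lists the new deltas.
    cmds.append(["ostree", f"--repo={src_repo}", "summary", "-u"])
    # Sync only the summary and deltas/ into the "delta repo" (rsync is an assumption).
    cmds.append(["rsync", "-a", f"{src_repo}/summary", f"{dest_repo}/summary"])
    cmds.append(["rsync", "-a", f"{src_repo}/deltas/", f"{dest_repo}/deltas/"])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the plan looks right.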

alexlarsson commented 7 years ago

I think this sounds good, as long as it properly falls back to the "from empty" delta if we're pulling from "not the next-to-latest" local version.
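The fallback behaviour asked for here might look like the sketch below. It is not ostree's actual pull logic; the `Delta` tuple, the function name, and the in-memory set of available deltas are hypothetical stand-ins for whatever the summary advertises.

```python
from typing import Optional, Set, Tuple

# (from_commit, to_commit); from_commit is None for a "from empty" delta.
Delta = Tuple[Optional[str], str]


def choose_delta(local: Optional[str], target: str,
                 available: Set[Delta]) -> Optional[Delta]:
    """Pick the delta to fetch, or None if the client is already at target."""
    if local == target:
        return None
    # Prefer an exact delta from the locally deployed commit.
    if local is not None and (local, target) in available:
        return (local, target)
    # Otherwise fall back to the "from empty" delta for the target.
    if (None, target) in available:
        return (None, target)
    raise LookupError(f"no usable delta for {target}")
```

With a repo that only carries `("a", "c")` and `(None, "c")`, a client on commit `b` ends up with the "from empty" delta.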

cgwalters commented 7 years ago

(But we need some unit test coverage, and there's various enhancements one could make on top of this like being able to fall back to a separate archive repo for e.g. downgrades)

cgwalters commented 7 years ago

Also, one thing occurs to me - we'd at least need to maintain the commit objects in the repo, otherwise prune would prune the deltas.

dustymabe commented 7 years ago

> (But we need some unit test coverage, and there's various enhancements one could make on top of this like being able to fall back to a separate archive repo for e.g. downgrades)

Does this issue cover the creation of unit tests for static-delta-only repos, or do we need another ticket for that?

> Also, one thing occurs to me - we'd at least need to maintain the commit objects in the repo, otherwise prune would prune the deltas.

Are we talking about the static-delta-only repo? Wouldn't that defeat the point of not having a bunch of small files in the repo? If we have a master repo where the small files and the static deltas live, and we just create static-delta-only repos by copying content out of that repo, then we don't need to worry about this, correct?

ramcq commented 6 years ago

> be some random ext4 server and not a proper object store

@cgwalters I'm kind of confused by this - what about a filesystem makes it unsuitable for storing/hosting an ostree repo? Is there a more effective backend in which you can store an ostree repo and serve it over HTTP? Or do mirror operators simply dislike having lots of files around?

alexlarsson commented 6 years ago

So, I recently chatted with someone who was running an "app store" about how they implement authorized downloads. Basically what they do is serve the app files on a cdn like cloudfront, and then use a feature like cloudfront secure urls as documented here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html where they generate the final URL on their server where they know that the logged in user is allowed to download a particular app. The secure URL has a lifetime of 30 seconds and is signed on the server, so the client doesn't have to care and can just download the thing.
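A generic version of the short-lived signed URL idea can be sketched as below. This uses a plain HMAC over the URL and expiry time purely for illustration; it is not CloudFront's actual signing scheme, and the parameter names `expires` and `signature` are assumptions. It also assumes the URL being signed has no query string of its own.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def sign_url(url: str, secret: bytes, now: int, lifetime: int = 30) -> str:
    """Server side: append an expiry time and an HMAC signature to the URL."""
    expires = now + lifetime
    message = f"{url}\n{expires}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{url}?{urlencode({'expires': expires, 'signature': signature})}"


def verify_url(signed: str, secret: bytes, now: int) -> bool:
    """Edge side: accept only unexpired URLs with a matching signature."""
    parts = urlsplit(signed)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = hmac.new(secret, f"{base}\n{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and now <= expires
```

The client never sees the secret; it just fetches the URL it was handed before the 30 seconds run out.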

In the context of ostree we could do the same thing if we had a delta-only repo on a cdn:

CloudFront allows you to use cookies for this, but it seems some other CDNs only support HTTP params, so maybe ostree should have a feature similar to --http-header that adds an HTTP param to all URLs.
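What such a feature would do to each request URL could be sketched like this. ostree has no such option according to the comment above, so the helper name and the `token` parameter are purely hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_param(url: str, name: str, value: str) -> str:
    """Return url with name=value appended, keeping any existing query."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_param("https://cdn.example.com/repo/summary", "token", "abc")` yields the same URL with `?token=abc` appended.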

dustymabe commented 5 years ago

@jlebon, @cgwalters, @sinnykumari and I were discussing 'delta-only' repos today. One thing @jlebon brought up was:

  jlebon | @walters @ksinny @dustymabe, just remembered re. static deltas -- those can actually list
         | fallback objects the client should just fetch directly from `objects/`. so we'll have to be
         | careful of that, either also mirroring just those ones (i think they're usually big files), or
         | teach ostree to fetch fallback objects from a separate repo? (edited)
 walters | yeah, i think we need a repo config flag saying it's a delta-only repo
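For the "also mirror just those ones" option, the paths to copy could be derived as in the sketch below. It assumes an archive-mode repo, that the fallback objects are file content objects (stored as `.filez`), and that the list of fallback checksums has already been obtained by some other means; the function name is hypothetical.

```python
from typing import Iterable, List

HEX_DIGITS = set("0123456789abcdef")


def fallback_object_paths(checksums: Iterable[str]) -> List[str]:
    """Map fallback object checksums to their paths under objects/."""
    paths = []
    for checksum in checksums:
        # ostree checksums are SHA-256, i.e. 64 lowercase hex characters.
        if len(checksum) != 64 or not set(checksum) <= HEX_DIGITS:
            raise ValueError(f"not a SHA-256 checksum: {checksum!r}")
        # Archive repos shard objects by the first two hex characters.
        paths.append(f"objects/{checksum[:2]}/{checksum[2:]}.filez")
    return paths
```

The resulting list could be fed to the same sync step that copies `deltas/`.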

ramcq commented 5 years ago

I've been discussing this stuff with @alexlarsson a lot in the context of Flathub. At one point, the flathub stats were showing each download (whether an upgrade, or a new pull) was averaging 1GB of data transferred - but this was during a period that when ostree didn't see a matching delta it would pull the scratch delta instead of doing an object pull (madness, later resolved).

A delta-only repo is basically re-instating this: mirrors are great and everything, but are a far less relevant way of distributing files than modern caching/proxying CDNs. BunnyCDN (for Endless) and Fastly (for Flathub) work a-OK for ostree repos: you can easily tune the caching to keep the immutable objects around for ~ever and have short timeouts / explicit purges. It's pretty easy to cache ostree repos in CDNs, and the hit rate is superb (>97% in both cases I have access to, likely the two largest production ostree repos at present).
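The caching split described here could be expressed as a per-path policy along these lines. The TTL values are placeholders, and treating `deltas/` as immutable alongside `objects/` is an assumption based on both being named by checksum; this is not the configuration either CDN actually uses.

```python
def cache_control(path: str) -> str:
    """Pick a Cache-Control header value for a path inside an ostree repo."""
    path = path.lstrip("/")
    # Content-addressed files never change once published.
    if path.startswith(("objects/", "deltas/")):
        return "public, max-age=31536000, immutable"
    # Everything else (summary, refs, config) can change on each update,
    # so keep the timeout short and rely on explicit purges.
    return "public, max-age=60"
```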

So: what problem is really being solved here? When you look at your CDN bill, or the time and data it costs at the client to have a very limited version of things on the server, I'm really not convinced that, unless we make deltas heaps smarter, a delta-only repo is a benefit for clients. It makes mirroring easier, yes - because you have maybe one or a couple of delta folders per ref - but most people don't have a mirror network, so I think it represents a net loss for the bandwidth efficiency of the client, unless we:

cgwalters commented 5 years ago

> but this was during a period that when ostree didn't see a matching delta it would pull the scratch delta instead of doing an object pull (madness, later resolved).

Right: https://github.com/ostreedev/ostree/pull/1709

ramcq commented 5 years ago

> > but this was during a period that when ostree didn't see a matching delta it would pull the scratch delta instead of doing an object pull (madness, later resolved).
>
> Right: #1709

Oh yeah! What I said back then. tl;dr - deltas are an amazing technical advantage of ostree, and (modulo bringing any repo server to its knees when generating them on large files) incredibly smart and bandwidth efficient, but they totally fail to deliver on that promise due to how they are currently deployed and managed. Let's make the repo management tools (ostree/flatpak/repo-manager) smarter before we force that ineffectual deployment cost onto our downstream mirrors and every end user by flipping a delta-only bit and not solving the real problem. :)

cgwalters commented 5 years ago

We (FCOS) are discussing this in the context of this issue which links to this MirrorManager one. A concern some people have is tying ourselves solely to a CDN.

ramcq commented 5 years ago

This is the answer you get if you ask mirror operators, of course. :) Provide an OCI image which just opens a caching front-end, and you can deploy your own grass-roots CDN with a geoIP or round-robin frontend. Setting low TTLs or issuing PURGE is pretty easy after a summary update. I think if you "solve" this problem (making life easier for mirror operators) it will make things worse for users and undo e.g. the work on delta RPMs etc.
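Issuing a PURGE after a summary update could look roughly like this sketch. It assumes the caching front-end accepts an HTTP `PURGE` request against the cached URL, which varies between CDNs and usually needs authentication; the function name and the default list of paths are assumptions. The requests are only constructed here, not sent.

```python
from typing import List, Sequence
from urllib.request import Request


def purge_requests(base_url: str,
                   paths: Sequence[str] = ("summary", "summary.sig")) -> List[Request]:
    """Build one PURGE request per mutable file that changed with the update."""
    base = base_url.rstrip("/")
    return [Request(f"{base}/{path}", method="PURGE") for path in paths]
```

Sending them would be a matter of calling `urllib.request.urlopen(req)` on each, with whatever credentials the front-end expects.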

dustymabe commented 5 years ago

> concern some people have is tying ourselves solely to a CDN.

For me, I'm not as concerned with tying ourselves to a CDN. We've been using a CDN for our ostree repo for a little while now and people still complain about slow download speeds and timeouts all the time. So we either have things configured badly or things are getting cycled out of the cache too fast. See also https://github.com/ostreedev/ostree/issues/1541 where we were discussing one optimization (i.e. the many redirects might be what is slowing down the downloads).

If we can get a good CDN "answer" then I'd be fine with that too.

ramcq commented 5 years ago

Oh! Yeah redirects absolutely rinse the performance of whatever pipelining ostree is doing - at least I've definitely seen that at some point early in Flathub's life - that's why we set up dl.flathub.org as a separate hostname for repo access only. You have to point the origin in ostree to the hostname and path served by the CDN - you could probably finesse that with a mirrorlist of one in ostree.

I am almost certain that any Flathub issues are all due to load on the origin server rather than any problem with the CDN. Debian for instance has two CDNs (CloudFront and Fastly) and pays for neither - for Flathub we got Fastly basically by me tweeting, and it wasn't the only offer we received, just one of the best CDNs so I didn't spend much time with the others.

ramcq commented 5 years ago

https://gist.github.com/ramcq/a3991b5834767c6da73eec1af08b52ab is how the origin is configured on Flathub, fwiw.