osquery / foundation

osquery Foundation Charter, Legal, and Process Documents

The search for package hosting #80

Open directionless opened 2 years ago

directionless commented 2 years ago

Our package downloads are currently running around 70TB/month. And this is growing. This is fairly expensive to me, personally, so I've been slowly trying to find a better answer.

directionless commented 2 years ago

I captured the CloudFront logs for most of a day and did some analysis. There is a very small number of IP addresses that account for the vast majority of our served traffic. Most of these are in AWS, and they look like they keep requesting the same deb files. I would bet these are VPC gateways to large sites with frequent or ephemeral node creation.
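
The kind of analysis described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: it assumes CloudFront standard access logs that have already been downloaded and decompressed, and it relies on the `#Fields:` header line plus the `c-ip` and `sc-bytes` columns of that format. The function name `top_talkers` is hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def top_talkers(lines: Iterable[str], n: int = 15) -> list[tuple[str, int]]:
    """Return the n client IPs that were sent the most bytes."""
    fields: list[str] = []
    bytes_by_ip: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Take column names from the log header instead of hard-coding positions.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split("\t")))
        try:
            bytes_by_ip[row["c-ip"]] += int(row["sc-bytes"])
        except (KeyError, ValueError):
            continue  # skip rows that are missing columns or are malformed
    return bytes_by_ip.most_common(n)
```

Feeding it the lines of one or more log files (for example via `open(path)`) gives a ranked list of `(ip, bytes)` pairs, which is enough to see whether a handful of addresses dominate.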

If we recognize that most of our traffic is within AWS, then we can plan hosting around that. Notably, transfer from S3 to an endpoint within the same region is free. This leads to a possible solution.

We create a bucket per region, replicate our package content, and then bounce users to these direct S3 URLs. There are, of course, implementation questions...
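
As a rough sketch of the "bucket per region" idea, the mapping from a region to a direct S3 URL could look like this. The bucket names and the example key are hypothetical placeholders, not the project's real ones; the URL shape is the standard virtual-hosted-style S3 endpoint.

```python
from __future__ import annotations

from urllib.parse import quote

# Hypothetical naming scheme: one replicated bucket per region.
REGION_BUCKETS = {
    "us-east-1": "example-pkg-us-east-1",
    "us-west-2": "example-pkg-us-west-2",
    "eu-west-1": "example-pkg-eu-west-1",
}


def regional_url(region: str, key: str) -> str | None:
    """Build a direct S3 URL for `key` in `region`, or None if we have no bucket there."""
    bucket = REGION_BUCKETS.get(region)
    if bucket is None:
        return None
    # quote() keeps "/" intact and escapes anything unsafe in the object key.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key.lstrip('/'))}"
```

Returning `None` for unknown regions leaves room for a fallback, such as continuing to serve from the CDN.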

Possible Implementations:

| What | Pro | Con |
| --- | --- | --- |
| Lambda | Lots of control; probably inexpensive | Served from US; need to write it |
| CloudFront Function | Easily fits what we're doing today | Limited functionality |
| Lambda@Edge | Global | Need to write it; limited languages |
| S3 MRAP | AWS implementation of what I'd write | Unclear if custom URLs are supported; unclear price |
| Route53 Geo Load Balancing | Simple | Unclear if we can do the custom URLs for S3 |
| Just use S3 from us-east-1 | Simple | Need to rename bucket; can't easily do other regions |

I suspect that the best route for me is to write the Lambda. It gets us off CloudFront, Go is supported, etc. But that's going to take more than a couple of days. So as a stopgap, I've implemented a CloudFront function to redirect users from our top 15 IP addresses to the bucket closest to them.

Getting there involved a bunch of other pieces:

  1. There are now 3 additional buckets serving packages
  2. S3 is configured to replicate between them
  3. CloudFront has a viewer request function to generate URLs for the top N users
  4. I enabled S3 storage metrics; they look pretty cheap
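
The decision logic of such a viewer request function can be sketched as below. Real CloudFront Functions are written in JavaScript, so this Python is only an illustration of the logic, not the deployed code. The client addresses (documentation ranges), the region assignments, and the bucket URL template are all hypothetical.

```python
from __future__ import annotations

import ipaddress

# Hypothetical: heavy clients mapped to the region whose bucket is closest to them.
HEAVY_CLIENTS = {
    ipaddress.ip_network("198.51.100.7/32"): "us-east-1",
    ipaddress.ip_network("203.0.113.0/28"): "eu-west-1",
}

# Hypothetical bucket naming scheme, one bucket per region.
BUCKET_URL = "https://example-pkg-{region}.s3.{region}.amazonaws.com"


def redirect_target(client_ip: str, uri: str) -> str | None:
    """Return the Location for a 302 redirect, or None to serve the request normally."""
    addr = ipaddress.ip_address(client_ip)
    for network, region in HEAVY_CLIENTS.items():
        if addr in network:
            return BUCKET_URL.format(region=region) + uri
    return None
```

Everyone not on the list gets `None` and is served as before, so the redirect only affects the top N clients.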

directionless commented 2 years ago

Another approach here is to use GitHub. There is some prior art for hosting an apt repo on GitHub Pages:

mike-myers-tob commented 2 years ago

CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages ah....maybe only for Chocolatey / NuGet https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

directionless commented 2 years ago

> CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth. (This is also true for packagecloud and the various other places I've seen.) This may be different if my S3 redirects pan out.

> GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages ah....maybe only for Chocolatey / NuGet https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

I want this to be useful. But the supported repos are very lacking.

mike-myers-tob commented 2 years ago

> > CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

> Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth. (This is also true for packagecloud and the various other places I've seen.) This may be different if my S3 redirects pan out.

Yes, but the way I read it, those limits are the ones guaranteed for any open-source project, and if you accept a sponsorship agreement of some kind there would be other unstated limits (perhaps negotiable up to where we need them). Not sure if osquery is high-profile enough for them, but we could ask.

We definitely want to get this hosting bill off your wallet. It's not sustainable, and it's kind of an existential risk to the project if you suddenly have to cut out.

I think we want to encourage these top downloaders to manage their own package cache. We could write a tutorial that explains how to create private mirrors of package repos so that they can point their ephemeral VMs to that instead of constantly re-downloading from our S3 bucket. Maybe we can even pitch it as a cost saver for them too, if it means less inbound network traffic cost to them. What if we could rate limit by IP address or IP range, whichever would be effective? Not for everyone, just for the repeat downloaders. Eventually they will notice that they should be caching.
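
To make the rate-limiting idea concrete, here is a minimal per-client token bucket sketch. It is only an illustration of the technique under assumed parameters; the class name `PerClientLimiter`, the rate, and the burst size are hypothetical, and nothing here reflects an existing osquery or AWS feature.

```python
from __future__ import annotations

import time


class PerClientLimiter:
    """Token bucket per client: `rate` requests/second, bursts of up to `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, client_ip: str, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._state.get(client_ip, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[client_ip] = (tokens, now)
        return allowed
```

Applying a limiter like this only to the known repeat downloaders, rather than to every client, matches the intent of nudging them toward their own cache without affecting ordinary users.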

robbat2 commented 2 years ago

Wondering if you have more stats:

I'll send an email to @directionless making some introductions.

directionless commented 2 years ago

I'm not great at updating tickets...

I thought a bit, and realized that something seemed very off. The number of requests we see (about 2.5 million/day) is very, very high for osquery packaging. Is it actually credible that we see ~30 computers a second download osquery? So I went digging into the actual users of the data.
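
The per-second figure follows directly from the daily request count quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

requests_per_day = 2_500_000  # figure quoted in the comment above
requests_per_second = requests_per_day / SECONDS_PER_DAY

print(round(requests_per_second, 1))  # 28.9, i.e. roughly 30 downloads every second
```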

Anyhow, I discovered that there is a single consumer that is responsible for the vast bulk of the traffic. I don't know much about it, other than that it's a VPC in AWS us-east-1, and they keep downloading the Ubuntu x86 package.

Because AWS in-region S3 data transfer is free, this leads to a simple solution: for the busy clients, we can redirect them directly to the bucket. I went and set up some redirect magic in CloudFront to bounce the top 10 AWS IPs to direct bucket links. Our monthly bill is now much more manageable, though probably still a bit higher than desired.

I think I can get it even lower by moving my redirecting project into Lambda and completely moving away from CloudFront.