Closed ehelms closed 4 years ago
Hah, so the plan is "CDN all the things", or, slightly more elaborate:

- CDN deb, monitoring what falls over
- investigate if we can have by-hash support in freight so that deb can use better caches, enable it and CDN debian too
- get @evgeni access to RackSpace to investigate logging to their S3 backend
@evgeni got :envelope_with_arrow:.
> investigate if we can have by-hash support in freight so that deb can use better caches, enable it and CDN debian too
Unfortunately the answer here is only "pull requests welcome".
> Unfortunately the answer here is only "pull requests welcome".
Or suggest a tool that solves our issues and supports by-hash.
> Or suggest a tool that solves our issues and supports by-hash.
Yes, I wouldn't be against replacing freight with, let's say, pulp-deb.
Maybe, some day, we could use katello to publish katello, /me dreams.
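For context on what by-hash support would mean for freight: in the Debian repository format, clients fetch index files via a content-addressed path like `dists/<suite>/<component>/binary-<arch>/by-hash/SHA256/<digest>` instead of the mutable filename, so caches (and a CDN) can never serve a stale or mismatched index. A minimal sketch of what a publisher would need to do — this is a hypothetical helper, not actual freight code:

```python
import hashlib
import os

def publish_by_hash(index_path):
    """Mirror an index file (e.g. a Packages file) into a by-hash/SHA256/
    directory next to it, as the Debian by-hash scheme expects.
    Hypothetical sketch, not actual freight code."""
    with open(index_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    by_hash_dir = os.path.join(os.path.dirname(index_path), "by-hash", "SHA256")
    os.makedirs(by_hash_dir, exist_ok=True)
    target = os.path.join(by_hash_dir, digest)
    if not os.path.exists(target):
        # Hard link: content under a given digest is immutable, so linking is safe
        os.link(index_path, target)
    return target
```

The Release file would additionally need an `Acquire-By-Hash: yes` field so apt knows these paths are available.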
Ohai :)
Can Aptly solve the by-hash function? It's got a decent API, so worth a look. The API supports upload too (I've been using its API to directly upload packages from my home Jenkins server, no rsync needed), which could simplify our current setup, perhaps.
I've started actually working on this. The logging config seems to be straightforward, but we need Fastly to enable the S3-based logging for our account (which I did request in a ticket to them).
Okay, I've got basic logging for stagingdeb working, and would like to discuss a few design things before continuing.
```yaml
cloudfiles:
- name: stagingdeb logging
  access_key: <key>
  bucket_name: fastly
  format: '%h %l %u %t "%r" %>s %b'
  format_version: '2'
  gzip_level: '0'
  message_type: classic
  path: "/stagingdeb/"
  period: '3600'
  placement:
  public_key:
  response_condition: error log
  timestamp_format: "%Y-%m-%dT%H:%M:%S.000"
  user: <user>
```
@mmoll @ekohl @ehelms @GregSutcliffe what do you think?
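As an aside, the `format` string in the config above is the classic Apache common log format (`%h %l %u %t "%r" %>s %b`), so the resulting lines can be consumed with standard tooling. A rough parsing sketch (the regex is my assumption about the output shape; verify against actual Fastly logs):

```python
import re

# One named group per field of '%h %l %u %t "%r" %>s %b'
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf(line):
    """Parse one common-log-format line into a dict, or None if it doesn't match."""
    m = CLF.match(line)
    return m.groupdict() if m else None
```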
Fastly sends logs from multiple endpoints which results in multiple logfiles created for each time period. My initial idea was to have one bucket for all subdomains, but I think having one bucket per subdomain would be easier.
:+1:
Currently the logs are "rotated" every hour (that's the default), we could increase to 24h, but I don't think this is necessary (we pay for storage and download-traffic only, not per-file).
:+1:
Currently we only log requests that produced HTTP codes 400 to 600. I don't think we need full access logs?
Stats on how many people use something could be useful. @GregSutcliffe has looked at downloaded plugins (even though it's not fully accurate).
The logs can be compressed (they aren't at the moment) and encrypted (same). I don't see much value in encryption. Compression I'd like to evaluate for a few days, when we collected a few more logs.
If there are IPs in there we should think about that privacy aspect.
The logs currently include the client's IP address, yes. We can get rid of it, or we could replace it with one of the geo-based vars Fastly provides. I didn't see any Crypto-PAn or similar in the docs.
(I'd probably just drop them, or replace with 0.0.0.0 so that common parsers still can parse the logs)
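Since the IP is the first field of a common-log line, the `0.0.0.0` replacement could be as simple as swapping that field out, keeping the rest of the line untouched so standard parsers still work. A sketch, assuming that log layout:

```python
def anonymize(line):
    """Replace the leading client-IP field of a common-log-format line
    with 0.0.0.0, leaving every other field intact."""
    head, sep, rest = line.partition(" ")
    return "0.0.0.0" + sep + rest
```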
I've configured error logging for both downloads and stagingdeb to log to a separate bucket, but did not implement any anonymization yet, as I've not heard any :+1: or :-1: for that, and without anonymization I didn't want to store full access logs for now.
I'd vote for :+1: to anonymization, I think 0.0.0.0 sounds like a plan to me.
For the RSS feed (different vhost) we do count the individual IPs to gather some statistics about installs. @GregSutcliffe might have some more insight into what kind of things are useful. IMHO country code level logging is pretty useful at least to see where the various installs are located. This can be used to plan meetups.
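If the IP field is replaced with a country code (Fastly exposes geo variables such as `client.geo.country_code`; the exact variable name should be checked against Fastly's docs), per-country install stats become trivial. A hypothetical sketch over such anonymized log lines:

```python
from collections import Counter

def country_stats(lines):
    """Count log lines per leading country-code field.
    Assumes the first whitespace-separated field is the country code."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())
```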
Okay, next iteration.
We now have four containers/buckets:
The error logs get... the error logs, non-anonymized. The normal logs get the access logs, with the IP replaced by the country code.
What is left to close this out?
@ehelms I've just opened https://github.com/theforeman/foreman-infra/pull/1030 which adds the current live config (minus secrets). If that's deemed OK and merged, we can go ahead, add new endpoints using that playbook and switch the next services.
Next step is switching yum.theforeman.org: https://github.com/theforeman/foreman-infra/pull/1041
Yepp, but the DNS isn't changed yet as Ohad was out of town.
Will that close out this issue? If not, what's the next step(s)?
I think to fully close this one, I'd at least also move the Debian archive behind the CDN. That should account for all our big endpoints.
DNS updated, I see traffic, all good :)
So far Fastly has served 2.5M requests, totalling ~230GB of data. There are ~40k 404 errors which I'll look into in more detail later.
There is still only a ~10% cache hit ratio. The 404s might have an impact on that, but it does mean we save "only" 23GB out of that 230GB.
It will be interesting to see what happens after a release of Foreman if there is a higher hit ratio due to a lot of upgrading users.
Quite a bit (~20% of the errors I looked at) of the 404s are against `latest/el6` and `nightly/el6`, which haven't existed for quite a while now.
But yeah, the numbers will become interesting when there is a release to be downloaded.
@evgeni I think we can close this now, right?
Yes!
@evgeni What's the plan to proceed here? :)