mac-chaffee opened 1 month ago
Hi! Wow, thanks for this in-depth feedback!
So I like all of these changes and can commit to getting this across the finish line.
There are a couple of pieces that will require help from us: `_redirects` and `_headers`.

5KB seems perfectly reasonable to me, so let's go for it.
For caching, what do you think about the ability to wipe the in-memory cache? In particular:
Do you have any thoughts on supporting those features?
Finally, let me know what parts you'd like to work on and I can commit to whatever else.
Thanks again, this is awesome!
Some good discussion in this PR: https://github.com/picosh/pico/pull/154
antoniomika and I have had a chat. Because of the way we record analytics and consider it a first-class feature at pico, we are leaning towards implementing the caching mechanism in our app code, although we can be convinced otherwise if there's a better way to outsource HTTP caching (which is a solved problem that doesn't need re-inventing).
One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.
I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.
The main change would be "pulling" analytics from whatever is doing the caching (either a CDN or a distributed cache storage system) instead of making the caching system "push" view counts to pgs. Could also use the opportunity to store the metrics in a time series database. Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.
The specifics would depend on which method you pick for caching. But you could make that choice free from the burden of working around the current analytics system. If you do decide to build analytics around the cache, then we could just disable caching for .html files as a first step as we build the caching system.
Just thinking out loud, no strong opinions. My main goal is ensuring you're happy with the resulting setup :)
> One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.
Thanks for the reminder! This got us from 3+ seconds to around 1 second: https://github.com/picosh/pico/commit/ac3be1722d7dbfb69fe98697af2bd071fa3dbf0c
> I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.
I love this thinking. To provide some context, the reason we built our own analytics system is that we wanted to be in full control of how the data gets aggregated and stored, without using or being up-sold on something off-the-shelf. We have positioned ourselves to host privacy-focused services, and all the other self-hosted solutions that I found felt very heavy. It didn't pass the BYO test for me.
> Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.
I'm definitely open to converting our analytics table to use TimescaleDB, and I did investigate it when building, but decided against it mainly because of the complexity. I do think there are other things we can do with a vanilla db table that we can try before switching over (e.g. a partitioned table). However, that doesn't change our push/pull mechanism.
If we went with Souin or another off-the-shelf cache system, how could we record site-usage analytics using pull?
Wow the analytics are a heck of a lot faster now! Thanks! 🚀
> We have positioned ourselves to host privacy-focused services
Thanks for clarifying, that makes sense (and is one of the reasons I moved my site to pgs!). I think that means we can cross major CDN providers off the list (using tuns instead of Cloudflare Tunnels is another reason I signed up).
I also have doubts about `http-cache` in the context of multi-regional pgs due to the cache invalidation issue.
I can think of four possible options that involve rebuilding analytics around caching:
Option 1: Expand Souin's Prometheus metrics

If we use Souin, we'd need a new feature added for obtaining page view counts. I'd bet darkweak would want this feature to be generic enough for others to use, and the best idea I can think of is to expand the existing Prometheus metrics to include hostname and route labels (disabled by default, enableable via config). It seems you're already collecting some Prometheus metrics for pgs.

Pros:
- Page views could be pulled with a query like `topk(10, sum by (route) (souin_total_request_counter{host="www.example.com"}))`

Cons:
- High label cardinality: roughly `num_sites * num_users * num_files * num_servers * num_other_labels` time series
- Lack of ability to filter out bots may make this option a non-starter
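For concreteness, here's a rough Go sketch of what a per-host/per-route counter could look like. This is not Souin's actual implementation; the label set and handler wiring are assumptions on my part, and only the metric name is borrowed from the query above. Every distinct (host, route) pair becomes its own time series, which is exactly where the cardinality concern comes from.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestCounter illustrates the shape such a metric could take.
// The host/route labels are hypothetical and would need to be
// opt-in, since every (host, route) pair is a separate series.
var requestCounter = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "souin_total_request_counter",
		Help: "Requests served, labeled by site hostname and route.",
	},
	[]string{"host", "route"},
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Count the page view by site and path.
		requestCounter.WithLabelValues(r.Host, r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})
	// Prometheus (or pgs) would then pull view counts from /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```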
Option 2: `no-cache` + conditional GET requests for HTML files

This option involves allowing Souin to cache all files except HTML files. For HTML files, we return the following headers:
- `Cache-Control: no-cache` to prevent Souin from caching the HTML files
- `Etag: <hash>` so browsers will perform a conditional GET request (`If-None-Match`) for the HTML file

Then we add code to pgs that returns a `304 Not Modified` response, while also asynchronously inserting the page view into postgres like normal. Pgs would probably need its own kind of cache+TTL thing, like a hashmap of Etags so it can quickly return 304s. This hashmap would of course need to be purged somehow from pgs-ssh.

Pros:

Cons:
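A minimal sketch of the conditional-GET flow described in option 2, assuming hypothetical `etagFor`/`recordView` helpers rather than real pgs code (the async postgres insert is stubbed out with a log line):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"log"
	"net/http"
)

// etagFor derives a strong ETag from the rendered HTML body.
func etagFor(body []byte) string {
	sum := sha256.Sum256(body)
	return fmt.Sprintf("\"%x\"", sum[:8])
}

// recordView stands in for the asynchronous analytics insert;
// in pgs this would write to postgres, here it just logs.
func recordView(host, path, userAgent string) {
	log.Printf("view host=%s path=%s ua=%s", host, path, userAgent)
}

func serveHTML(w http.ResponseWriter, r *http.Request, body []byte) {
	etag := etagFor(body)

	// Count the page view without blocking the response.
	go recordView(r.Host, r.URL.Path, r.UserAgent())

	w.Header().Set("Cache-Control", "no-cache") // keep the shared cache from storing HTML
	w.Header().Set("Etag", etag)

	// If the browser already has this version, answer 304 and skip the body.
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Write(body)
}

func main() {
	page := []byte("<html><body>hello</body></html>")
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		serveHTML(w, r, page)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```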
Option 3: Client-side JavaScript analytics

Something at the back of my mind in the discussion of analytics is that really the best kind of caching is client-side caching (I set `Cache-Control: max-age=1800` for all my site's static assets), but that completely breaks analytics. That's why most analytics providers use a little JavaScript with sendBeacon() to count page views. This can still respect privacy, see https://www.goatcounter.com/

Pros:

Cons:
Option 4: Server log-based analytics

Apparently this is how Cloudflare's analytics works when you don't use their JavaScript. You'd need the following changes:

Pros:

Cons:
I'm leaning toward option 4, but what do you think? Are there other options?
Btw I updated my branch with some of the things discussed: https://github.com/picosh/pico/compare/main...mac-chaffee:pico:caddy-caching
Hey @mac-chaffee
Sorry for the long delay, we had some other pico work that we thought we could fold nicely into your caching work. In particular, we have deployed a change to how we collect site usage analytics. We are now using `pipe` to collect these metrics (we are calling it a `metric-drain`).
I think you are right, we should go with option (4).
Here's a patchset that I'm prototyping to adapt our metric drain to receive caddy logs: https://pr.pico.sh/prs/35
Once that is complete, all we need to do is figure out how to send the logs from caddy to `pipe` in a resilient way. You can let us think about that.
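As a rough illustration of the drain side, here's a small Go sketch that decodes caddy JSON access-log lines from stdin and picks out the handful of fields a site-analytics drain would care about. The field names assume caddy's default JSON access log format and should be checked against real log output; forwarding to `pipe` is left as a print statement.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// accessLog holds the subset of caddy's JSON access log fields that
// matter for site analytics. Verify the field names against your config.
type accessLog struct {
	Timestamp float64 `json:"ts"`
	Status    int     `json:"status"`
	Request   struct {
		Host    string              `json:"host"`
		URI     string              `json:"uri"`
		Method  string              `json:"method"`
		Headers map[string][]string `json:"headers"`
	} `json:"request"`
}

func main() {
	// Read log lines from stdin, e.g. `tail -F access.log | ./drain`.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var entry accessLog
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // skip non-JSON or partial lines
		}
		ua := ""
		if v := entry.Request.Headers["User-Agent"]; len(v) > 0 {
			ua = v[0]
		}
		// A real drain would forward this view to pipe/pgs; we just print it.
		fmt.Printf("%s %s %d %q\n", entry.Request.Host, entry.Request.URI, entry.Status, ua)
	}
}
```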
For the sake of argument, let's say we have a solution for caddy and our `metric-drain`: what's left for us to start using your contributions?
No worries!
You all are really serious about dog-fooding haha, that sounds like a cool solution.
For my own understanding, have you decided roughly what the multi-region setup would look like? Would this be like hub-and-spoke where all the stateful stuff (database, minio) lives on a single "hub" server with regional "spokes" that run caddy+pgs/pico?
> what's left for us to start using your contributions?
To answer the question, honestly I think I'd just have to rebase my branch and try it out! When the metric-drain + caddy logs solution is ready, you'd just delete the line that disables caching for HTML files: `r.Header.Set("cache-control", "no-cache")`.
Hello! This weekend I was interested in learning how pgs worked so I looked through the code and wrote down any possible places that I thought could affect performance. I didn't do any runtime testing, so take this with a grain of salt.
To serve a single HTML file, the following must happen (step numbers in parentheses match the references below):

- (1) A DNS lookup
- (3) A DB query for the user (`app_users.name`) (FindUserForName)
- (4) A GetBucket() call to minio
- (5) A DB query for the project (`projects.user_id && name`) (FindProjectByName)
- (6) A DB query for feature flags (`feature_flags.user_id && name`) (HasFeatureForUser)
- A GetObject() call for `_redirects`, then parsing `_redirects` (calcRoutes)
- A GetObject() call for `_headers`, then parsing `_headers`
- (13) Another DB query on `feature_flags.user_id && name` (AnalyticsVisitFromRequest)
- An insert into `analytics_visits` (AnalyticsCollect)

The following are some ideas for improving performance:
- For (1), I predict the DNS lookup is the slowest operation in the list since (I think) all the other operations don't leave your single VM. Are you using Oracle Linux? If it's anything like RHEL, then local DNS caching is not enabled by default. If you enable systemd-resolved, it will cache 4096 responses for up to 2 hours and it will respect the TTL. Users should be encouraged to set high TTLs (>1 hour) to improve performance.
- For the database queries (3, 5, 6, 13), an easy win would be to fetch the `feature_flags` in the same query where we fetch the user, but then we'd still be performing 2 queries per request. Possibly caching is a better solution, see below.
- For the GetBucket() call (4), that will send a BucketExists() request to Minio. Technically that's not necessary since you can create Bucket objects using just the name of the bucket.
- For the GetObject() calls to read `_redirects` and `_headers`, I think caching is our only hope. Caching these would also allow us to cache the compiled regexes.

Caching
Since all of this data is small, I think we could use an in-process, in-memory cache like https://github.com/hashicorp/golang-lru (the 2q implementation sounds smarter since it considers frequency in addition to recency). NVM, if we want to set TTLs, we have to use the LRU version. The following work would be required:

- Limit the size of `_redirects` and `_headers` to something like 5KB so our cache size is bounded.
- Build the `AssetHandler` struct (plus routes parsed from `_redirects` and `_headers`) and save it to the cache, keyed on the `user-project` slug with a default TTL of something reasonable like 1 hour (so any caching bug we happen to introduce resolves itself in 1 hour).
- Make sure the upload path can derive the same `user-project` slug key that we need for reading from our own cache.
- Invalidate the cached entry when a new `_redirects` or `_headers` file is uploaded (or I guess we could clear the cache on any upload as a user-controlled way of clearing their own cache).

If we do all that, then we can serve assets with a single locally-cached DNS lookup, a single hash table lookup, and a single GetObject() minio call! 🚀
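Here's a rough sketch of what the expirable-LRU piece could look like with hashicorp/golang-lru v2, assuming a hypothetical cachedSite struct standing in for pgs's AssetHandler state and a made-up `user-project` slug:

```go
package main

import (
	"fmt"
	"regexp"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// cachedSite is a stand-in for whatever pgs would actually cache per
// user-project slug: parsed _redirects/_headers rules, compiled regexes,
// etc. The shape here is illustrative only.
type cachedSite struct {
	Redirects []*regexp.Regexp
	Headers   map[string]string
}

func main() {
	// Expirable LRU: bounded size plus a 1-hour TTL, so a stale or buggy
	// entry ages out on its own, as described above.
	cache := expirable.NewLRU[string, *cachedSite](1024, nil, time.Hour)

	slug := "erock-my-blog" // hypothetical user-project key

	if _, ok := cache.Get(slug); !ok {
		// Cache miss: this is where pgs would fetch _redirects/_headers
		// from minio, parse them, and build the handler state.
		site := &cachedSite{
			Redirects: []*regexp.Regexp{regexp.MustCompile(`^/old/(.*)$`)},
			Headers:   map[string]string{"X-Frame-Options": "DENY"},
		}
		cache.Add(slug, site)
	}

	// On upload of a new _redirects or _headers file, the write path would
	// drop just that project's entry (or Purge() the whole cache).
	cache.Remove(slug)

	fmt.Println("entries:", cache.Len())
}
```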
Thoughts? I can contribute some of this. It's a pretty big change, so I just wanted to run it by you before diving too deep.