picosh / pico

services that empower developers using ssh
https://pico.sh
MIT License

Improving pgs asset serving performance (brainstorming) #149

Open mac-chaffee opened 1 month ago

mac-chaffee commented 1 month ago

Hello! This weekend I was interested in learning how pgs worked so I looked through the code and wrote down any possible places that I thought could affect performance. I didn't do any runtime testing, so take this with a grain of salt.

To serve a single HTML file, the following must happen:

  1. DNS lookup for TXT record to determine project name (findRouteConfig -> GetCustomDomain)
  2. Regex match on all routes (CreateServeBasic)
  3. DB Query 1 (WHERE clause on app_users.name) (FindUserForName)
  4. GetBucket() call to minio
  5. DB Query 2 (WHERE clause on projects.user_id && name) (FindProjectByName)
  6. DB Query 3 (WHERE clause on feature_flags.user_id && name) (HasFeatureForUser)
  7. GetObject() call to minio for _redirects, then parsed
    • (note: redirects may be ignored after minio error)
    • (note: redirect file loaded into RAM without limits)
  8. Regex match on every line of _redirects (calcRoutes)
  9. Three possibilities (handle):
    • If it's a redirect to an external site, it's fetched with http.get() and returned as-is
    • If it's a request to an image, it's fetched from imgproxy with http.get()
    • If it's a request to any other file, it's fetched from minio with GetObject()
  10. GetObject() call to minio for _headers, then parsed
  11. Regex match on every line of _headers
  12. DB Query 4 (WHERE clause on feature_flags.user_id && name) (AnalyticsVisitFromRequest)
  13. Regex match on possible crawlers (trackableRequest -> IsCrawler)
  14. (async) DB Query 5 (INSERT INTO analytics_visits) (AnalyticsCollect)

The following are some ideas for improving performance:

For (1), I predict the DNS lookup is the slowest operation in the list since (I think) all the other operations don't leave your single VM. Are you using Oracle Linux? If it's anything like RHEL, then local DNS caching is not enabled by default. If you enable systemd-resolved, it will cache 4096 responses for up to 2 hours and it will respect the TTL. Users should be encouraged to set high TTLs (>1 hour) to improve performance.
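
If it helps to confirm that prediction, a quick timing harness like this could be run on the VM before and after enabling systemd-resolved (a rough sketch; the record name is a placeholder for whatever GetCustomDomain actually looks up):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder: substitute the TXT record a real custom domain uses.
	const name = "_pgs.example.com"

	for i := 0; i < 5; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		start := time.Now()
		txts, err := net.DefaultResolver.LookupTXT(ctx, name)
		cancel()
		fmt.Printf("lookup %d: %v (%d records, err=%v)\n", i, time.Since(start), len(txts), err)
	}
	// With a local cache (systemd-resolved), every lookup after the first
	// should come back in well under a millisecond until the TTL expires.
}
```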

For the database queries (3, 5, 6, 12), an easy win would be to fetch the feature_flags in the same query where we fetch the user, but then we'd still be performing 2 queries per request. Possibly caching is a better solution, see below.
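
As an illustration of the single-round-trip idea, the combined query could look something like this (a hedged sketch using database/sql; the table and column names are guesses based on the WHERE clauses listed above, not the real schema):

```go
package db

import (
	"context"
	"database/sql"
)

// UserWithFlags is a hypothetical shape for the combined result.
type UserWithFlags struct {
	ID    string
	Name  string
	Flags []string
}

// findUserWithFlags fetches the user row and any feature flags in one query
// instead of one query for the user plus another per flag lookup.
func findUserWithFlags(ctx context.Context, db *sql.DB, username string) (*UserWithFlags, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT u.id, u.name, f.name
		FROM app_users u
		LEFT JOIN feature_flags f ON f.user_id = u.id
		WHERE u.name = $1`, username)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var user *UserWithFlags
	for rows.Next() {
		var id, name string
		var flag sql.NullString
		if err := rows.Scan(&id, &name, &flag); err != nil {
			return nil, err
		}
		if user == nil {
			user = &UserWithFlags{ID: id, Name: name}
		}
		if flag.Valid {
			user.Flags = append(user.Flags, flag.String)
		}
	}
	return user, rows.Err()
}
```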

For the GetBucket() call (4), that will send a BucketExists() request to Minio. Technically that's not necessary since you can create Bucket objects using just the name of the bucket.
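
A minimal sketch of that idea, assuming the storage layer's bucket handle can be (or could be made) just a named value:

```go
package storage

// Bucket stands in for whatever handle the storage interface returns; the
// point is that it can be constructed from the bucket name alone.
type Bucket struct {
	Name string
}

// getBucketFast skips the BucketExists() round trip to minio. A genuinely
// missing bucket still surfaces as an error from the later GetObject() call,
// so the existence check only adds latency on the hot path.
func getBucketFast(name string) Bucket {
	return Bucket{Name: name}
}
```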

For the GetObject() calls to read _redirects and _headers, I think caching is our only hope. Caching these would also allow us to cache the compiled regexes.

Caching

Since all of this data is small, I think we could use an in-process, in-memory cache like https://github.com/hashicorp/golang-lru (the 2q implementation sounds smarter since it considers frequency in addition to recency; never mind, if we want to set TTLs we have to use the LRU version). The following work would be required (a rough sketch of the cache follows the list):

  1. Limit _redirects and _headers to something like 5KB so our cache size is bounded.
  2. When a site is first requested, we create something like the AssetHandler struct (plus routes parsed from _redirects and _headers) and save it to the cache, keyed on the user-project slug, with a default TTL of something reasonable like 1 hour (so any caching bug we happen to introduce resolves itself within an hour).
    • We check for the cached version before creating a new one.
    • We don't cache the TXT record; we rely on systemd-resolved caching since that is designed to respect DNS TTLs. That allows us to quickly determine the user-project slug key that we need for reading from our own cache.
  3. When any of the following happens, we delete our cache entry if it exists:
    • User deleted (or just let it expire)
    • Project deleted (or just let it expire)
    • Project disabled (or just let it expire)
    • User flags change like pico+ or analytics
    • New _redirects or _headers file uploaded (or I guess we could clear the cache on any upload as a user-controlled way of clearing their own cache)
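
Here's the rough sketch mentioned above of what the cache itself might look like, using golang-lru's expirable LRU (the CachedSite struct, key format, and sizes are illustrative, not real pgs types):

```go
package cache

import (
	"regexp"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// CachedSite is a stand-in for the cached per-project state: the parsed
// _redirects/_headers rules plus their pre-compiled regexes.
type CachedSite struct {
	Redirects []*regexp.Regexp
	Headers   []*regexp.Regexp
	LoadedAt  time.Time
}

// Keyed on the user-project slug (e.g. "username-projectname"), capped at
// 1024 entries, and expired after an hour so any invalidation bug heals itself.
var siteCache = expirable.NewLRU[string, *CachedSite](1024, nil, time.Hour)

func getSite(slug string, load func() (*CachedSite, error)) (*CachedSite, error) {
	if site, ok := siteCache.Get(slug); ok {
		return site, nil
	}
	site, err := load() // reads _redirects/_headers (capped at ~5KB) and compiles the rules
	if err != nil {
		return nil, err
	}
	siteCache.Add(slug, site)
	return site, nil
}

// invalidate is what pgs-ssh would call on uploads, project deletion, or
// feature-flag changes.
func invalidate(slug string) {
	siteCache.Remove(slug)
}
```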

If we do all that, then we can serve assets with a single locally-cached DNS lookup, a single hash table lookup, and a single GetObject() minio call! 🚀

Thoughts? I can contribute to some of this. It's a pretty big change, so I just wanted to run it by you before diving in too deep.

neurosnap commented 1 month ago

Hi! Wow, thanks for this in-depth feedback!

So I like all of these changes and can commit to getting this across the finish line.

There are a couple of pieces that will require help from us:

5KB seems perfectly reasonable to me, so let's go for it.

For caching, what do you think about the ability to wipe the in-memory cache? In particular:

Do you have any thoughts on supporting those features?

Finally, let me know what parts you'd like to work on and I can commit to whatever else.

Thanks again, this is awesome!

mac-chaffee commented 3 weeks ago

Some good discussion in this PR: https://github.com/picosh/pico/pull/154

mac-chaffee commented 3 weeks ago

antoniomika and I have had a chat. Because of the way we record analytics and consider it a first-class feature at pico, we are leaning towards implementing the caching mechanism in our app code, although we can be convinced otherwise if there's a better way to outsource HTTP caching (which is a solved problem that doesn't need re-inventing).

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

The main change would be "pulling" analytics from whatever is doing the caching (either a CDN or a distributed cache storage system) instead of making the caching system "push" view counts to pgs. Could also use the opportunity to store the metrics in a time series database. Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

The specifics would depend on which method you pick for caching. But you could make that choice free from the burden of working around the current analytics system. If you do decide to build analytics around the cache, then we could just disable caching for .html files as a first step as we build the caching system.

Just thinking out loud, no strong opinions. My main goal is ensuring you're happy with the resulting setup :)

neurosnap commented 3 weeks ago

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

Thanks for the reminder! This got us from 3+ seconds to around 1 second: https://github.com/picosh/pico/commit/ac3be1722d7dbfb69fe98697af2bd071fa3dbf0c

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

I love this thinking. To provide some context, the reason we built our own analytics system is that we wanted to be in full control of how the data gets aggregated and stored, without using or being up-sold on something off-the-shelf. We have positioned ourselves to host privacy-focused services, and all the other self-hosted solutions I found felt very heavy; they didn't pass the BYO test for me.

Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

I'm definitely open to converting our analytics table to use TimescaleDB. I did investigate it when building this, but decided against it mainly because of the complexity. I do think there are other things we can try with a vanilla db table before switching over (e.g. a partitioned table). However, that doesn't change our push/pull mechanism.
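
For example (a hedged sketch, with illustrative table and column names), monthly range partitioning in vanilla Postgres could look like:

```go
package analytics

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// createMonthlyPartition sketches monthly range partitioning of the visits
// table without TimescaleDB. It assumes the parent table was declared once as:
//
//	CREATE TABLE analytics_visits (..., created_at timestamptz NOT NULL)
//	PARTITION BY RANGE (created_at);
func createMonthlyPartition(ctx context.Context, db *sql.DB, month time.Time) error {
	start := time.Date(month.Year(), month.Month(), 1, 0, 0, 0, 0, time.UTC)
	end := start.AddDate(0, 1, 0)
	stmt := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS analytics_visits_%s PARTITION OF analytics_visits
		 FOR VALUES FROM ('%s') TO ('%s')`,
		start.Format("2006_01"), start.Format("2006-01-02"), end.Format("2006-01-02"),
	)
	_, err := db.ExecContext(ctx, stmt)
	return err
}
```

Old months can then be dropped or detached cheaply, which would also give analytics the expiration mechanism it currently lacks.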

If we went with Souin or another off-the-shelf cache system, how could we record site-usage analytics using pull?

mac-chaffee commented 3 weeks ago

Wow the analytics are a heck of a lot faster now! Thanks! 🚀

We have positioned ourselves to host privacy-focused services

Thanks for clarifying, that makes sense (and is one of the reasons I moved my site to pgs!). I think that means we can cross major CDN providers off the list (using tuns instead of Cloudflare Tunnels is another reason I signed up).

I also have doubts about http-cache in the context of multi-regional pgs due to the cache invalidation issue.

I can think of four possible options that involve rebuilding analytics around caching:

Option 1: Scrape Prometheus metrics from Souin

If we use Souin, we'd need a new feature added for obtaining page view counts. I'd bet darkweak would want this feature to be generic enough for others to use, and the best idea I can think of is to expand the existing Prometheus metrics to include hostname and route labels (disabled by default, enabled via config). It seems you're already collecting some Prometheus metrics for pgs.
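
For a sense of what those labels could look like on the cache side, here's a hedged sketch with client_golang (the metric name and label set are made up; real label cardinality would need care):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// pageViews is a sketch of a per-host/per-route counter that a cache like
// Souin could expose for scraping.
var pageViews = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "pgs_cache_page_views_total",
	Help: "Page views served by the cache, by host and path.",
}, []string{"host", "path"})

func countView(r *http.Request) {
	pageViews.WithLabelValues(r.Host, r.URL.Path).Inc()
}
```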

Pros:

Cons:

Lack of ability to filter out bots may make this option a non-starter.

Option 2: Souin + no-cache + conditional GET requests for HTML files

This option involves allowing Souin to cache all files except HTML files. For HTML files, we return the following headers:

Then we add code to pgs that returns a 304 Not Modified response, while also asynchronously inserting the page view into postgres like normal.

Pgs would probably need its own kind of cache+TTL thing, like a hashmap of ETags, so it can quickly return 304s. This hashmap would of course need to be purged somehow from pgs-ssh.
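
A rough sketch of that 304 path (etagCache and recordVisit are hypothetical names, not existing pgs code):

```go
package pgs

import (
	"net/http"
	"sync"
)

// etagCache is a hypothetical per-route ETag map; it would need the same
// TTL/purge treatment as the rest of the cache.
var etagCache sync.Map // map[string]string, keyed on host+path

func serveHTML(w http.ResponseWriter, r *http.Request) {
	key := r.Host + r.URL.Path
	etag, ok := etagCache.Load(key)

	// Count the view regardless of whether we serve a body; recordVisit is a
	// stand-in for the existing async AnalyticsCollect insert.
	go recordVisit(r)

	if ok && r.Header.Get("If-None-Match") == etag.(string) {
		w.WriteHeader(http.StatusNotModified)
		return
	}

	// ...otherwise fall through to the normal minio fetch, then:
	// w.Header().Set("ETag", newEtag)
	// etagCache.Store(key, newEtag)
}

func recordVisit(r *http.Request) {
	// placeholder for AnalyticsVisitFromRequest + AnalyticsCollect
}
```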

Pros:

Cons:

Option 3: JavaScript Beacon-based analytics

Something at the back of my mind in this analytics discussion is that the best kind of caching is really client-side caching (I set Cache-Control: max-age=1800 for all my site's static assets), but that completely breaks analytics. That's why most analytics providers use a little JavaScript with sendBeacon() to count page views. This can still respect privacy; see https://www.goatcounter.com/

Pros:

Cons:

Option 4: Use Souin and parse Caddy access logs

Apparently this is how Cloudflare's analytics works when you don't use their JavaScript.
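
To make the parsing side concrete, here's a hedged sketch of turning one Caddy JSON access-log line into a page view (field names follow Caddy's default JSON access log; the Visit struct is illustrative):

```go
package logdrain

import (
	"encoding/json"
	"time"
)

// caddyLog models the handful of fields we care about from Caddy's JSON
// access log entries.
type caddyLog struct {
	TS      float64 `json:"ts"`
	Status  int     `json:"status"`
	Request struct {
		Host    string              `json:"host"`
		URI     string              `json:"uri"`
		Headers map[string][]string `json:"headers"`
	} `json:"request"`
}

// Visit is a stand-in for whatever record the analytics pipeline stores.
type Visit struct {
	Host      string
	Path      string
	UserAgent string
	Referer   string
	Status    int
	Time      time.Time
}

func parseLine(line []byte) (*Visit, error) {
	var entry caddyLog
	if err := json.Unmarshal(line, &entry); err != nil {
		return nil, err
	}
	sec := int64(entry.TS)
	frac := entry.TS - float64(sec)
	v := &Visit{
		Host:   entry.Request.Host,
		Path:   entry.Request.URI,
		Status: entry.Status,
		Time:   time.Unix(sec, int64(frac*1e9)),
	}
	if ua := entry.Request.Headers["User-Agent"]; len(ua) > 0 {
		v.UserAgent = ua[0] // bot filtering (IsCrawler) could happen here
	}
	if ref := entry.Request.Headers["Referer"]; len(ref) > 0 {
		v.Referer = ref[0]
	}
	return v, nil
}
```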

You'd need the following changes:

Pros:

Cons:

I'm leaning toward option 4, but what do you think? Are there other options?

mac-chaffee commented 3 weeks ago

Btw, I updated my branch with some of the things discussed: https://github.com/picosh/pico/compare/main...mac-chaffee:pico:caddy-caching

neurosnap commented 1 week ago

Hey @mac-chaffee

Sorry for the long delay, we had some other pico work that we thought we could fold nicely into your caching work. In particular, we have deployed a change to how we collect site usage analytics. We are now using pipe to collect these metrics (we are calling it a metric-drain).

I think you are right, we should go with option (4).

Here's a patchset that I'm prototyping to adapt our metric drain to receive caddy logs: https://pr.pico.sh/prs/35

Once that is complete, all we need to do is figure out how to send the logs from caddy to pipe in a resilient way; you can leave that part to us to think about.

For the sake of argument, let's say we have a solution for caddy and our metric-drain: what's left for us to start using your contributions?

mac-chaffee commented 1 week ago

No worries!

You all are really serious about dog-fooding haha, that sounds like a cool solution.

For my own understanding, have you decided roughly what the multi-region setup would look like? Would this be like hub-and-spoke where all the stateful stuff (database, minio) lives on a single "hub" server with regional "spokes" that run caddy+pgs/pico?

mac-chaffee commented 1 week ago

what's left for us to start using your contributions?

To answer the question, honestly I think I'd just have to rebase my branch and try it out! When the metrics-drain+caddy logs solution is ready, you'd just delete the line that disables caching for html files: r.Header.Set("cache-control", "no-cache").