picosh / pico

services that empower developers using ssh
https://pico.sh
MIT License

Improving pgs asset serving performance (brainstorming) #149

Open mac-chaffee opened 1 month ago

mac-chaffee commented 1 month ago

Hello! This weekend I was interested in learning how pgs worked so I looked through the code and wrote down any possible places that I thought could affect performance. I didn't do any runtime testing, so take this with a grain of salt.

To serve a single HTML file, the following must happen:

  1. DNS lookup for TXT record to determine project name (findRouteConfig -> GetCustomDomain)
  2. Regex match on all routes (CreateServeBasic)
  3. DB Query 1 (WHERE clause on app_users.name) (FindUserForName)
  4. GetBucket() call to minio
  5. DB Query 2 (WHERE clause on projects.user_id && name) (FindProjectByName)
  6. DB Query 3 (WHERE clause on feature_flags.user_id && name) (HasFeatureForUser)
  7. GetObject() call to minio for _redirects, then parsed
    • (note: redirects may be ignored after minio error)
    • (note: redirect file loaded into RAM without limits)
  8. Regex match on every line of _redirects (calcRoutes)
  9. Three possibilities (handle):
    • If it's a redirect to an external site, it's fetched with http.get() and returned as-is
    • If it's a request to an image, it's fetched from imgproxy with http.get()
    • If it's a request to any other file, it's fetched from minio with GetObject()
  10. GetObject() call to minio for _headers, then parsed
  11. Regex match on every line of _headers
  12. DB Query 4 (WHERE clause on feature_flags.user_id && name) (AnalyticsVisitFromRequest)
  13. Regex match on possible crawlers (trackableRequest -> IsCrawler)
  14. (async) DB Query 5 (INSERT INTO analytics_visits) (AnalyticsCollect)

The following are some ideas for improving performance:

For (1), I predict the DNS lookup is the slowest operation in the list since (I think) all the other operations don't leave your single VM. Are you using Oracle Linux? If it's anything like RHEL, then local DNS caching is not enabled by default. If you enable systemd-resolved, it will cache 4096 responses for up to 2 hours and it will respect the TTL. Users should be encouraged to set high TTLs (>1 hour) to improve performance.
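
If it helps to confirm that prediction, a quick timing harness like this could be run on the VM before and after enabling systemd-resolved (a rough sketch; the record name is a placeholder for whatever GetCustomDomain actually looks up):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder: substitute the TXT record a real custom domain uses.
	const name = "_pgs.example.com"

	for i := 0; i < 5; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		start := time.Now()
		txts, err := net.DefaultResolver.LookupTXT(ctx, name)
		cancel()
		fmt.Printf("lookup %d: %v (%d records, err=%v)\n", i, time.Since(start), len(txts), err)
	}
	// With a local cache (systemd-resolved), every lookup after the first
	// should come back in well under a millisecond until the TTL expires.
}
```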

For the database queries (3, 5, 6, 12), an easy win would be to fetch the feature_flags in the same query where we fetch the user, but then we'd still be performing 2 queries per request. Possibly caching is a better solution, see below.
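
As an illustration of the single-round-trip idea, the combined query could look something like this (a hedged sketch using database/sql; the table and column names are guesses based on the WHERE clauses listed above, not the real schema):

```go
package db

import (
	"context"
	"database/sql"
)

// UserWithFlags is a hypothetical shape for the combined result.
type UserWithFlags struct {
	ID    string
	Name  string
	Flags []string
}

// findUserWithFlags fetches the user row and any feature flags in one query
// instead of one query for the user plus another per flag lookup.
func findUserWithFlags(ctx context.Context, db *sql.DB, username string) (*UserWithFlags, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT u.id, u.name, f.name
		FROM app_users u
		LEFT JOIN feature_flags f ON f.user_id = u.id
		WHERE u.name = $1`, username)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var user *UserWithFlags
	for rows.Next() {
		var id, name string
		var flag sql.NullString
		if err := rows.Scan(&id, &name, &flag); err != nil {
			return nil, err
		}
		if user == nil {
			user = &UserWithFlags{ID: id, Name: name}
		}
		if flag.Valid {
			user.Flags = append(user.Flags, flag.String)
		}
	}
	return user, rows.Err()
}
```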

For the GetBucket() call (4), that will send a BucketExists() request to Minio. Technically that's not necessary since you can create Bucket objects using just the name of the bucket.
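
A minimal sketch of that idea, assuming the storage layer's bucket handle can be (or could be made) just a named value:

```go
package storage

// Bucket stands in for whatever handle the storage interface returns; the
// point is that it can be constructed from the bucket name alone.
type Bucket struct {
	Name string
}

// getBucketFast skips the BucketExists() round trip to minio. A genuinely
// missing bucket still surfaces as an error from the later GetObject() call,
// so the existence check only adds latency on the hot path.
func getBucketFast(name string) Bucket {
	return Bucket{Name: name}
}
```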

For the GetObject() calls to read _redirects and _headers, I think caching is our only hope. Caching these would also allow us to cache the compiled regexes.

Caching

Since all of this data is small, I think we could use an in-process, in-memory cache like https://github.com/hashicorp/golang-lru (the 2q implementation sounds smarter since it considers frequency in addition to recency; never mind, if we want to set TTLs we have to use the LRU version). The following work would be required (a rough sketch of the cache follows the list):

  1. Limit _redirects and _headers to something like 5KB so our cache size is bounded.
  2. When a site is first requested, we create something like the AssetHandler struct (plus routes parsed from _redirects and _headers) and save it to the cache, keyed on the user-project slug, with a default TTL of something reasonable like 1 hour (so any caching bug we happen to introduce resolves itself within an hour).
    • We check for the cached version before creating a new one.
    • We don't cache the TXT record; we rely on systemd-resolved caching since that is designed to respect DNS TTLs. That allows us to quickly determine the user-project slug key that we need for reading from our own cache.
  3. When any of the following happens, we delete our cache entry if it exists:
    • User deleted (or just let it expire)
    • Project deleted (or just let it expire)
    • Project disabled (or just let it expire)
    • User flags change like pico+ or analytics
    • New _redirects or _headers file uploaded (or I guess we could clear the cache on any upload as a user-controlled way of clearing their own cache)
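
Here's the rough sketch mentioned above of what the cache itself might look like, using golang-lru's expirable LRU (the CachedSite struct, key format, and sizes are illustrative, not real pgs types):

```go
package cache

import (
	"regexp"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// CachedSite is a stand-in for the cached per-project state: the parsed
// _redirects/_headers rules plus their pre-compiled regexes.
type CachedSite struct {
	Redirects []*regexp.Regexp
	Headers   []*regexp.Regexp
	LoadedAt  time.Time
}

// Keyed on the user-project slug (e.g. "username-projectname"), capped at
// 1024 entries, and expired after an hour so any invalidation bug heals itself.
var siteCache = expirable.NewLRU[string, *CachedSite](1024, nil, time.Hour)

func getSite(slug string, load func() (*CachedSite, error)) (*CachedSite, error) {
	if site, ok := siteCache.Get(slug); ok {
		return site, nil
	}
	site, err := load() // reads _redirects/_headers (capped at ~5KB) and compiles the rules
	if err != nil {
		return nil, err
	}
	siteCache.Add(slug, site)
	return site, nil
}

// invalidate is what pgs-ssh would call on uploads, project deletion, or
// feature-flag changes.
func invalidate(slug string) {
	siteCache.Remove(slug)
}
```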

If we do all that, then we can serve assets with a single locally-cached DNS lookup, a single hash table lookup, and a single GetObject() minio call! 🚀

Thoughts? I can contribute to some of this. It's a pretty big change, so I just wanted to run it by you before diving in too deep.

neurosnap commented 1 month ago

Hi! Wow, thanks for this in-depth feedback!

So I like all of these changes and can commit to getting this across the finish line.

There are a couple of pieces that will require help from us:

5KB seems perfectly reasonable to me, so let's go for it.

For caching, what do you think about the ability to wipe the in-memory cache? In particular:

Do you have any thoughts on supporting those features?

Finally, let me know what parts you'd like to work on and I can commit to whatever else.

Thanks again, this is awesome!

mac-chaffee commented 3 weeks ago

Some good discussion in this PR: https://github.com/picosh/pico/pull/154

mac-chaffee commented 3 weeks ago

antoniomika and I have had a chat. Because of the way we record analytics and consider it a first-class feature at pico, we are leaning towards implementing the caching mechanism in our app code, although we can be convinced otherwise if there's a better way to outsource HTTP caching (which is a solved problem that doesn't need re-inventing).

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

The main change would be "pulling" analytics from whatever is doing the caching (either a CDN or a distributed cache storage system) instead of making the caching system "push" view counts to pgs. Could also use the opportunity to store the metrics in a time series database. Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

The specifics would depend on which method you pick for caching. But you could make that choice free from the burden of working around the current analytics system. If you do decide to build analytics around the cache, then we could just disable caching for .html files as a first step as we build the caching system.

Just thinking out loud, no strong opinions. My main goal is ensuring you're happy with the resulting setup :)

neurosnap commented 3 weeks ago

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

Thanks for the reminder! This got us from 3+ seconds to around 1 second: https://github.com/picosh/pico/commit/ac3be1722d7dbfb69fe98697af2bd071fa3dbf0c

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

I love this thinking. To provide some context, the reason we built our own analytics system is that we wanted to be in full control of how the data gets aggregated and stored, without using or being up-sold on something off-the-shelf. We have positioned ourselves to host privacy-focused services, and all the other self-hosted solutions I found felt very heavy; they didn't pass the BYO test for me.

Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

I'm definitely open to converting our analytics table to use TimescaleDB. I did investigate it when building this, but decided against it mainly because of the complexity. I do think there are other things we can try with a vanilla db table before switching over (e.g. a partitioned table). However, that doesn't change our push/pull mechanism.
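
For example (a hedged sketch, with illustrative table and column names), monthly range partitioning in vanilla Postgres could look like:

```go
package analytics

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// createMonthlyPartition sketches monthly range partitioning of the visits
// table without TimescaleDB. It assumes the parent table was declared once as:
//
//	CREATE TABLE analytics_visits (..., created_at timestamptz NOT NULL)
//	PARTITION BY RANGE (created_at);
func createMonthlyPartition(ctx context.Context, db *sql.DB, month time.Time) error {
	start := time.Date(month.Year(), month.Month(), 1, 0, 0, 0, 0, time.UTC)
	end := start.AddDate(0, 1, 0)
	stmt := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS analytics_visits_%s PARTITION OF analytics_visits
		 FOR VALUES FROM ('%s') TO ('%s')`,
		start.Format("2006_01"), start.Format("2006-01-02"), end.Format("2006-01-02"),
	)
	_, err := db.ExecContext(ctx, stmt)
	return err
}
```

Old months can then be dropped or detached cheaply, which would also give analytics the expiration mechanism it currently lacks.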

If we went with Souin or another off-the-shelf cache system, how could we record site-usage analytics using pull?

mac-chaffee commented 3 weeks ago

Wow the analytics are a heck of a lot faster now! Thanks! 🚀

We have positioned ourselves to host privacy-focused services

Thanks for clarifying, that makes sense (and is one of the reasons I moved my site to pgs!). I think that means we can cross major CDN providers off the list (using tuns instead of Cloudflare Tunnels is another reason I signed up).

I also have doubts about http-cache in the context of multi-regional pgs due to the cache invalidation issue.

I can think of four possible options that involve rebuilding analytics around caching:

Option 1: Scrape Prometheus metrics from Souin

If we use Souin, we'd need a new feature added for obtaining page view counts. I'd bet darkweak would want this feature to be generic enough for others to use, and the best idea I can think of is to expand the existing Prometheus metrics to include hostname and route labels (disabled by default, enabled via config). It seems you're already collecting some Prometheus metrics for pgs.
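
For a sense of what those labels could look like on the cache side, here's a hedged sketch with client_golang (the metric name and label set are made up; real label cardinality would need care):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// pageViews is a sketch of a per-host/per-route counter that a cache like
// Souin could expose for scraping.
var pageViews = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "pgs_cache_page_views_total",
	Help: "Page views served by the cache, by host and path.",
}, []string{"host", "path"})

func countView(r *http.Request) {
	pageViews.WithLabelValues(r.Host, r.URL.Path).Inc()
}
```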

Pros:

Cons:

Lack of ability to filter out bots may make this option a non-starter.

Option 2: Souin + no-cache + conditional GET requests for HTML files

This option involves allowing Souin to cache all files except HTML files. For HTML files, we return the following headers:

Then we add code to pgs that returns a 304 Not Modified response, while also asynchronously inserting the page view into postgres like normal.

Pgs would probably need its own kind of cache+TTL thing, like a hashmap of ETags, so it can quickly return 304s. This hashmap would of course need to be purged somehow from pgs-ssh.
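
A rough sketch of that 304 path (etagCache and recordVisit are hypothetical names, not existing pgs code):

```go
package pgs

import (
	"net/http"
	"sync"
)

// etagCache is a hypothetical per-route ETag map; it would need the same
// TTL/purge treatment as the rest of the cache.
var etagCache sync.Map // map[string]string, keyed on host+path

func serveHTML(w http.ResponseWriter, r *http.Request) {
	key := r.Host + r.URL.Path
	etag, ok := etagCache.Load(key)

	// Count the view regardless of whether we serve a body; recordVisit is a
	// stand-in for the existing async AnalyticsCollect insert.
	go recordVisit(r)

	if ok && r.Header.Get("If-None-Match") == etag.(string) {
		w.WriteHeader(http.StatusNotModified)
		return
	}

	// ...otherwise fall through to the normal minio fetch, then:
	// w.Header().Set("ETag", newEtag)
	// etagCache.Store(key, newEtag)
}

func recordVisit(r *http.Request) {
	// placeholder for AnalyticsVisitFromRequest + AnalyticsCollect
}
```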

Pros:

Cons:

Option 3: JavaScript Beacon-based analytics

Something at the back of my mind in this analytics discussion is that the best kind of caching is really client-side caching (I set Cache-Control: max-age=1800 for all my site's static assets), but that completely breaks analytics. That's why most analytics providers use a little JavaScript with sendBeacon() to count page views. This can still respect privacy; see https://www.goatcounter.com/

Pros:

Cons:

Option 4: Use Souin and parse Caddy access logs

Apparently this is how Cloudflare's analytics works when you don't use their JavaScript.
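
To make the parsing side concrete, here's a hedged sketch of turning one Caddy JSON access-log line into a page view (field names follow Caddy's default JSON access log; the Visit struct is illustrative):

```go
package logdrain

import (
	"encoding/json"
	"time"
)

// caddyLog models the handful of fields we care about from Caddy's JSON
// access log entries.
type caddyLog struct {
	TS      float64 `json:"ts"`
	Status  int     `json:"status"`
	Request struct {
		Host    string              `json:"host"`
		URI     string              `json:"uri"`
		Headers map[string][]string `json:"headers"`
	} `json:"request"`
}

// Visit is a stand-in for whatever record the analytics pipeline stores.
type Visit struct {
	Host      string
	Path      string
	UserAgent string
	Referer   string
	Status    int
	Time      time.Time
}

func parseLine(line []byte) (*Visit, error) {
	var entry caddyLog
	if err := json.Unmarshal(line, &entry); err != nil {
		return nil, err
	}
	sec := int64(entry.TS)
	frac := entry.TS - float64(sec)
	v := &Visit{
		Host:   entry.Request.Host,
		Path:   entry.Request.URI,
		Status: entry.Status,
		Time:   time.Unix(sec, int64(frac*1e9)),
	}
	if ua := entry.Request.Headers["User-Agent"]; len(ua) > 0 {
		v.UserAgent = ua[0] // bot filtering (IsCrawler) could happen here
	}
	if ref := entry.Request.Headers["Referer"]; len(ref) > 0 {
		v.Referer = ref[0]
	}
	return v, nil
}
```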

You'd need the following changes:

Pros:

Cons:

I'm leaning toward option 4, but what do you think? Are there other options?

mac-chaffee commented 3 weeks ago

Btw, I updated my branch with some of the things discussed: https://github.com/picosh/pico/compare/main...mac-chaffee:pico:caddy-caching

neurosnap commented 1 week ago

Hey @mac-chaffee

Sorry for the long delay, we had some other pico work that we thought we could fold nicely into your caching work. In particular, we have deployed a change to how we collect site usage analytics. We are now using pipe to collect these metrics (we are calling it a metric-drain).

I think you are right, we should go with option (4).

Here's a patchset that I'm prototyping to adapt our metric drain to receive caddy logs: https://pr.pico.sh/prs/35

Once that is complete, all we need to do is figure out how to send the logs from caddy to pipe in a resilient way; you can leave that part to us to think about.

For the sake of argument, let's say we have a solution for caddy and our metric-drain: what's left for us to start using your contributions?

mac-chaffee commented 1 week ago

No worries!

You all are really serious about dog-fooding haha, that sounds like a cool solution.

For my own understanding, have you decided roughly what the multi-region setup would look like? Would this be like hub-and-spoke where all the stateful stuff (database, minio) lives on a single "hub" server with regional "spokes" that run caddy+pgs/pico?

mac-chaffee commented 1 week ago

what's left for us to start using your contributions?

To answer the question, honestly I think I'd just have to rebase my branch and try it out! When the metrics-drain+caddy logs solution is ready, you'd just delete the line that disables caching for html files: r.Header.Set("cache-control", "no-cache").