picosh / pico

hacker labs - open source and managed web services leveraging SSH
https://pico.sh
MIT License

Improving pgs asset serving performance (brainstorming) #149

Open mac-chaffee opened 1 week ago

mac-chaffee commented 1 week ago

Hello! This weekend I was interested in learning how pgs worked, so I looked through the code and wrote down every place I thought could affect performance. I didn't do any runtime testing, so take this with a grain of salt.

To serve a single HTML file, the following must happen:

  1. DNS lookup for TXT record to determine project name (findRouteConfig -> GetCustomDomain)
  2. Regex match on all routes (CreateServeBasic)
  3. DB Query 1 (WHERE clause on app_users.name) (FindUserForName)
  4. GetBucket() call to minio
  5. DB Query 2 (WHERE clause on projects.user_id && name) (FindProjectByName)
  6. DB Query 3 (WHERE clause on feature_flags.user_id && name) (HasFeatureForUser)
  7. GetObject() call to minio for _redirects, then parsed
    • (note: redirects may be silently ignored after a minio error)
    • (note: the redirect file is loaded into RAM without limits; see the size-guard sketch after this list)
  8. Regex match on every line of _redirects (calcRoutes)
  9. Three possibilities (handle):
    • If it's a redirect to an external site, it's fetched with http.Get() and returned as-is
    • If it's a request to an image, it's fetched from imgproxy with http.Get()
    • If it's a request to any other file, it's fetched from minio with GetObject()
  10. GetObject() call to minio for _headers, then parsed
  11. Regex match on every line of _headers
  12. DB Query 4 (WHERE clause on feature_flags.user_id && name) (AnalyticsVisitFromRequest)
  13. Regex match on possible crawlers (trackableRequest -> IsCrawler)
  14. (async) DB Query 5 (INSERT INTO analytics_visits) (AnalyticsCollect)
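
Regarding the note in (7) about unbounded reads, here's a minimal sketch of the kind of size guard I have in mind, assuming a 5KB cap (the function name and constant are made up, not pgs's actual code):

```go
package main

import (
	"errors"
	"io"
)

const maxSpecialFileSize = 5 * 1024 // hypothetical 5KB cap, per the proposal below

var errSpecialFileTooLarge = errors.New("_redirects/_headers file exceeds size limit")

// readSpecialFile reads at most maxSpecialFileSize bytes from r.
// If the file is larger, it returns an error instead of buffering it all.
func readSpecialFile(r io.Reader) ([]byte, error) {
	// Read one extra byte so we can tell "exactly at the limit" from "over it".
	buf, err := io.ReadAll(io.LimitReader(r, maxSpecialFileSize+1))
	if err != nil {
		return nil, err
	}
	if len(buf) > maxSpecialFileSize {
		return nil, errSpecialFileTooLarge
	}
	return buf, nil
}
```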

The following are some ideas for improving performance:

For (1), I predict the DNS lookup is the slowest operation in the list, since (I think) all the other operations never leave your single VM. Are you using Oracle Linux? If it's anything like RHEL, local DNS caching is not enabled by default. If you enable systemd-resolved, it will cache up to 4096 responses for up to 2 hours, while still respecting each record's TTL. Users should also be encouraged to set high TTLs (>1 hour) on their TXT records to improve performance.
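
For example, enabling the cache would look something like this (assuming a systemd-based distro; exact steps vary):

```sh
# Enable the local caching stub resolver
sudo systemctl enable --now systemd-resolved

# Point /etc/resolv.conf at the stub so all lookups go through the cache
sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf

# Later, verify the cache is actually getting hits
resolvectl statistics
```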

For the database queries (3, 5, 6, and 12), an easy win would be to fetch the feature_flags in the same query that fetches the user, but then we'd still be performing two queries per request. Caching is probably the better solution; see below.
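
A sketch of what that combined query could look like; the table and column names come from the queries listed above, but the struct, exact columns, and function name are guesses, not pgs's actual schema:

```go
package main

import "database/sql"

// userWithFlags is a hypothetical combined result: the user row plus all of
// their feature flag names, fetched in a single round trip via LEFT JOIN.
type userWithFlags struct {
	ID    string
	Name  string
	Flags []string
}

func findUserWithFlags(db *sql.DB, name string) (*userWithFlags, error) {
	rows, err := db.Query(`
		SELECT u.id, u.name, f.name
		FROM app_users u
		LEFT JOIN feature_flags f ON f.user_id = u.id
		WHERE u.name = $1`, name)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var user *userWithFlags
	for rows.Next() {
		var id, uname string
		var flag sql.NullString
		if err := rows.Scan(&id, &uname, &flag); err != nil {
			return nil, err
		}
		if user == nil {
			user = &userWithFlags{ID: id, Name: uname}
		}
		if flag.Valid { // NULL when the user has no flags (LEFT JOIN)
			user.Flags = append(user.Flags, flag.String)
		}
	}
	return user, rows.Err()
}
```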

For the GetBucket() call (4), that sends a BucketExists() request to minio. Technically that's not necessary, since you can construct a Bucket object from just the bucket name.
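
For illustration, assuming minio-go sits underneath pgs's storage layer (the function name here is hypothetical):

```go
package main

import (
	"context"
	"io"

	"github.com/minio/minio-go/v7"
)

// A bucket is really just a name here, so the BucketExists() round trip can
// be skipped. Note that GetObject is lazy: a missing bucket or key only
// errors on the first Read, so the up-front existence check buys us nothing.
func fetchRedirects(ctx context.Context, client *minio.Client, bucket string) ([]byte, error) {
	obj, err := client.GetObject(ctx, bucket, "_redirects", minio.GetObjectOptions{})
	if err != nil {
		return nil, err
	}
	defer obj.Close()
	return io.ReadAll(obj) // real code should size-limit this, per the sketch above
}
```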

For the GetObject() calls that read _redirects and _headers (7, 10), I think caching is our only hope. Caching them would also allow us to cache the compiled regexes.

Caching

Since all of this data is small, I think we could use an in-process, in-memory cache like https://github.com/hashicorp/golang-lru. (The 2Q implementation sounds smarter since it considers frequency in addition to recency, but never mind: if we want to set TTLs, we have to use the expirable LRU.) The following work would be required (see the sketch after this list):

  1. Limit _redirects and _headers to something like 5KB so our cache size is bounded.
  2. When a site is first requested, we create something like that AssetHandler struct (plus routes parsed from _redirects and _headers) and save it to the cache, keyed on the user-project slug with a default TTL of something reasonable like 1 hour (so any caching bug we happen to introduce resolves itself in 1 hour).
    • We check for the cached version before creating a new one.
    • We don't cache the TXT record; we rely on systemd-resolved caching since that is designed to respect DNS TTLs. That allows us to quickly determine the user-project slug key that we need for reading from our own cache.
  3. When any of the following happens, we delete our cache entry if it exists:
    • User deleted (or just let it expire)
    • Project deleted (or just let it expire)
    • Project disabled (or just let it expire)
    • User flags change like pico+ or analytics
    • New _redirects or _headers file uploaded (or I guess we could clear the cache on any upload as a user-controlled way of clearing their own cache)
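
A minimal sketch of that cache using the expirable LRU from golang-lru/v2; all type, field, and function names below are placeholders, and the cache size is arbitrary:

```go
package main

import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// cachedSite is a stand-in for the cached AssetHandler plus the routes
// parsed from _redirects and _headers; the fields are made up.
type cachedSite struct {
	Redirects []string // parsed redirect rules (compiled regexes in real code)
	Headers   []string // parsed header rules
	Analytics bool     // feature flag snapshot
}

// Keyed on the user-project slug, with a 1 hour TTL so any caching bug we
// happen to introduce resolves itself within an hour.
var siteCache = expirable.NewLRU[string, *cachedSite](2048, nil, time.Hour)

func getSite(slug string, load func() (*cachedSite, error)) (*cachedSite, error) {
	if site, ok := siteCache.Get(slug); ok {
		return site, nil
	}
	site, err := load() // DB queries + minio reads happen only on a miss
	if err != nil {
		return nil, err
	}
	siteCache.Add(slug, site)
	return site, nil
}

// invalidateSite would be called on project delete/disable, flag changes,
// or when a new _redirects/_headers (or any) file is uploaded.
func invalidateSite(slug string) {
	siteCache.Remove(slug)
}
```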

If we do all that, then we can serve assets with a single locally-cached DNS lookup, a single hash table lookup, and a single GetObject() minio call! 🚀

Thoughts? I can contribute some of this. It's a pretty big change, so I just wanted to run it by you before diving too deep.

neurosnap commented 1 week ago

Hi! Wow, thanks for this in-depth feedback!

So I like all of these changes and can commit to getting this across the finish line.

There are a couple of pieces that will require help on our end:

5KB seems perfectly reasonable to me, so let's go for it.

For caching, what do you think about the ability to wipe the in-memory cache?

Do you have any thoughts on supporting those features?

Finally, let me know what parts you'd like to work on and I can commit to whatever else.

Thanks again, this is awesome!