oursky / pageship

https://oursky.github.io/pageship/
Apache License 2.0

Serve cached compressed content #13

Open kiootic opened 9 months ago

kiootic commented 9 months ago

Serve gzip/brotli compressed content, with a fixed-size in-memory cache.

toshinari123 commented 9 months ago
func NewIMC(size int) (*IMC, error)
func (cache *IMC) Get(id string) (Value, error)
func (cache *IMC) Add(id string, value Value) error

The value is the gzipped content and the key is the static file path. It would also use cache busting, appending the SHA-256 hash of the file to the path. For the cache algorithm, a simple least-recently-used (LRU) cache would do.

toshinari123 commented 9 months ago

The scope of this cache is per-server, because the fixed size of the cache should depend on the machine (if there are multiple apps and the cache is per-app, the effective total size is multiplied by the number of apps). If implementing from scratch, the cache would consist of a list of keys (ordered by access time) and a key-value map. There are also libraries such as https://github.com/hashicorp/golang-lru (simple) and https://github.com/dgraph-io/ristretto (more complex, but seemingly better performance).
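A from-scratch version could look like the minimal LRU sketch below, matching the `NewIMC`/`Get`/`Add` interface above. The `Value` type, the error-on-miss behavior, and bounding by entry count (rather than total bytes) are assumptions for illustration:

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
)

// Value holds compressed file content; the concrete type is an assumption.
type Value []byte

type entry struct {
	key   string
	value Value
}

// IMC is a fixed-size in-memory LRU cache: a list ordered by access time
// plus a key -> list-element map, as described above.
type IMC struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func NewIMC(size int) (*IMC, error) {
	if size <= 0 {
		return nil, errors.New("cache size must be positive")
	}
	return &IMC{capacity: size, order: list.New(), items: map[string]*list.Element{}}, nil
}

func (cache *IMC) Get(id string) (Value, error) {
	el, ok := cache.items[id]
	if !ok {
		return nil, errors.New("not found")
	}
	cache.order.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).value, nil
}

func (cache *IMC) Add(id string, value Value) error {
	if el, ok := cache.items[id]; ok {
		cache.order.MoveToFront(el)
		el.Value.(*entry).value = value
		return nil
	}
	if cache.order.Len() >= cache.capacity { // evict least recently used
		oldest := cache.order.Back()
		cache.order.Remove(oldest)
		delete(cache.items, oldest.Value.(*entry).key)
	}
	cache.items[id] = cache.order.PushFront(&entry{id, value})
	return nil
}

func main() {
	c, _ := NewIMC(2)
	c.Add("a", Value("A"))
	c.Add("b", Value("B"))
	c.Get("a")             // touch "a" so "b" becomes the LRU entry
	c.Add("c", Value("C")) // evicts "b"
	_, err := c.Get("b")
	fmt.Println(err != nil) // true: "b" was evicted
}
```

A real implementation would likely bound total cached bytes rather than entry count, which is part of why a library like ristretto (cost-based eviction) is attractive.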

toshinari123 commented 9 months ago

I think ristretto is a good choice, as it is in-memory and memory-bounded. The cache value type would be bytes.Buffer. The content cache will be added to the Handler struct at line 31 of internal/handler/site/handler.go, and the cache will be created in func NewHandler. Cache access will occur in the if block at line 74 of internal/handler/site/site_handler.go: if the content exists in the cache, lines 86 to 87 will be used to serve the file (the reader is replaced with a bytes.Reader over the cached data); if it does not exist, create a new buffer, use the bytes.Buffer as the writer, and use the lazyReader at lines 79 to 83 as the reader.

kiootic commented 9 months ago

The placement of the cache sounds good to me. I'd like to consider the cache logic in more detail.

toshinari123 commented 9 months ago

*typo: it will be an io.WriteCloser, directly using gzip.Writer or brotli.Writer

kiootic commented 9 months ago

concurrent cache access is handled by ristretto

Ristretto supports concurrent get & set, but how about the initial cache content? Suppose there are 100 requests for the same uncached file, we'd like to compress the file only once, and the 100 requests should wait on the compression to complete before serving the same compressed content.

there will be a check whether the file exceeds the max cache size, and if it does it will just be read from the file every time (it never goes into the cache)

Does that mean it would never be compressed, or compressed from scratch every time the file is requested?

the max cache size config will be added to HandlerConfig

The user should be able to configure it in the usual way (command-line flags or environment variables).

You may want to try setting up the managed-sites mode for a general deployment setup.

toshinari123 commented 9 months ago

What about this: make a map of mutexes with the hash as key; on a cache miss, it will:

  1. lock the corresponding mutex
  2. check if the content exists in the cache
  3. if not, compress it and put it into the cache
  4. unlock the mutex

If there are a lot of requests for the same file, the first request will trigger compression and insertion into the cache; after unlocking, the other requests will see the compressed file in the cache, so they will not need to compress it again.

If the file exceeds the max cache size, it will never be compressed; it is most likely a media file that has already undergone compression as part of its format.

kiootic commented 9 months ago

If the file exceeds the max cache size, it will never be compressed; it is most likely a media file that has already undergone compression as part of its format.

Sounds good to me!

If there are a lot of requests for the same file, the first request will trigger compression and insertion into the cache; after unlocking, the other requests will see the compressed file in the cache, so they will not need to compress it again.

If we have an additional map of mutexes, we need to be careful to ensure each mutex has the same lifetime as the actual value. That is, when the value is evicted from the cache (maybe due to the LRU policy), the mutex is also deleted from the map.

toshinari123 commented 9 months ago

content cache

toshinari123 commented 9 months ago

Should I implement compression and the content cache as middlewares instead?

kiootic commented 9 months ago

That sounds good to me. However, please note that the cache itself and the logic that handles caching should be separated, so the cache middleware would add the compressed content to the cache only when appropriate (e.g. according to the threshold).

toshinari123 commented 9 months ago

Above: done 1, 2, and 4; took 3 hours.

kiootic commented 9 months ago

If the file size exceeds a threshold (roughly 1 MB), we may assume it is an already-compressed asset (e.g. PNG/ZIP). So we just check the raw file size against the threshold before trying to compress and add it to the cache.
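As a sketch of that check, with the ~1 MB cutoff and the names being assumptions for illustration:

```go
package main

import "fmt"

// compressThreshold is an assumed ~1 MB cutoff; files larger than this
// are treated as already-compressed assets (e.g. PNG/ZIP) and served raw.
const compressThreshold = 1 << 20

// shouldCompress checks the raw file size before any compression work,
// so oversized files never enter the compression path or the cache.
func shouldCompress(size int64) bool {
	return size <= compressThreshold
}

func main() {
	fmt.Println(shouldCompress(200_000)) // true: small text asset
	fmt.Println(shouldCompress(5 << 20)) // false: likely media/archive
}
```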

TODO:

  1. Write tests for #15
  2. Refactor into a cache middleware
  3. Add a compression middleware
  4. Test that the middlewares work correctly
  5. Add configuration flags

Expected work order: compress -> cache, so we need to pay attention to the order in which the middlewares are applied.
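One plausible reading of that ordering is that the cache middleware wraps the compression middleware: on a miss the request flows cache -> compress -> site, so the cache stores already-compressed bytes, and on a hit compression never runs. A tiny trace (simplified handler type, illustrative names) makes the wrapping order visible:

```go
package main

import "fmt"

// Handler is a simplified stand-in for http.Handler.
type Handler func() string

// stage wraps a handler and records when it runs, to make the
// middleware application order visible.
func stage(name string, order *[]string, next Handler) Handler {
	return func() string {
		*order = append(*order, name)
		return next()
	}
}

func main() {
	var order []string
	site := func() string { return "content" }

	// Cache is outermost: it sees the request first and can
	// short-circuit before any compression work happens.
	h := stage("cache", &order, stage("compress", &order, site))
	h()
	fmt.Println(order) // [cache compress]
}
```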

kiootic commented 8 months ago
type CacheResponse struct {
    Data []byte
    Headers map[string]string
    // Content-Encoding/Etag/Cache-Control
}

func CacheMiddleware(request, response, next) {
    // 1. Lookup from cache
    // 1.1. Create cache key (file hash, encoding (compression method))
    // ref: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Vary
    // can lookup libraries providing this cache mechanism?
    key := file.Hash
    if request.Header("Accept-Encoding") == "???" {
        key += "???"
    }

    cachedResponse, ok := cache.Lookup(key)
    if ok {
        response.WriteHeaders(cachedResponse.Headers)
        response.Write(cachedResponse.Data)
        return
    }

    next.ServeHTTP(request, response)

    data := response.Data
    headers := response.Headers
    cache.Set(key, CacheResponse{Data: data, Headers: headers})
}
kiootic commented 8 months ago

TODO:

  1. Look up libraries providing this cache mechanism (4)
  2. Write the cache middleware (maybe using a library) (2)
  3. Write a test for the middleware (2)
toshinari123 commented 8 months ago

https://github.com/bxcodec/httpcache : uses transport and roundtripper
https://github.com/gregjones/httpcache/tree/master : uses transport and roundtripper
https://github.com/kataras/httpcache/blob/v0.0.1/service.go (predecessor of the cache in the Iris web framework) : low number of users
https://github.com/victorspringer/http-cache/blob/747df1b7981c68d7218c4e2c6a1a5cf13fd0cccd/cache.go : similar structure, but builds the cache key differently (hashes the URL)

Conclusion: code the cache while referring to the code of the last two caches.