oss-aspen / 8Knot

Dash app in development to serve open source community visualizations using GitHub data from Augur. Hosted app: https://eightknot.osci.io
MIT License

Upgrade query cache-aside data store #483

Closed JamesKunstle closed 10 months ago

JamesKunstle commented 11 months ago

Current problem

We currently store data queried from Augur as compressed blobs in Redis, one per (function_name, repo_git) pair.

This is suboptimal because:

  1. We're caching large records in Redis.

  2. The data is compressed, so appending new rows (in a highly concurrent, green-thread regime) requires a full decompress, append, recompress, write cycle.

  3. We're generally trying to make Redis work as a cache for tables of data, which it isn't optimized for, so the implementation is spaghetti-ish.
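To make the cost in point 2 concrete, here's a minimal sketch of the read-modify-write cycle. The key format and function name are hypothetical, and a plain dict stands in for the Redis client so the sketch is self-contained:

```python
import pickle
import zlib

# In-memory stand-in for the Redis client; the real code talks to Redis.
cache: dict[str, bytes] = {}

def append_rows(key: str, new_rows: list[dict]) -> None:
    """Appending to a compressed blob means rewriting the whole value:
    decompress -> append -> recompress -> write. Under many concurrent
    green threads, every small append pays for the entire blob."""
    blob = cache.get(key)
    rows = pickle.loads(zlib.decompress(blob)) if blob else []
    rows.extend(new_rows)                           # append
    cache[key] = zlib.compress(pickle.dumps(rows))  # recompress + write back

# One key per (function_name, repo_git) pair, as in the current scheme.
key = "commits_query:https://github.com/oss-aspen/8Knot"
append_rows(key, [{"hash": "abc123", "author": "a@example.com"}])
append_rows(key, [{"hash": "def456", "author": "b@example.com"}])
print(len(pickle.loads(zlib.decompress(cache[key]))))  # 2
```

Note that the second `append_rows` call had to decompress and recompress the first row just to add one more, which is exactly the overhead described above.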

Solution

There are a lot of great databases that could be better choices for this particular problem than Redis.

PostgreSQL

The same DBMS we're using for our primary, off-hardware DB. Probably overkill because of the configuration and management challenges that come with such a (as I perceive it) sophisticated piece of software. Nonetheless, it could have some Postgres<->Postgres features that would be convenient.

DuckDB

An in-process OLAP database that's pretty hot right now. It's a relational DB, but we could mostly ignore that and just store records in unrelated tables. It might be a very good choice because of the power of its query engine: if it can execute analysis queries quickly, we could offload some of the analysis we currently do in Pandas to a dedicated query/aggregation engine.

Downside of DuckDB is that it's purely in-process, so there's no server to communicate with over HTTP. That means that for a multi-container app such as ours, we'd have to either write an HTTP wrapper around DuckDB (making it less effective as an in-process DB) or use PVCs to have the database file available to all microservice processes.

It also doesn't support multi-read/single-write access natively, so we'd probably have to implement our own access-control logic at the application level, which is a strong case against DuckDB.

SQLite

Mostly a repeat of DuckDB, but optimized for OLTP rather than OLAP. Could still be a reasonable choice if concurrent read/write control were available.
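For a sense of what the row-oriented alternative looks like, here's a sketch using Python's stdlib `sqlite3`. The schema and key columns are hypothetical; the point is that appends become plain INSERTs with no recompression, and reads can pull just the slice a visualization needs:

```python
import sqlite3

# Hypothetical cache schema: one row per record instead of one compressed
# blob per (function_name, repo_git) pair.
con = sqlite3.connect(":memory:")  # a real deployment would share a DB file, e.g. via a PVC
con.execute(
    """CREATE TABLE cache (
           func TEXT, repo_git TEXT, commit_hash TEXT, author TEXT
       )"""
)
con.execute("CREATE INDEX idx_func_repo ON cache (func, repo_git)")

# Appending new rows is a plain INSERT -- no decompress/recompress cycle.
rows = [
    ("commits_query", "https://github.com/oss-aspen/8Knot", "abc123", "a@example.com"),
    ("commits_query", "https://github.com/oss-aspen/8Knot", "def456", "b@example.com"),
]
con.executemany("INSERT INTO cache VALUES (?, ?, ?, ?)", rows)

# Reads pull only the slice a given visualization needs.
n = con.execute(
    "SELECT COUNT(*) FROM cache WHERE func = ? AND repo_git = ?",
    ("commits_query", "https://github.com/oss-aspen/8Knot"),
).fetchone()[0]
print(n)  # 2
```

The concurrency caveat above still applies: SQLite serializes writers, so this only works for us if that level of read/write control is acceptable.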

Apache Pinot / Apache Druid

Client/server-style OLAP databases. They have high minimum resource requirements, so even though they solve DuckDB's communication problem and offer the OLAP engine that would be nice, I think they're probably too hardcore for us.

Apache Ignite

Don't know too much about this one, but it's on the radar.

MongoDB

This is probably the best candidate. It's NoSQL, so we don't have to define tables up front, but it's record-oriented, so we can create 'tables' or 'collections' specific to a given query and request specific row IDs out of them keyed by GitHub repo.

It's also a favorite among web devs and is client/server over the network, so it would probably play nicely with microservices and concurrency.

Redis, but use Sets and Hashes

Instead of compressing the repo data into blobs, simply write all of the rows into a Set in Redis and create an index on the repo as a Hash for O(1) access.

This might be SUPER nice because we could get TTL cache invalidation for free, but I don't think it's as fast as something row-oriented. It would also be purely memory-resident, which is more valuable real estate than the disk+memory caching that something like MongoDB does.
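One way to realize this layout is a Hash per (function_name, repo_git) pair, with each field holding one row, so appends touch only the new row and `EXPIRE` gives the TTL invalidation for free. The key format is hypothetical, and a minimal in-memory stand-in replaces the redis-py client so the sketch runs without a server; real code would issue the same `HSET`/`HGETALL`/`EXPIRE` commands against Redis:

```python
import json
import time

class FakeRedis:
    """Minimal in-memory stand-in for the redis-py client, kept only so
    this sketch is self-contained; it mimics HSET/HGETALL/EXPIRE."""
    def __init__(self):
        self._hashes, self._ttls = {}, {}
    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = value
    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))
    def expire(self, name, seconds):
        self._ttls[name] = time.time() + seconds  # real Redis evicts for us

r = FakeRedis()

# One Hash per (function_name, repo_git) pair; each field is a row ID,
# each value one row, so appending never rewrites existing rows.
name = "commits_query:https://github.com/oss-aspen/8Knot"
r.hset(name, "abc123", json.dumps({"author": "a@example.com"}))
r.hset(name, "def456", json.dumps({"author": "b@example.com"}))
r.expire(name, 3600)  # TTL-based cache invalidation "for free"

print(len(r.hgetall(name)))  # 2
```

The trade-off noted above remains: everything here lives in Redis memory, whereas a disk-backed store pays for RAM only on hot data.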

JamesKunstle commented 10 months ago

Decision was to use Postgres.