oss-aspen / 8Knot

Dash app in development to serve open source community visualizations using GitHub data from Augur. Hosted app: https://eightknot.osci.io
MIT License

switch caching to use Postgres from Redis #543

Closed · JamesKunstle closed this 8 months ago

JamesKunstle commented 9 months ago

- switches all queries over to paginate their results and store them in Postgres tables (sketched below)

- implements cache_facade helper functions

- adds a db_init.py startup script that runs once when the compose boots

- adds a postgres container to the docker compose

- moves assignment-query preprocessing out of the query function and into 8Knot/pages/utils/preprocessing_utils.py
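As a rough sketch of the paginate-and-store flow described in the first item, assuming psycopg2 as the client; the `commits` / `commits_cache` tables, columns, and DSN parameters are hypothetical, not the PR's actual identifiers:

```python
# Hypothetical sketch: stream a large Augur result set page by page and
# write each page into a local Postgres cache table.
import psycopg2
import psycopg2.extras

PAGE_SIZE = 1_000

def cache_query_results(augur_dsn, cache_dsn, repo_ids):
    with psycopg2.connect(augur_dsn) as augur_conn, psycopg2.connect(cache_dsn) as cache_conn:
        # A server-side (named) cursor keeps the full result set out of app memory.
        with augur_conn.cursor(name="paged_fetch") as src, cache_conn.cursor() as dst:
            src.execute(
                "SELECT repo_id, cntrb_id, created FROM commits WHERE repo_id = ANY(%s)",
                (repo_ids,),
            )
            while True:
                page = src.fetchmany(PAGE_SIZE)
                if not page:
                    break
                psycopg2.extras.execute_values(
                    dst,
                    "INSERT INTO commits_cache (repo_id, cntrb_id, created) VALUES %s",
                    page,
                )
        cache_conn.commit()
```

The point of the pattern is that a query worker pays one round trip per page instead of ever holding the whole result set in memory.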

JamesKunstle commented 9 months ago

Still need to: (1) add cntrb truncation to the queries, (2) make sure all timestamps are < now(), (3) update the documentation for Postgres usage, and (4) rebase recent changes into the Postgres regime.

Creating this PR for process visibility early on.

JamesKunstle commented 9 months ago

@cdolfi This is obviously a gigantic PR, so here's a walkthrough of the order I think would be most convenient for reviewing everything. I'm listing the big changes so you can check them out in order before looking at the mostly-the-same changes to the query / analysis worker code.

  1. We're adding a "postgresql" container to the compose. This updates 'docker-compose.yaml'. We also add a 'postgres-cache' volume that gets mounted to the "postgresql" container so it has dedicated disk space to store data.
  2. To set up the Postgres DB for our application, I wrote a simple script, '8knot/cache-manager/db_init.py', that runs when the compose is booted (a minimal sketch follows this list). It tries to create a database called "augur_cache" in Postgres and then create the tables the cache needs. If everything goes well it exits with a success code (0) and doesn't restart; otherwise (exit code 1) it restarts.
  3. Postgres is configured in 'postgres.conf'. Write-ahead logging is turned off, as are most data-durability settings, because we can regenerate the data from Augur without any problems. We're optimizing for read/write speed here.
  4. All database interactions the application has to do (query and analysis workers) are handled by the facade written in '8Knot/cache_manager/cache_facade.py'.
  5. All queries are executed by logic in the cache_facade; only the query SQL has to live in the query functions. The '%s' placeholder is where the repo lists and any other arguments are passed in (see the query sketch after this list).
  6. Because we're downloading data from Postgres and immediately storing it in another Postgres instance, all query preprocessing moves directly into the query SQL, such as shortening IDs to the first 15 characters and checking that dates aren't in the future.
  7. All background callbacks also use the cache_facade logic. If data isn't available, they wait for it to become available by repeatedly checking the "cache_bookkeeping" table, which has rows of (cache_function, repo_id, cached_at_timestamp); see the polling sketch after this list.
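For orientation while reviewing item 2, here's roughly the shape a db_init.py-style script might take. This is a hedged sketch, not the PR's code: the host, credentials, and table DDL are placeholders, with the bookkeeping columns taken from item 7 above.

```python
# Hypothetical sketch of a db_init.py-style startup script: create the cache
# database and its tables, then exit 0 on success so Compose doesn't restart us.
import sys
import psycopg2

def main() -> int:
    try:
        # CREATE DATABASE can't run inside a transaction block, so use autocommit.
        conn = psycopg2.connect(host="postgres-cache", user="postgres",
                                password="password", dbname="postgres")
        conn.autocommit = True
        with conn.cursor() as cur:
            try:
                cur.execute("CREATE DATABASE augur_cache")
            except psycopg2.errors.DuplicateDatabase:
                pass  # already created on a previous boot
        conn.close()

        conn = psycopg2.connect(host="postgres-cache", user="postgres",
                                password="password", dbname="augur_cache")
        with conn, conn.cursor() as cur:
            cur.execute(
                """CREATE TABLE IF NOT EXISTS cache_bookkeeping (
                       cache_function TEXT,
                       repo_id BIGINT,
                       cached_at_timestamp TIMESTAMP DEFAULT now()
                   )"""
            )
        conn.close()
        return 0  # success: the one-shot container stays down
    except psycopg2.OperationalError:
        return 1  # failure: Compose's restart policy retries us

if __name__ == "__main__":
    sys.exit(main())
```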
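Items 5 and 6 combined might look something like this sketch; the query text, column names, and helper function are illustrative assumptions, not the facade's actual API:

```python
# Hypothetical sketch: the query function owns only its SQL; the facade
# binds the repo list through the %s placeholder and executes it.
import psycopg2

COMMITS_QUERY = """
    SELECT repo_id,
           LEFT(cntrb_id::text, 15) AS cntrb_id,  -- preprocessing in-query: truncate IDs
           created
    FROM commits
    WHERE repo_id IN %s
      AND created <= now()                        -- preprocessing in-query: no future dates
"""

def execute_query(dsn, sql, repo_ids):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # psycopg2 expands a Python tuple into a parenthesized SQL list for IN %s.
        cur.execute(sql, (tuple(repo_ids),))
        return cur.fetchall()

# e.g. rows = execute_query(dsn, COMMITS_QUERY, [1, 2, 3])
```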
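And item 7's "wait until cached" behavior, as a polling sketch; the column names follow the tuple above, while the polling interval and function name are assumptions:

```python
# Hypothetical sketch: a background callback polls cache_bookkeeping until
# every repo it needs has been cached by a query worker.
import time
import psycopg2

def wait_for_cache(cache_dsn, cache_function, repo_ids, poll_seconds=1.0):
    needed = set(repo_ids)
    with psycopg2.connect(cache_dsn) as conn, conn.cursor() as cur:
        while True:
            cur.execute(
                "SELECT repo_id FROM cache_bookkeeping"
                " WHERE cache_function = %s AND repo_id = ANY(%s)",
                (cache_function, list(needed)),
            )
            cached = {row[0] for row in cur.fetchall()}
            if needed <= cached:  # every required repo is present
                return
            time.sleep(poll_seconds)
```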

There are still things to work on after this, like:

I think it should also be looked at by @codekow and @jberkus

JamesKunstle commented 8 months ago

I'll probably go through this a couple more times throughout the week to look at things with fresh eyes. My POV for my first round(s) of review is the "data scientist" / "visualization maker" persona: someone for whom the caching system stays compartmentalized, who doesn't need to understand it beyond how to get their query data and write a new query if necessary. I'm going to purposely look only at the queries and visualizations for now; once we get that side to a place we like, I'll review the denser caching code.

> On the documentation side: we might want to merge this into a different branch on 8knot and get all of the documentation up to date before going to dev.

> Also, how is the speed of the edited queries? Many of the choices I made between doing things in pandas vs. SQL were for efficiency.

(1) Yeah this PR is targeting a new branch. We'll want to get everything from the last couple of weeks nicely merged into this branch before we swap it out into dev.

(2) The edited queries aren't noticeably slower with the additional filtering and transformation steps. They're applied "at projection time" rather than "at query time," in DBMS terms, so they're relatively fast. This new regime sorta requires that we do things in SQL rather than pandas when we're grabbing the data from Augur, because we're no longer using DataFrames as an intermediate data store; we're transferring records directly from Postgres into Postgres.

JamesKunstle commented 8 months ago

@cdolfi I responded to the comments you made; I'll make the requested changes later.

JamesKunstle commented 8 months ago

@sgoggins could you look at how I configure the Postgres database in postgres.conf? I'm configuring based on the container being scheduled with 4GB of memory.
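For reviewers without the diff open, the kinds of settings in question look roughly like this. The values are illustrative for a ~4GB, throughput-over-durability cache, not necessarily the PR's exact choices:

```
# Illustrative postgres.conf excerpts (assumed values, not the PR's)
shared_buffers = 1GB            # ~25% of container memory
fsync = off                     # durability off: data is regenerable from Augur
synchronous_commit = off
full_page_writes = off
wal_level = minimal
max_wal_senders = 0             # required when wal_level = minimal
```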

mscherer commented 8 months ago

The PR kinda shows there is huge potential to refactor around the methods that use caching (example: create_top_k_cntrbs_graph). They all have the same structure minus a one-line change.

JamesKunstle commented 8 months ago

> The PR kinda shows there is huge potential to refactor around the methods that use caching (example: create_top_k_cntrbs_graph). They all have the same structure minus a one-line change.

Yeah, there are, most certainly. We made an early design choice to optimize for functional clarity at the visualization level, which has led to a lot of repeating ourselves at the scale we're operating at now. I have some notes on how we can refine this in the future.

JamesKunstle commented 8 months ago

@cdolfi I think I addressed all of your review comments

jberkus commented 8 months ago

So, I feel like I'm missing some big picture information here.

  1. Why drop Redis, which is generally used as an effective caching database? What's the goal?
  2. Why not use materialized views?
  3. Why put the cached data in a different DB on the same server? Or is that just a prelude to moving it to a db on a separate server from the main server?

Thanks, with that framing I'll be able to supply more useful advice.

JamesKunstle commented 8 months ago

@jberkus

> So, I feel like I'm missing some big picture information here.
>
>   1. Why drop Redis, which is generally used as an effective caching database? What's the goal?
>   2. Why not use materialized views?
>   3. Why put the cached data in a different DB on the same server? Or is that just a prelude to moving it to a db on a separate server from the main server?
>
> Thanks, with that framing I'll be able to supply more useful advice.

Very good questions.

  1. Redis is a great caching database; we were just using it in a way that wasn't going to scale well and that required a lot more memory than we had. Our data is bigger than memory, so Redis no longer works for us. Our data is also table-like when cached, so it made little sense to force Redis or another k/v DB to act like a row-oriented database (mget-ing loads of values by key) when something relational is already very good at that and supports retrieval optimizations like indexing. I looked into a variety of other databases as well and concluded that, of the available options (Cockroach, Maria, Mongo, etc.), Postgres was still probably the best bet. If there were an even simpler SQL DB on the market that was as well-vetted as pg, I would have considered it, but Postgres fit our needs well. (I did try SQLite and DuckDB, and they were cool, but insufficient on OCP.)

  2. We can, and do, use materialized views as much as possible. Caching is a matter of data locality: the main Augur database that we use is in Columbia, Missouri, and I believe our OCP boxes are in US West. In testing, it's orders of magnitude faster to query the data once from Augur, store it in a DB that lives on the same OCP boxes as our app server, and serve it from there afterwards.

  3. Answered in (2).

I hope that answers your questions well.

My ideal roadmap involves having the main database and the application servers live on the same hardware so we can gut this data caching layer and focus on service-time optimizations. We're not there yet, so this is a medium-term fix that gives us a lot more time to focus on analysis rather than data engineering.

JamesKunstle commented 8 months ago

@cdolfi @mscherer @sgoggins

Once all reviewers are satisfied, we can merge this PR into the temp branch that it's pointed at and start the process of rebasing the changes that have been added to the 'dev' branch while I've been working on this feature.

After that, I'll add the EnterpriseDB image into the OpenShift definition of the app and write the cache-invalidation cronjob that, once every 12 hours, cleans out cache entries that are 48 hours or older.
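That invalidation job could be as small as this sketch, reusing the hypothetical bookkeeping schema sketched earlier (names and scheduling are assumptions; the cron entry itself would live in the OpenShift CronJob definition):

```python
# Hypothetical sketch: cron-scheduled cleanup that evicts bookkeeping entries
# 48 hours old or older; intended to be scheduled every 12 hours.
import psycopg2

def evict_stale_entries(cache_dsn):
    with psycopg2.connect(cache_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM cache_bookkeeping"
            " WHERE cached_at_timestamp < now() - interval '48 hours'"
        )
        # The matching rows in each per-query cache table would be deleted
        # here as well, so data and bookkeeping stay in sync.
```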

mscherer commented 8 months ago

Seems good to me. I was about to ask about cache invalidation, but since a cronjob is planned, that should be enough.