Closed: escowles closed this issue 2 years ago
Just want to dump some thoughts I've been having this week:
There are two different processes that are important, but have different moving parts.
Before I get into ideas, I want to see if I understand the process for each of these:
Full Reindex
00 05 28 * * /bin/bash -l -c 'cd /opt/marc_liberation/current && RAILS_ENV=production bin/bundle exec bin/rake marc_liberation:bib_dump --silent >> /tmp/cron_log.log 2>&1'
00 05 13 * * /bin/bash -l -c 'cd /opt/marc_liberation/current && RAILS_ENV=production bin/bundle exec bin/rake marc_liberation:bib_dump --silent >> /tmp/cron_log.log 2>&1'
The way this works: the bib ID dump is pulled down, those bib IDs are sliced up, and a dump job is created for each slice. Each bib ID is pulled from Voyager, combined with its holdings, and output as MARC-XML. Those records are concatenated into one file and put on disk. Right now this process takes 12 hours. With a refactor branch that dumps MARC21 instead of XML, it will take about 90 minutes.
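The slicing step above can be sketched in a few lines. This is a minimal illustration, not the actual marc_liberation code; `SLICE_SIZE` and the job hash are hypothetical names:

```ruby
# Hypothetical sketch of "slice the bib IDs and create a dump job per slice."
# The slice size and job structure are illustrative, not the real code.
SLICE_SIZE = 50_000

def build_dump_jobs(bib_ids, slice_size: SLICE_SIZE)
  bib_ids.each_slice(slice_size).map.with_index do |slice, i|
    { job_number: i, first_id: slice.first, last_id: slice.last, count: slice.size }
  end
end

# 120k bib IDs at 50k per slice -> three jobs (50k, 50k, 20k).
jobs = build_dump_jobs((1..120_000).to_a)
```

Each job then independently pulls its slice of records from Voyager and writes out its chunk of the dump.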
Incremental Update & Delete
Given the above there are a few problems I see:
The easier process to refactor here is probably the full reindex, and then we can look at how it can connect with the incremental update scenario.
The goals:
Revised Full Index Proposal:
This is largely similar to the existing process. The major difference is that a message buffer sits in the middle to orchestrate an arbitrary number of consumers, as a way of distributing the indexing. If there's a 1:1 ratio between "record" and "message," we can tweak the settings of the message queue and consumers to control how fast or slow indexing happens. Further, if something goes wrong, we can attach the diagnostics and retry functionality we're used to.
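With a 1:1 record-to-message ratio, each bib record becomes one small, self-describing message on the queue. A sketch of what such a payload could look like (the field names and `action` values here are assumptions, not a settled schema):

```ruby
require "json"

# Hypothetical per-record message payload for the 1:1 record/message design.
# One bib record -> one message; consumers index each independently.
def record_message(bib_id, marc_record)
  JSON.generate(
    "action" => "index",     # an incremental pipeline could also send "delete"
    "bib_id" => bib_id,
    "record" => marc_record  # e.g. the MARC21 blob (base64-encoded in practice)
  )
end

msg    = record_message(12_345, "<marc>...</marc>")
parsed = JSON.parse(msg)
```

Because every message is independent, throughput becomes a function of queue and consumer settings rather than of one monolithic dump job.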
Message Queue Options:
RabbitMQ: We have some experience here already. RabbitMQ is known to handle 20k messages/second on a single node, and scales up as you add more nodes to the cluster. We've built message-ack systems and retries in the past, and know how to monitor these in DataDog. The problem is that RabbitMQ seems to be built with the expectation that you handle each message sequentially. In another language that'd be fine: pull a message off RabbitMQ, ack it, and put it in a buffer until you have a batch you want to send to Solr (although you might lose records this way, since they're acked before Solr has them). In Ruby we can't do that, so we'd have to benchmark how slow adding one record at a time is (some brief googling suggests it's quite a bit slower, but without benchmarking it ourselves I can't be sure).
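The "buffer until you have a bunch to send to Solr" idea can be shown without a live broker. This is a pure-Ruby sketch under assumed names (`SolrBuffer`, and a block standing in for an RSolr client); in a real consumer the queue library would drive `add`, and note this is exactly the pattern that risks losing records if messages are acked before a flush succeeds:

```ruby
# Sketch of a batching buffer between a message consumer and Solr.
# The sender block is a stand-in for a real Solr client call.
class SolrBuffer
  def initialize(batch_size:, &sender)
    @batch_size = batch_size
    @buffer = []
    @sender = sender
  end

  # Called once per consumed message; flushes when the batch is full.
  def add(doc)
    @buffer << doc
    flush if @buffer.size >= @batch_size
  end

  # Send whatever is buffered, e.g. on shutdown or on a timer.
  def flush
    return if @buffer.empty?
    @sender.call(@buffer)
    @buffer = []
  end
end

batches = []
buffer = SolrBuffer.new(batch_size: 3) { |docs| batches << docs }
5.times { |i| buffer.add("id" => i) }
buffer.flush # drain the 2-doc remainder
```

Here five docs produce two Solr calls (a full batch of 3, then the remainder of 2) instead of five single-doc adds.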
Apache Kafka: Kafka is more like a message bus than a queue. You put a set of messages into it and they *stay there*. Consumers pick a point along the timeline of all messages and start running through messages from that point. The benefit is that processing N messages at once is built in and expected by the libraries: the consumer is handed a batch and works through it. It also gives you the ability to roll back to a specific point in the timeline and try again, since the entire history stays there. Further, consumers have multiple ways to control how many messages they get: you can define both "stop giving me messages after I have N" and "stop buffering messages after X amount of time," which is a lot of configurability. A single server is expected to handle 100k messages/second. The downside is we've never used Kafka before, and "retries" are a different concept here: you either queue a duplicate of the previous message, or back up the timeline and start again. You also need ZooKeeper just to run Kafka, and while the libraries have built-in hooks for DataDog, it's another technology we haven't tried before.
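The timeline-and-offset semantics can be modeled in a few lines of plain Ruby. This toy log is only a mental model of Kafka's consumption behavior, not a real client:

```ruby
# Toy in-memory model of Kafka-style consumption: messages stay in the log,
# a consumer tracks an offset, and "retry" means seeking back to an offset.
class ToyLog
  def initialize
    @messages = []
  end

  def append(msg)
    @messages << msg
  end

  # Hand the consumer a batch starting at `offset`, up to `max` messages --
  # the rough analogue of a consumer poll.
  def poll(offset, max:)
    @messages[offset, max] || []
  end
end

log = ToyLog.new
%w[a b c d e].each { |m| log.append(m) }

first  = log.poll(0, max: 3) # a batch of 3 from the start
second = log.poll(3, max: 3) # the consumer advances its own offset
replay = log.poll(0, max: 3) # seek back: history is still there
```

Contrast with RabbitMQ, where an acked message is gone; here "back up the timeline and try again" is just polling from an earlier offset.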
Postgres: I haven't looked very far into this. If we wanted to do it, we'd have to find a way to read through messages that sidesteps table locks, and handle things like keeping state in memory ourselves. We should probably just avoid it.
A lot of this has been rearchitected with the Alma migration. Closing as obsolete.