pulibrary / pul_solr

PU Library Solr Configs and Specs
Apache License 2.0

OL Solr indexing design proposal #118

Closed escowles closed 2 years ago

escowles commented 5 years ago
tpendragon commented 5 years ago

Just want to dump some thoughts I've been having this week:

There are two important processes here, each with different moving parts.

  1. Full Reindex
  2. Incremental Update & Delete

Before I get into ideas, I want to see if I understand the process for each of these:

  1. Full Reindex

    • At some point (during syncs for sure, maybe at some other point) a dump file of all bib IDs in the system is created.
    • Twice a month a full dump happens. The cron for this looks like:
      00 05 28 * * /bin/bash -l -c 'cd /opt/marc_liberation/current && RAILS_ENV=production bin/bundle exec bin/rake marc_liberation:bib_dump --silent >> /tmp/cron_log.log 2>&1'
      00 05 13 * * /bin/bash -l -c 'cd /opt/marc_liberation/current && RAILS_ENV=production bin/bundle exec bin/rake marc_liberation:bib_dump --silent >> /tmp/cron_log.log 2>&1'

      The way this works is that the bib ID dump is pulled down, then those bib IDs are sliced up and a dump job is created for each slice (see the sketch after this list). Each bib ID is pulled from Voyager, combined with its holdings, and output as MARC-XML. Those records are concatenated into one file and put on disk. Right now this process takes 12 hours. With a refactor branch that dumps MARC21 instead of XML, it will take 90 minutes.

    • The last full dump is pulled down and processed through Traject to create Solr records. This takes 10.5 hours.
  2. Incremental Update & Delete

    • Three times a day a process runs which dumps the changes since the last time the process ran. It creates two files (updated records and newly created records) and stores a list of IDs to delete. The updated and created records are in MARC-XML and have merged holdings records.
    • These dumps (which are stored in a database, with references to the files and the list of IDs as a property) are then processed by a rake task run from a cron job, which deletes the given IDs and runs Traject over all the updated/created records to index Solr records. This process also runs three times a day, 35 minutes after the dump process.
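
To make the slicing step in the full reindex concrete, here is a minimal sketch of how the bib ID dump might be split into per-slice dump jobs. The names (`DumpFileJob`, `SLICE_SIZE`, the file path) are hypothetical stand-ins, not the actual marc_liberation code.

```ruby
# Hypothetical sketch only: split the bib ID dump into fixed-size slices and
# enqueue a dump job per slice. Class and constant names are illustrative.
SLICE_SIZE = 5_000

def enqueue_dump_jobs(bib_id_dump_path, dump_id)
  bib_ids = File.readlines(bib_id_dump_path, chomp: true)
  bib_ids.each_slice(SLICE_SIZE) do |slice|
    # Each job pulls its bib IDs from Voyager, merges holdings, and writes
    # the resulting records out as part of the dump on disk.
    DumpFileJob.perform_later(dump_id, slice)
  end
end
```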
tpendragon commented 5 years ago

Given the above there are a few problems I see:

  1. Index processes depend on previous dumps having happened, but have no way to know if the previous dump succeeded or if it's even done. We're just relying on the timing of the cron jobs.
  2. Everything's based around writing files to disk and then reading those files back from disk. Disk IO is pretty expensive, but beyond that it means error reporting, logging, etc. are all bound by the batch size we choose for those files on disk.
  3. Incremental Update requires that you know the state of the database at a previous point in time, so that you can see which records have been removed and which holdings have been deleted. If the record was deleted (or suppressed), you need to delete it from the index. If the holding was deleted, you need to update that record.
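
For problem 3, the record-level bookkeeping is essentially a set difference between the previous and current bib ID lists. A minimal illustration, assuming we have both lists on hand (the file names are made up):

```ruby
# Illustrative only: diff the bib IDs known at the previous run against the
# bib IDs visible now to work out deletions and new records.
previous_ids = File.readlines("bib_ids_previous.txt", chomp: true)
current_ids  = File.readlines("bib_ids_current.txt", chomp: true)

deleted_or_suppressed = previous_ids - current_ids  # delete these from Solr
newly_created         = current_ids - previous_ids  # index these

# Updated records (including deleted holdings on a surviving bib) don't show
# up in an ID diff; catching those still requires change timestamps or a
# snapshot of the holdings state from the previous run.
```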
tpendragon commented 5 years ago

The easier process to refactor here is probably the full reindex, and then we can look at how it can connect with the incremental update scenario.

The goals:

  1. Stop writing files ourselves. It's expensive, the formats are big, and it depends on network-mounted storage being available and reasonably fast. Right now we just dump the files to bibdata-worker2, which means its disk IO is shared with every other process running there; when those run in parallel, that introduces a bottleneck.
  2. Do as much in parallel as possible.
  3. Stream processes to one another.

Revised Full Index Proposal:

  1. Run a query to get every bib ID, then for each bib ID build the combined bib + holdings record in MARC21 format and shove it into a message queue, one message per record. There are a couple of options for the queue, which I'll talk about below.
  2. Run N message consumers which pull down these messages, run the MARC21 through Traject (or elixindexer - it doesn't matter), and add the resulting Solr documents to Solr (see the sketch below).

This is largely similar to the existing process. The major difference is that a message buffer sits in the middle and orchestrates an arbitrary number of consumers as a way of distributing the indexing. With a 1-1 ratio between "record" and "message", we can tweak the settings of the message queue and the consumers to control how fast or slow indexing happens. Further, if something goes wrong, we can attach the diagnostics and retry functionality we're used to.
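
Here's a minimal sketch of step 2, using the libraries we already run (ruby-marc, Traject, RSolr) as I understand their APIs. `queue` is a stand-in for whichever transport we pick from the options below, and the Solr URL and Traject config path are placeholders:

```ruby
require "marc"
require "rsolr"
require "traject"

# One consumer in the proposed pipeline: pull MARC21 payloads off the queue,
# map each record through Traject, and add the resulting document to Solr.
def run_consumer(queue)
  solr    = RSolr.connect(url: "http://lib-solr.example.com:8983/solr/catalog")
  indexer = Traject::Indexer.new
  indexer.load_config_file("marc_to_solr/lib/traject_config.rb")

  queue.each_message do |payload|          # payload: one bib's MARC21 bytes
    record = MARC::Reader.decode(payload)  # combined bib + holdings record
    solr.add(indexer.map_record(record))   # map_record returns a field hash
  end
  solr.commit
end
```

Because each record is its own message, "go faster" just means "run more copies of this consumer," and a failed record is a single message that can be retried or inspected on its own.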

Message Queue Options:

  1. RabbitMQ: We have some experience here already. RabbitMQ is known to handle 20k messages/second on a single node, and scales up as you add nodes to the cluster. We've built message ack systems and retries with it in the past, and we know how to monitor it in DataDog. The problem is that RabbitMQ seems to be built around handling each message sequentially. In another language that would be fine: pull messages off RabbitMQ, ack them, and hold them in a buffer until you have a batch you want to send to Solr (although you might lose records this way). In Ruby we can't do that, so we'd have to benchmark how slow adding one record at a time is (it looks quite a bit slower from some brief googling, but without commits I can't be sure). A Bunny-based sketch follows after this list.

  2. Apache Kafka: Kafka is more like a message bus than a queue. You put a set of messages into it and they *stay there*. Consumers pick a point along the timeline of all messages and start running through messages from that point. The benefit is that processing N messages at once is built in and expected by the libraries: the consumer hands you a start and an end and you grab the whole range. It also gives you the ability to roll back to a specific point in the timeline and try again, since the entire history stays there. Further, consumers have multiple ways to control how many messages they get: you can define both "stop giving me messages after I have N" and "stop buffering messages after X amount of time," which is a lot of configurability. A single server is expected to handle 100k messages/second. The downside is that we've never used Kafka before, and "retries" are a different concept here: you either queue a duplicate of the previous request or back up the timeline and start again. You need ZooKeeper just to run Kafka, and while the libraries have built-in hooks to DataDog, it's another technology we haven't tried before. A ruby-kafka sketch follows after this list.

  3. Postgres: I haven't looked very far into this. If we wanted to do it, we'd have to find a way to read through messages that ignores table locks, and handle things like keeping state in memory ourselves. We should probably just avoid it.
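
For comparison, here's roughly what a consumer looks like with each of the first two options, reusing `solr` and `indexer` from the sketch above. Queue/topic names, hosts, and batch sizes are made up, and neither snippet has been tested.

Option 1, RabbitMQ via Bunny, handling one message per delivery (the sequential pattern described above):

```ruby
require "bunny"

conn = Bunny.new(ENV.fetch("RABBITMQ_URL", "amqp://localhost"))
conn.start
channel = conn.create_channel
channel.prefetch(100)                                # cap unacked messages per consumer
queue = channel.queue("bib_records", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, payload|
  solr.add(indexer.map_record(MARC::Reader.decode(payload)))
  channel.ack(delivery_info.delivery_tag)            # ack only after the Solr add succeeds
end
```

Option 2, Kafka via the ruby-kafka gem, where the batch-at-a-time shape is the default:

```ruby
require "kafka"

kafka    = Kafka.new(["kafka1.example.com:9092"], client_id: "bibdata-indexer")
consumer = kafka.consumer(group_id: "solr-full-reindex")
consumer.subscribe("bib_records")

consumer.each_batch do |batch|
  docs = batch.messages.map { |m| indexer.map_record(MARC::Reader.decode(m.value)) }
  solr.add(docs)                   # one Solr add per batch instead of per record
end                                # offsets are committed after the block by default
```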

hackartisan commented 2 years ago

A lot of this has been rearchitected with the Alma migration. Closing as obsolete.