mysociety / whatdotheyknow-theme

The Alaveteli theme for WhatDoTheyKnow (UK)
http://www.whatdotheyknow.com/
MIT License

Reindex InfoRequestEvents with new request_public_body_tag term #344

Open garethrees opened 7 years ago

garethrees commented 7 years ago

~1,912,247 events total.

Normal operation indexes 50–100 events every 5 minutes (from the acts_as_xapian_jobs table) and takes < 30 seconds to index them all, so ~300 could probably be indexed safely in each 5-minute slot?

= 3,600 indexed per hour = ~531.18 hours for all = ~22.13 days
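The estimate above can be checked with a few lines of Ruby (figures from this comment; it assumes a steady 300 events per 5-minute slot):

```ruby
total_events    = 1_912_247             # approximate InfoRequestEvent count
events_per_slot = 300                   # safely indexed per 5-minute cron slot
events_per_hour = events_per_slot * 12  # 12 five-minute slots per hour => 3,600

hours_for_all = total_events / events_per_hour.to_f
days_for_all  = hours_for_all / 24

puts format('%.2f hours (%.2f days)', hours_for_all, days_for_all)
```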

Idea is that we drip-feed info request events to the jobs table so that it gradually reindexes the old events with the new term.

garethrees commented 7 years ago

This is the gist of what I think we want:

```ruby
InfoRequestEvent.find_in_batches(batch_size: 300) do |events|
  events.each(&:xapian_mark_needs_index)
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end
```

Need to consider:

lizconlan commented 7 years ago

to pick up where we left off (or at least close to it), something like...

```ruby
# Use a bound parameter rather than string interpolation, and coerce the
# env var (a string) to an integer.
start_id = ENV.fetch("START_ID", 0).to_i
InfoRequestEvent.where("id > ?", start_id).find_in_batches(batch_size: 300) do |events|
  events.each(&:xapian_mark_needs_index)
  logger.info("last event indexed: #{events.last.id}")
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end
```

So if the last logged success is 299, set START_ID to 299 to kick the next batch off at id 300.
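To avoid reading the log by eye, the resume point could be parsed out of it. A sketch, assuming the log line format from the snippet above (`last_indexed_id` is a hypothetical helper, not existing code):

```ruby
# Hypothetical helper: extract the last successfully queued event id from
# log lines like "last event indexed: 299"; 0 means start from the beginning.
def last_indexed_id(log_lines)
  ids = log_lines.filter_map { |line| line[/last event indexed: (\d+)/, 1]&.to_i }
  ids.last || 0
end

log = [
  'last event indexed: 299',
  'some unrelated line',
]

# An explicit START_ID still wins over the value recovered from the log.
start_id = ENV.fetch('START_ID', last_indexed_id(log)).to_i
```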

garethrees commented 7 years ago

Looks good. I think next thing to do is try this with an initial handful of batches to check that we can effectively process the jobs in the 5 minute window.

garethrees commented 7 years ago

Just a reminder - we need to update wdtk before we do this for real.

lizconlan commented 7 years ago

Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

lizconlan commented 7 years ago

And is using logger to write to the existing log sufficient?

garethrees commented 7 years ago

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

> And is using logger to write to the existing log sufficient?

I was wondering about this. My first thought was to create a separate logger just for clarity, but we could just add a prefix to log messages generated by this for easy grepping. I have no real preference – whatever you think will make it easier to check every day.

Will also want to make sure exceptions are logged.
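One way to get both the greppable prefix and the exception logging might look like this (a sketch: the `[reindex]` prefix, the `reindex_batch` helper, and the stand-in event class are all assumptions, not existing code):

```ruby
require 'logger'
require 'stringio'

log_output = StringIO.new
logger = Logger.new(log_output)

# Prefix every message from this task so it can be grepped out of the shared
# log, and record any exception before re-raising so failures are visible.
def reindex_batch(events, logger)
  events.each(&:xapian_mark_needs_index)
  logger.info("[reindex] last event indexed: #{events.last.id}")
rescue StandardError => e
  logger.error("[reindex] failed around event #{events.last && events.last.id}: #{e.class}: #{e.message}")
  raise
end

# Stand-in for InfoRequestEvent, just for this sketch.
FakeEvent = Struct.new(:id) do
  def xapian_mark_needs_index; end
end

reindex_batch([FakeEvent.new(1), FakeEvent.new(2)], logger)
```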

lizconlan commented 7 years ago

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

> I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

There doesn't seem to be anything in the deploy script, so we should be OK.

garethrees commented 7 years ago

Make the rake task part of Alaveteli itself – useful for everyone.

garethrees commented 7 years ago

`bundle exec rake reindex:events` is now running in a screen session (under my user, sudo-ed to the app user).

garethrees commented 7 years ago

Indexing has been stopped because of https://github.com/mysociety/alaveteli/issues/3604.

Abort message:

```
* queued batch ending: 175719
** Error while processing event 175719, last event successfully queued was: 175719
```

Also note that the task keeps hold of the logrotated file. Marked as new for discussion alongside https://github.com/mysociety/alaveteli/issues/3604.

garethrees commented 1 year ago

Some notes on what this is about:

If this wasn't completed, it would mean that the request_public_body_tag advanced search term doesn't have a full dataset to search on. I'm not sure if there's an easy way of finding that out.

The search engine indexes events (so that it can look at historic states and so on). To be able to search for events where the request's public body has a given tag, the search index needs to be updated with that information for each event (and there are a lot of events!). Updates are handled automatically, but the initial seeding needed to be manual (or we could just wait until every request gets updated in the normal course of things, but that would probably take tens of years to cover them all).

garethrees commented 1 year ago

To reduce the set of events we could try inspecting the Xapian value for the term, so that we only mark an event for reindexing if the value is empty.
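The filtering step could be sketched like this. Everything here is hypothetical: `tag_term_present?` stands in for whatever would actually read the term from the event's Xapian document, and the stub classes exist only so the shape of the filter is clear:

```ruby
# Hypothetical predicate: in a real implementation this would inspect the
# event's Xapian document for the request_public_body_tag term. Stubbed
# here as a simple attribute lookup for illustration.
def tag_term_present?(event)
  event.indexed_tag_terms.any?
end

# Only queue events whose index entry doesn't already carry the term.
def events_needing_reindex(events)
  events.reject { |event| tag_term_present?(event) }
end

# Stand-in for InfoRequestEvent, just for this sketch.
Event = Struct.new(:id, :indexed_tag_terms)

pending = events_needing_reindex([Event.new(1, []), Event.new(2, ['foi_monitoring'])])
```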