mysociety / whatdotheyknow-theme

The Alaveteli theme for WhatDoTheyKnow (UK)
http://www.whatdotheyknow.com/
MIT License

Reindex InfoRequestEvents with new request_public_body_tag term #344

Open garethrees opened 7 years ago

garethrees commented 7 years ago

~1,912,247 events total.

Normal operation indexes 50–100 events every 5 minutes (from the acts_as_xapian_jobs table) and takes < 30 seconds to index them all, so ~300 could probably be indexed safely in each 5-minute slot?

= 3,600 indexed per hour = ~531.18 hours for all = ~22.13 days
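The estimate above can be checked with a few lines of Ruby (figures from this comment; it assumes a steady 300 events per 5-minute slot):

```ruby
total_events    = 1_912_247             # approximate InfoRequestEvent count
events_per_slot = 300                   # safely indexed per 5-minute cron slot
events_per_hour = events_per_slot * 12  # 12 five-minute slots per hour => 3,600

hours_for_all = total_events / events_per_hour.to_f
days_for_all  = hours_for_all / 24

puts format('%.2f hours (%.2f days)', hours_for_all, days_for_all)
```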

Idea is that we drip-feed info request events to the jobs table so that it gradually reindexes the old events with the new term.

garethrees commented 7 years ago

This is the gist of what I think we want:

```ruby
InfoRequestEvent.find_in_batches(batch_size: 300) do |events|
  events.each(&:xapian_mark_needs_index)
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end
```

Need to consider:

lizconlan commented 7 years ago

to pick up where we left off (or at least close to it), something like...

```ruby
# Use a bound parameter rather than string interpolation, and coerce the
# env var (a string) to an integer.
start_id = ENV.fetch("START_ID", 0).to_i
InfoRequestEvent.where("id > ?", start_id).find_in_batches(batch_size: 300) do |events|
  events.each(&:xapian_mark_needs_index)
  logger.info("last event indexed: #{events.last.id}")
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end
```

So if the last logged success is 299, set START_ID to 299 to kick the next batch off at id 300.
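To avoid reading the log by eye, the resume point could be parsed out of it. A sketch, assuming the log line format from the snippet above (`last_indexed_id` is a hypothetical helper, not existing code):

```ruby
# Hypothetical helper: extract the last successfully queued event id from
# log lines like "last event indexed: 299"; 0 means start from the beginning.
def last_indexed_id(log_lines)
  ids = log_lines.filter_map { |line| line[/last event indexed: (\d+)/, 1]&.to_i }
  ids.last || 0
end

log = [
  'last event indexed: 299',
  'some unrelated line',
]

# An explicit START_ID still wins over the value recovered from the log.
start_id = ENV.fetch('START_ID', last_indexed_id(log)).to_i
```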

garethrees commented 7 years ago

Looks good. I think next thing to do is try this with an initial handful of batches to check that we can effectively process the jobs in the 5 minute window.

garethrees commented 7 years ago

Just a reminder - we need to update wdtk before we do this for real.

lizconlan commented 7 years ago

Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

lizconlan commented 7 years ago

And is using logger to write to the existing log sufficient?

garethrees commented 7 years ago

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

> And is using logger to write to the existing log sufficient?

I was wondering about this. My first thought was to create a separate logger just for clarity, but we could just add a prefix to log messages generated by this for easy grepping. I have no real preference – whatever you think will make it easier to check every day.

Will also want to make sure exceptions are logged.
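One way to get both the greppable prefix and the exception logging might look like this (a sketch: the `[reindex]` prefix, the `reindex_batch` helper, and the stand-in event class are all assumptions, not existing code):

```ruby
require 'logger'
require 'stringio'

log_output = StringIO.new
logger = Logger.new(log_output)

# Prefix every message from this task so it can be grepped out of the shared
# log, and record any exception before re-raising so failures are visible.
def reindex_batch(events, logger)
  events.each(&:xapian_mark_needs_index)
  logger.info("[reindex] last event indexed: #{events.last.id}")
rescue StandardError => e
  logger.error("[reindex] failed around event #{events.last && events.last.id}: #{e.class}: #{e.message}")
  raise
end

# Stand-in for InfoRequestEvent, just for this sketch.
FakeEvent = Struct.new(:id) do
  def xapian_mark_needs_index; end
end

reindex_batch([FakeEvent.new(1), FakeEvent.new(2)], logger)
```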

lizconlan commented 7 years ago

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

> I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

There doesn't seem to be anything in the deploy script, so we should be OK.

garethrees commented 7 years ago

Make the rake task part of Alaveteli itself – useful for everyone.

garethrees commented 7 years ago

`bundle exec rake reindex:events` is now running in a screen session (under my user, sudo-ed to the app user).

garethrees commented 7 years ago

Indexing has been stopped because of https://github.com/mysociety/alaveteli/issues/3604.

Abort message:

```
* queued batch ending: 175719
** Error while processing event 175719, last event successfully queued was: 175719
```

Also note that the task keeps hold of the logrotated file. Marked as new for discussion alongside https://github.com/mysociety/alaveteli/issues/3604.

garethrees commented 1 year ago

Some notes on what this is about:

If this wasn't completed, it would mean that the request_public_body_tag advanced search term doesn't have a full dataset to search on. I'm not sure if there's an easy way of finding that out.

The search engine indexes events (so that it can look at historic states and so on). To be able to search for events where the request's public body has a given tag, the search index needs to be updated with that information for each event (and there are a lot of events!). Updates are handled automatically, but the initial seeding needed to be manual (or we could just wait until every request gets updated in the normal course of things, but that would probably take tens of years to cover them all).

garethrees commented 1 year ago

To reduce the set of events we could try inspecting the Xapian value for the term, so that we only mark an event for reindexing if the value is empty.
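The filtering step could be sketched like this. Everything here is hypothetical: `tag_term_present?` stands in for whatever would actually read the term from the event's Xapian document, and the stub classes exist only so the shape of the filter is clear:

```ruby
# Hypothetical predicate: in a real implementation this would inspect the
# event's Xapian document for the request_public_body_tag term. Stubbed
# here as a simple attribute lookup for illustration.
def tag_term_present?(event)
  event.indexed_tag_terms.any?
end

# Only queue events whose index entry doesn't already carry the term.
def events_needing_reindex(events)
  events.reject { |event| tag_term_present?(event) }
end

# Stand-in for InfoRequestEvent, just for this sketch.
Event = Struct.new(:id, :indexed_tag_terms)

pending = events_needing_reindex([Event.new(1, []), Event.new(2, ['foi_monitoring'])])
```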