pulibrary / pulfalight

This is an implementation of the Princeton University Library Finding Aids (PULFA) service using ArcLight.

Re-enable replication to lib-solr-prod4 #936

Closed by hackartisan 2 years ago

hackartisan commented 2 years ago

Timeout errors when indexing the following:

All of these had a handful to a dozen identical indexing jobs retrying in Sidekiq, so they must be getting queued up repeatedly. I killed the duplicates for all of these, but I expect they will pile up again.
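For reference, a minimal sketch of how duplicate retries could be picked out before killing them. It is plain Ruby, not the app's code: `QueuedJob` and `duplicate_jobs` are hypothetical names, and the job arguments shown are assumptions (the real arguments to the indexing job are not in this issue). How the class and arguments are read from real Sidekiq entries depends on how the jobs are enqueued, so that part is left out.

```ruby
# Hypothetical stand-in for a queued or retrying job: a class name plus arguments.
QueuedJob = Struct.new(:klass, :args)

# Returns every job that repeats an earlier (klass, args) pair.
# The first occurrence of each pair is kept out of the result,
# so deleting the returned jobs leaves exactly one of each.
def duplicate_jobs(jobs)
  seen = {}
  jobs.select do |job|
    key = [job.klass, job.args]
    already_seen = seen.key?(key)
    seen[key] = true
    already_seen
  end
end

# Example data: the class name and arguments are assumed for illustration.
jobs = [
  QueuedJob.new("AspaceIndexJob", ["MC016"]),
  QueuedJob.new("AspaceIndexJob", ["MC016"]),
  QueuedJob.new("AspaceIndexJob", ["C0001"]),
  QueuedJob.new("AspaceIndexJob", ["MC016"])
]

duplicates = duplicate_jobs(jobs)
puts "#{duplicates.length} duplicate job(s) found"
```

This only removes the pile-up after the fact; it does not explain why the same job is being enqueued repeatedly.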

When they fail, it looks like they write the entire EAD record to the logs. This is causing a space issue on pulfalight-worker1: its drive is currently full from 30G of production logs.
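One possible mitigation for the log growth, sketched under assumptions: cap the length of an error message before it is logged, so a failed update does not dump the whole record. `truncated_message` and `MAX_LOG_CHARS` are hypothetical names and the 500-character limit is arbitrary; where this would hook into the job or logger is not shown here.

```ruby
# Assumed limit on how much of an error message is written to the log.
MAX_LOG_CHARS = 500

# Returns the error's message, cut to `limit` characters with a note
# saying how many characters were dropped. Short messages pass through unchanged.
def truncated_message(error, limit = MAX_LOG_CHARS)
  message = error.message.to_s
  return message if message.length <= limit

  omitted = message.length - limit
  "#{message[0, limit]}... [#{omitted} more characters omitted]"
end

# Example: a stand-in error with a very long message, like a serialized record.
long_error = StandardError.new("x" * 1000)
puts truncated_message(long_error).length
```

The truncated text keeps the start of the message, which in the error below is the part that names the Solr URL and the record id.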

[PULFALight/production] RSolr::Error::Http: RSolr::Error::Http
URI: http://lib-solr-prod4.princeton.edu:8983/solr/pulfalight-production/update?wt=json
Request Headers: {"Content-Type"=>"application/json"}
Request Data: "[{\"id\":\"MC016\",\"ead_ssi\":\"MC016\",\"title_ssm\":\"John Foster Dulles Papers\",\"title_teim\":\"John Foster Dulles Papers\",\"subtitle_ssm\":\"John Foster Dulles Papers\",\"subtitle_teim\":\"John Foster Dulles Papers\",\"ark_tsim\":\"http://arks.princeton.edu/ark:/88435/br86b3576\",...

Backtrace

line 58 of [PROJECT_ROOT]/app/jobs/aspace_index_job.rb: perform

View full backtrace and more info at honeybadger.io

Sudden Priority Justification

Archivists are unable to see their changes until this is fixed.

tpendragon commented 2 years ago

We think this is similar to the issue in Figgy: replication across data centers times out and breaks. I've removed lib-solr-prod4 as a target for Pulfalight and retried the jobs.

tpendragon commented 2 years ago

These succeeded after removing that replication.

tpendragon commented 2 years ago

Ops has moved lib-solr-prod4 to the same datacenter as 5/6. Re-enable replication during maintenance week and close this.