pulibrary / bibdata

Local API for retrieving bibliographic and other useful data from Alma (Ruby 3.1.0, Rails 7.1.3.4)
BSD 2-Clause "Simplified" License

Network errors cause mmsIDs to fail to index #2320

Closed: christinach closed this issue 4 months ago

christinach commented 5 months ago

On March 14th, we noticed some network errors that randomly caused a number of mms_ids to fail to index. These are not errors related to the data.

The network errors can be found in the traject logs from both bibdata-alma-workers with a March 14th time stamp.

example log:

2024-03-14T16:14:45-04:00 [ERROR] 841218 : Could not add record 99130256719806421 at source file position 4339:     Exception: Errno::ECONNREFUSED: Connection refused - Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983 (lib-solr8-prod.princeton.edu:8983)
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:623:in `rescue in create_socket'
    Caused by
    Errno::ECONNREFUSED: Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'

Acceptance Criteria

Implementation Notes

Traject does not include any Rails helpers because it does not load the Rails environment when running outside of the test setting.
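
Since the traject config only has plain Ruby available, any retry handling for these transient network errors would have to be written by hand rather than borrowed from a Rails helper. A minimal sketch (hypothetical, not code from this repo) of wrapping a Solr POST in a bounded retry:

```ruby
# Hypothetical sketch only: retry transient network errors around a Solr write
# using plain Ruby, since the traject config does not load the Rails environment.
require 'httpclient'

MAX_RETRIES = 3
RETRYABLE_ERRORS = [
  Errno::ECONNREFUSED,             # connection refused
  SocketError,                     # DNS resolution failure
  HTTPClient::ReceiveTimeoutError  # read timeout from Solr
].freeze

def post_with_retries(client, url, body)
  attempts = 0
  begin
    client.post(url, body, 'Content-Type' => 'application/json')
  rescue *RETRYABLE_ERRORS
    attempts += 1
    raise if attempts >= MAX_RETRIES
    sleep(2**attempts) # simple exponential backoff before retrying
    retry
  end
end

# Example use (hypothetical URL):
# client = HTTPClient.new
# post_with_retries(client, 'http://lib-solr8-prod.princeton.edu:8983/solr/catalog-alma-production/update/json', record_json)
```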

Honeybadger error:

honeybadger#103044592

kevinreiss commented 5 months ago

Just noting that the network errors observed in the traject logs on bibdata-alma-worker{1,2} during the week of 3/18 take a different form and don't log a specific MMS ID.

They show up as:


    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:625:in `rescue in create_socket'
    Caused by
    SocketError: getaddrinfo: Temporary failure in name resolution
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'
kevinreiss commented 4 months ago

A third permutation:

    2024-03-24T20:20:19-04:00 [FATAL] 2007455 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: SocketError: getaddrinfo: Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983)

christinach commented 4 months ago

In the open house on 3/26/2024 we checked the physical location of the solr-prod and bibdata-prod VMs. We noticed that the lib-solr-prod8 boxes are in Forrestal while the bibdata-alma-workers and bibdata web servers are in New South. We moved bibdata-alma-worker{1,2} from New South to Forrestal.

Ops informed us about a private, non-routable IP update that failed for host 1a, which is in Forrestal; they are currently working on this upgrade. The upgrade started on March 11th, 2024, which is very close to when we started seeing indexing failures because of Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983).

We will monitor the logs today (3/26/2024) and tomorrow and let Ops know if moving the VMs fixed the issue.

Thank you to everyone who worked on this during the Ansible open house: @VickieKarasic @acozine @kayiwa @carolyncole @sandbergja @maxkadel @leefaisonr @christinach

kevinreiss commented 4 months ago

The last timestamp of a network-related error on bibdata-alma-worker{1,2} is:

2024-03-26T14:52:49-04:00 [FATAL] 2651102 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: HTTPClient::ReceiveTimeoutError: execution expired

christinach commented 4 months ago

The logs from yesterday indicate that the issue was last triggered at 2024-03-26T14:52:49-04:00. We will keep an eye on the logs to see whether the network issue was fixed and there are no errors today. We moved the bibdata worker VMs to the same physical location as the solr-prod VMs at 12:40pm. My understanding is that if the issue was fixed after 2024-03-26T14:52:49-04:00, it was for a different reason and not because we moved the VMs to the same location.

christinach commented 4 months ago

Added two more logs to the traject error logs on worker1 and worker2. The error timestamp is 2024-03-27T21:52:52-04:00. The first error is a Solr 503 response caused by `"msg":"No registered leader was found after waiting for 4000ms"`.

The second error, also at 2024-03-27T21:52:52-04:00, is a Solr 500 response caused by `Error from server at http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/: null\n\n\n\nrequest: http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/\nRemote error message: java.io.IOException: No space left on device`.

christinach commented 4 months ago

@maxkadel and I are looking in Datadog: time window March 27, 2024, 21:51 - 22:04.

lib-solr-prod6 error

org.apache.solr.common.SolrException: missing content stream
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:63)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:505)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
    at java.lang.Thread.run(Thread.java:750)

2024/03/28 01:51:00 [warn] 756033#756033: *13916733 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000884822, client: 128.112.204.56, server: lib-solr8-prod.princeton.edu, request: "POST /solr/catalog-alma-production/update/json HTTP/1.1", host: "lib-solr8-prod.princeton.edu:8983"

christinach commented 4 months ago

Today Operations added 100 GB to each of the solr-prod boxes to avoid one of the errors above, where the replica fails to build because there is no space left on the device. See the docs from the incident report.

christinach commented 4 months ago

On March 28th, 2024:

On March 29th, 2024:

christinach commented 4 months ago

Suggestion 1: Add a feature to the bibdata admin UI that shows information from the Index Manager, for example: the last dump indexed, the date/timestamp of the event, and the in-progress boolean.

Suggestion 2: Create an alert in Datadog that triggers if the last indexed dump event timestamp is more than one day old. A rough sketch of the check is below.
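
A rough sketch of the staleness check behind Suggestion 2, assuming a hypothetical `IndexManager` model with a `last_dump_completed_at` column (placeholder names, not the actual bibdata schema); a Datadog log monitor could then alert on the `INDEX_STALE` line:

```ruby
# Hypothetical sketch for Suggestion 2. IndexManager and last_dump_completed_at
# are placeholder names, not the actual bibdata models/columns.
namespace :index do
  desc 'Log a warning if the most recent dump was indexed more than one day ago'
  task staleness_check: :environment do
    last_indexed_at = IndexManager.maximum(:last_dump_completed_at)

    if last_indexed_at.nil? || last_indexed_at < 1.day.ago
      # A recognizable log line (or a Datadog custom metric) gives the monitor
      # something simple to alert on.
      Rails.logger.error("INDEX_STALE: last dump indexed at #{last_indexed_at || 'never'}")
    else
      Rails.logger.info("INDEX_OK: last dump indexed at #{last_indexed_at}")
    end
  end
end
```

The same query could also back the admin UI panel described in Suggestion 1.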