pulibrary / orangelight

PUL Blacklight Project (Ruby 3.1.0, Rails 6.1.7)
20 stars 5 forks source link

Repeated Timeout Errors Sending Mail. #3975

Open kevinreiss opened 4 months ago

kevinreiss commented 4 months ago

Expected behavior

When a user places a request the system sends out one or more emails notifying staff of the request and confirming with the user that the request was placed.

Actual behavior

User submits a request for an item. The system displays an error message to the user because the send_mail action for one of the notification emails has generated a Timeout error. However, the API calls to place the request actually succeed so the request is successfully accepted by the desired systems (i.e. Alma, ILLiad, SCSB). The displayed failure notice in the UI then leads the user to attempt to place the request again which results in another error displayed because the processing systems think it's a duplicate request.

Steps to replicate

This an intermittent issue that is difficult to reliably replicate. This issue has increased in frequency in the last ten days, it's happening roughly 3-5 times a day.

Impact of this bug

This provides a degraded user experience. When the timeout error happens the UI tells the user the request has encountered an error and they try to request the item again. Since the API calls are actually successfully, the second attempt fails and generates a notification to the recapproblems email queue saying the user attempted to request a barcode that is not available.

Honeybadger link and code snippet, if applicable

See most recent honeybadger errors that fit this pattern, there are three distinct errors that appear all in a send_mail method.

https://app.honeybadger.io/projects/54399/faults/99737318 - Net::ReadTimeout https://app.honeybadger.io/projects/54399/faults/101572128 - Net::OpenTimeout https://app.honeybadger.io/projects/54399/faults/105534722 - Net::ReadTimeout

kevinreiss commented 4 months ago

This also seems to be happening in approvals:

https://app.honeybadger.io/projects/67941/faults/105117183 https://app.honeybadger.io/projects/67941/faults/105650358

kevinreiss commented 4 months ago

Also in the catalog email feedback options: https://app.honeybadger.io/projects/54399/faults/105717431

kevinreiss commented 4 months ago

There are multiple permutations of this issue that can be seen in honeybadger in recent weeks. All of the errors starting with Net::ReadTimeout seem to be a variation of the same thing. The system reaches a point where it is trying to send mail and the action logs and error. Looking at the pattern they have occurred occasionally over a long period of time, but switched to happening multiple times per day around March 7th.

christinach commented 4 months ago

One approach: Check for the Net::ReadTimeout error and when it happens , retry sending the email.

kevinreiss commented 4 months ago

John K. Provided this sample from the 3/18 mail log


[root@lib-ponyexpr systems]# tail -4000 /var/log/maillog | grep "did not issue MAIL/EXPN/VRFY/ETRN"
Mar 18 00:01:49 lib-ponyexpr sendmail[10262]: 42I41nDm010262: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 08:44:56 lib-ponyexpr sendmail[12613]: 42ICiuXG012613: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 09:37:07 lib-ponyexpr sendmail[12872]: 42IDb7uc012872: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 11:10:12 lib-ponyexpr sendmail[13451]: 42IFACWp013451: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 11:10:38 lib-ponyexpr sendmail[13455]: 42IFAc00013455: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 11:13:36 lib-ponyexpr sendmail[13472]: 42IFDaAq013472: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 11:21:13 lib-ponyexpr sendmail[13520]: 42IFLDEd013520: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 11:44:04 lib-ponyexpr sendmail[13710]: 42IFi4JL013710: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 12:40:09 lib-ponyexpr sendmail[14169]: 42IGe9qQ014169: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 13:41:03 lib-ponyexpr sendmail[14677]: 42IHf3Q9014677: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 15:12:14 lib-ponyexpr sendmail[15264]: 42IJCEXW015264: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 15:33:42 lib-ponyexpr sendmail[15424]: 42IJXg6e015424: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 15:48:30 lib-ponyexpr sendmail[15516]: 42IJmUEZ015516: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 16:38:31 lib-ponyexpr sendmail[15925]: 42IKcVZ1015925: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 16:45:06 lib-ponyexpr sendmail[15955]: 42IKj6Pl015955: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 16:52:55 lib-ponyexpr sendmail[16001]: 42IKqtCd016001: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 17:21:54 lib-ponyexpr sendmail[16142]: 42ILLsox016142: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 17:30:38 lib-ponyexpr sendmail[16178]: 42ILUcni016178: catalog3.princeton.edu [128.112.200.180] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 17:47:17 lib-ponyexpr sendmail[16306]: 42ILlH7I016306: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 18:16:08 lib-ponyexpr sendmail[16464]: 42IMG8ed016464: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 20:09:28 lib-ponyexpr sendmail[16911]: 42J09SkW016911: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 20:11:49 lib-ponyexpr sendmail[16931]: 42J0BnBN016931: catalog2.princeton.edu [128.112.201.229] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 20:28:50 lib-ponyexpr sendmail[17004]: 42J0SoOx017004: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Mar 18 21:21:39 lib-ponyexpr sendmail[17274]: 42J1LdEn017274: catalog1.princeton.edu [128.112.201.161] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA```
kevinreiss commented 4 months ago

Two possible fixes:

  1. @christinach suggestion: Check for the Net::ReadTimeout error and when it happens , retry sending the email.
  2. Configure the mail server to use SMTP authenticated mail with a service account and send mail directly to the campus mail server. (This is what Alma has been configured to do for example).
kevinreiss commented 4 months ago

Since this was deployed at approx 4:30 on 3/20 no errors related to mail timeouts recorded. Will see if this holds up today, I may reduce the open_timeout down to 3~5 seconds as it seems to be actually waiting 10 seconds to submit.

kevinreiss commented 4 months ago

In the five days since this update was put in production, we've seen a timeout error once related to email. This has made an impact but it does not completely get rid of the issue.

kevinreiss commented 4 months ago

possibly related to bibdata#2320, both network related issues started about the same time.

kevinreiss commented 3 months ago

Need to catch up on changes DLS has made to their SMTP infrastructure that can potentially make our set-up more robust.