stats4sd / aec_portfolio

A proof of concept for the AEC Consortium Project Management / Assessment System
GNU General Public License v3.0

[BUG] Error Occurred when Getting Exchange Rate for One Day #282

Closed dan-tang-ssd closed 2 months ago

dan-tang-ssd commented 2 months ago

Both the staging env and the live env hit an error when getting the exchange rate for one day. It looks like the server that provides the exchange rate data was not accessible at that moment.


Follow-up actions:

  1. Remote login to the staging env server and manually run the command to get the exchange rate for one day. Check whether the full set of exchange rate data (i.e. 1089 exchange_rates records for 2024-09-01) can be retrieved.
  2. Do the same in live env if it works fine in staging env.
  3. Observe for a few days to see whether the same error occurs again.

Staging env error: https://stats4sd-53.sentry.io/issues/5786019463/?referrer=alert_email&alert_type=email&alert_timestamp=1725256900272&alert_rule_id=15309081&notification_uuid=3ba4fe91-ca36-49d0-b43d-b599fab0f110&environment=staging

Live env error: https://stats4sd-53.sentry.io/issues/5786037460/?alert_rule_id=15326462&alert_timestamp=1725257487923&alert_type=email&environment=production&notification_uuid=00725467-ff01-400b-8104-fb71bd09014d&project=4507611585642496&referrer=alert_email

dan-tang-ssd commented 2 months ago

I remote logged in to the servers and manually ran the command to get exchange rate data for one day. It works properly in both staging env and live env.

Discussed this issue in the Engineering team catchup. Dave advised increasing the delay before a failed job is added back to the queue, e.g. from 5 minutes to 1 hour, so that the job has a higher chance of succeeding without human intervention.

dan-tang-ssd commented 2 months ago

Checked that we can specify the queue policy when creating a queue on the Forge site.

For AEC, the queue is used solely for getting exchange rate data. We can configure it to retry the failed job after 1 hour, retry 8 times, and then give up. The job first runs at 6:00 AM, so after 8 retries it will be 2:00 PM.
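To double-check the timing above, here is a minimal Python sketch of the retry schedule. It is only an illustration, not the Forge configuration itself: the `retry_schedule` helper is hypothetical, the date is arbitrary, and it assumes the 8 retries are counted in addition to the first 6:00 AM run (which is what gives 2:00 PM). If the queue worker counts the first run as one of the 8 tries, the last attempt would be an hour earlier.

```python
from datetime import datetime, timedelta


def retry_schedule(first_run, retry_delay, max_retries):
    """Return the first run time followed by the time of each retry.

    Hypothetical helper: assumes every retry starts exactly
    `retry_delay` after the previous attempt began.
    """
    return [first_run + i * retry_delay for i in range(max_retries + 1)]


# Arbitrary example date; only the 6:00 AM start time comes from the comment above.
schedule = retry_schedule(
    first_run=datetime(2024, 9, 2, 6, 0),
    retry_delay=timedelta(hours=1),
    max_retries=8,
)

for n, run_at in enumerate(schedule):
    label = "first run" if n == 0 else f"retry {n}"
    print(f"{label}: {run_at:%H:%M}")
```

Under these assumptions the last line printed is `retry 8: 14:00`, i.e. the job gives up after the 2:00 PM attempt.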

Hopefully the 3rd party server will recover within 8 hours. Otherwise we will receive a Sentry email alert for this.

I have removed the existing queue and created a new queue with the above policy in both staging env and live env. Let's see whether it works the next time the same issue happens.


Screenshot:

image