dacook commented 1 year ago

The UK server has had downtime, which seems to be the result of memory being fully allocated. We were working around this by restarting Puma daily, although for the last week this hasn't seemed to help.

So let's hurry up and make use of the new background reports feature. It's still in progress, but will be an improvement. https://openfoodnetwork.slack.com/archives/C01T75H6G0Z/p1681840327630489

It should significantly reduce memory issues

The UX is slightly better than before (a message with a link instead of a 500 snail)

Prep

Run a few reports to test baseline, and document results
Check some stats for baseline
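
To keep the before and after report results comparable, here is a minimal sketch of a helper that times a request and prints it in the notation used in this thread (status, marker, duration). It is an illustration only: `time_report`, `format_result` and the `fetch` callable are hypothetical names, and how the report is actually requested (URL, admin session) is left to the caller.

```python
import time
from typing import Callable

# Status markers copied from the notation used in this thread.
MARKERS = {200: "✅", 404: "🆇", 500: "🐌"}


def format_duration(seconds: float) -> str:
    """Render seconds as '12s' below one minute, otherwise as minutes like '2.0m'."""
    if seconds < 60:
        return f"{seconds:.0f}s"
    return f"{seconds / 60:.1f}m"


def format_result(label: str, status: int, seconds: float) -> str:
    """Build one result line, e.g. '1 month: 200 ✅ after 12s'."""
    marker = MARKERS.get(status, "?")  # unknown status codes get a plain '?'
    return f"{label}: {status} {marker} after {format_duration(seconds)}"


def time_report(label: str, fetch: Callable[[], int]) -> str:
    """Call fetch(), which must return an HTTP status code, and time the call.

    fetch is supplied by the caller (hypothetical); it could wrap any HTTP
    client that requests the report with an authenticated admin session.
    """
    start = time.perf_counter()
    status = fetch()
    return format_result(label, status, time.perf_counter() - start)
```

Passing the same labels and date ranges before and after the toggle makes the two runs line up directly.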

Deployment plan

Toggle feature fully on (https://openfoodnetwork.org.uk/admin/feature-toggle/features/background_reports)
Restart Puma to release memory (Puma will no longer need it for reports, but Sidekiq will)
Validation: check the same reports again
Check stats again

Rollback plan

Toggle feature fully off (https://openfoodnetwork.org.uk/admin/feature-toggle/features/background_reports)
Restart Sidekiq to release memory
Validation: check the same reports again

dacook commented 1 year ago

Before

Report test

Order Cycle Supplier Totals

1 month: 200 ✅ after 12s
1 year: 200 ✅ after 2.0m (I'm wondering if due to a DB timeout, so perhaps the data set isn't complete?)
All time: 500 🐌 after approx. 2.5m. Looks like the worker process was forcefully terminated because the system ran out of memory.

Memory usage

Before running reports:

ofn-admin@production18:~$ ps -eo size,pid,user,command | egrep 'puma|sidekiq'
76700 11211 openfoo+ puma 6.2.2 (unix:///home/openfoodnetwork/apps/openfoodnetwork/shared/sock/puma.openfoodnetwork.sock) [2023-04-25-131137]
1914572 22139 openfoo+ sidekiq 7.0.9 2023-04-25-131137 [0 of 5 busy]
1498212 25734 openfoo+ puma: cluster worker 0: 11211 [2023-04-25-131137]
1703028 25793 openfoo+ puma: cluster worker 1: 11211 [2023-04-25-131137]

https://app.datadoghq.com/dashboard/bdw-2na-83i/openfoodnetworkorguk-cloned

[Screenshot: Screen Shot 2023-04-27 at 12 41 47 pm]

After report terminated:

[Screenshot: Screen Shot 2023-04-27 at 1 55 32 pm]

dacook commented 1 year ago

After

Report test

Order Cycle Supplier Totals

On screen ❌ failed (due to ActionView::Template::Error)
- 1 month on-screen: 500 🐌 after 13s
- 1 year on-screen: 500 🐌 after 2.0m
PDF
- 1 month PDF: 200 ✅ after 13s
- 3 months PDF: 200 ✅ after 36s
- 1 year PDF: 200 ✅ after 3.1m
- All time PDF: 200 ✅ "This report is taking longer to process" after 12.0m (but the job was terminated after 2 min when the system ran out of memory)
  - "Download report (when available)": 404 🆇

Memory usage

After running report for 1 year (took extra memory but didn't reach system limit). We can see that Sidekiq has allocated the most memory, because it is now running the report.

ofn-admin@production18:~$ ps -eo size,pid,user,command | egrep 'puma|sidekiq'
76700 11211 openfoo+ puma 6.2.2 (unix:///home/openfoodnetwork/apps/openfoodnetwork/shared/sock/puma.openfoodnetwork.sock) [2023-04-25-131137]
1635444 13831 openfoo+ puma: cluster worker 0: 11211 [2023-04-25-131137]
1913708 13881 openfoo+ puma: cluster worker 1: 11211 [2023-04-25-131137]
4164016 16145 openfoo+ sidekiq 7.0.9 2023-04-25-131137 [0 of 5 busy]
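
To make the shift from Puma to Sidekiq easier to read, here is a minimal sketch that totals the `size` column per process family from the two `ps` listings in this thread. The sample data is copied from the outputs above; `parse_ps` is a hypothetical helper, and treating `size` as kilobytes is an assumption about how `ps` reports it on this server.

```python
def parse_ps(output: str) -> dict[str, int]:
    """Total the first (size) column per process family: puma and sidekiq."""
    totals = {"puma": 0, "sidekiq": 0}
    for line in output.strip().splitlines():
        fields = line.split(None, 3)  # size, pid, user, command
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip prompts, headers and blank lines
        size, command = int(fields[0]), fields[3]
        for family in totals:
            if command.startswith(family):
                totals[family] += size
    return totals


# Listings copied from this thread (before: feature off; after: 1 year report).
BEFORE = """
76700 11211 openfoo+ puma 6.2.2 (unix:///home/openfoodnetwork/apps/openfoodnetwork/shared/sock/puma.openfoodnetwork.sock) [2023-04-25-131137]
1914572 22139 openfoo+ sidekiq 7.0.9 2023-04-25-131137 [0 of 5 busy]
1498212 25734 openfoo+ puma: cluster worker 0: 11211 [2023-04-25-131137]
1703028 25793 openfoo+ puma: cluster worker 1: 11211 [2023-04-25-131137]
"""

AFTER = """
76700 11211 openfoo+ puma 6.2.2 (unix:///home/openfoodnetwork/apps/openfoodnetwork/shared/sock/puma.openfoodnetwork.sock) [2023-04-25-131137]
1635444 13831 openfoo+ puma: cluster worker 0: 11211 [2023-04-25-131137]
1913708 13881 openfoo+ puma: cluster worker 1: 11211 [2023-04-25-131137]
4164016 16145 openfoo+ sidekiq 7.0.9 2023-04-25-131137 [0 of 5 busy]
"""

before, after = parse_ps(BEFORE), parse_ps(AFTER)
for family in before:
    delta = after[family] - before[family]
    # Assumes the size column is in kilobytes (1024 * 1024 KB per GiB).
    print(f"{family}: {before[family]} -> {after[family]} ({delta / 1024 / 1024:+.1f} GiB)")
```

On these two listings, Sidekiq accounts for most of the increase, which matches the observation above that it is now running the report.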

Conclusion

Not good. Displaying on-screen results fails with an error in every case, so we need to abort.

Also:

Very large reports can still fail by running out of memory, because we still have the same memory limits.
When this feature is enabled, we can reduce the UK server timeout back to a normal level (e.g. 1-2 min).
When the report dies, the Puma worker and browser continue to wait for a result. This should already be fixed in the next iteration in development.

dacook commented 1 year ago

Rollback: I've disabled the feature toggle and restarted both Sidekiq and Puma. Memory usage is back to a normal level. Tested to confirm:

1 month: 200 ✅ after 13s

I can see that the last two nightly Puma restarts successfully reset the memory, so I think no further action is required in the short term.

openfoodfoundation / openfoodnetwork

Enable background reports on UK #10757
