openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0

500 error due to database connection pool being empty #12761

Closed rioug closed 5 days ago

rioug commented 1 month ago

Description

uk_prod sometimes gets a database connection error indicating that the connection pool is empty (see Bugsnag). This prevents the website from loading.

This is most likely due to a high number of Rack Timeout errors: https://app.bugsnag.com/yaycode/openfoodnetwork-uk/errors/664378ec33e1080008ba685a?filters[error.status]=open&filters[event.since]=30d It's not known why these timeouts are happening. One hypothesis is that the server was busier than usual, perhaps because a higher than usual number of reports were run at the time. Reports shouldn't affect the database connection pool, as they are generated in the background and Sidekiq has its own connection pool, but they can add significant load to the server, making other requests slow.
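To illustrate how repeated timeouts could drain a pool, here is a toy model. It is only a sketch under assumptions: `TinyPool`, `FakeTimeout` and `handle_request` are hypothetical names, this is not how ActiveRecord's pool is implemented, and nothing in this issue confirms that this is the mechanism at play on uk_prod.

```ruby
# Toy fixed-size pool. Hypothetical, NOT ActiveRecord's implementation.
class TinyPool
  class EmptyError < StandardError; end

  def initialize(size)
    @available = Array.new(size) { |i| "conn-#{i}" }
    @lock = Mutex.new
  end

  # Hands out a connection, or raises if none are left.
  def checkout
    @lock.synchronize do
      raise EmptyError, "no connection available" if @available.empty?

      @available.pop
    end
  end

  # Returns a connection to the pool.
  def checkin(conn)
    @lock.synchronize { @available.push(conn) }
  end

  def available
    @lock.synchronize { @available.size }
  end
end

# Stands in for a request being interrupted by a timeout.
class FakeTimeout < StandardError; end

# Simulates one request that times out while holding a connection.
# With leak: true the connection is never returned to the pool.
def handle_request(pool, leak:)
  conn = pool.checkout
  begin
    raise FakeTimeout, "request ran too long"
  rescue FakeTimeout
    pool.checkin(conn) unless leak
    :timed_out
  end
end
```

In this model, a pool of size 2 survives any number of `handle_request(pool, leak: false)` calls, but after two calls with `leak: true` the next `checkout` raises `TinyPool::EmptyError`, and only recreating the pool (the equivalent of restarting the process) recovers it.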

This article https://medium.com/@mendespedro77/solving-activerecord-connection-pool-errors-in-rails-applications-b7a5861573b9 provides various suggestions we can follow to get to the bottom of this.

For now we applied a configuration change for rack-timeout that should mitigate the issue: https://github.com/openfoodfoundation/ofn-install/pull/932 We also made a small change to our logging which will give us more information if the issue crops up again: https://github.com/openfoodfoundation/openfoodnetwork/pull/12715

This error has also been seen on fr prod on the 5th of August 2024 : https://app.bugsnag.com/open-food-france/coopcircuits/errors/66afb3f75e9c074bc47a6cb2?event_id=66afb3f700f69a1692450000&i=sk&m=nw

Expected Behavior

The website loads without error

Actual Behaviour

The website doesn't load and returns a 500 error

Steps to Reproduce

Animated Gif/Screenshot

Workaround

Restart the server/puma or apply config change https://github.com/openfoodfoundation/ofn-install/pull/932

Severity

bug-s3: a feature is broken but there is a workaround

Your Environment

Possible Fix

rioug commented 4 weeks ago

The configuration seems to be effective on uk_prod: the problem has not occurred since the configuration change. I also observed this in the logs:

[43bd9a87-443d-46ff-82e3-b9e8bc749552] Rack::Timeout::RequestTimeoutException (Request ran for longer than 120000ms , 2/3 timeouts allowed before SIGTERM for process 5357)

rioug commented 4 weeks ago

The error showed up on fr_prod again on the 17th of August : https://app.bugsnag.com/open-food-france/coopcircuits/errors/66c04a487fce022d1c502938?event_id=66c04a4800f8769690320000&i=sk&m=nw

rioug commented 2 weeks ago

The error showed up on fr_prod on the 1st of September : https://app.bugsnag.com/open-food-france/coopcircuits/errors/66d4a99b0ab81c7871222ec5?event_id=66d4a99b00f88fc102370000&i=sk&m=nw

The fix for fr_prod had not been deployed; this is now done.

rioug commented 5 days ago

From the fr_prod log :

Rack::Timeout::RequestTimeoutException (Request waited 2ms, then ran for longer than 119998ms , 1/3 timeouts allowed before SIGTERM for process 3585187)
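To keep an eye on these messages, the two log shapes quoted in this thread can be parsed with a small helper. This is only a sketch: `TIMEOUT_LINE` and `parse_timeout_line` are hypothetical names, and the pattern is derived from the two lines above rather than from rack-timeout's source. Note that 2ms + 119998ms = 120000ms, which would be consistent with a 120-second budget that includes time spent waiting; that reading is an assumption.

```ruby
# Matches both message shapes quoted in this thread (with or without
# the "waited Nms, then" part). Derived from those lines only.
TIMEOUT_LINE = /
  Request\s+
  (?:waited\s+(?<waited_ms>\d+)ms,\s+then\s+)?
  ran\s+for\s+longer\s+than\s+(?<ran_ms>\d+)ms\s*,\s*
  (?<count>\d+)\/(?<limit>\d+)\s+timeouts\s+allowed\s+before\s+SIGTERM\s+
  for\s+process\s+(?<pid>\d+)
/x

# Returns a hash of the numbers in the message, or nil if the line
# does not look like one of these timeout messages.
def parse_timeout_line(line)
  m = TIMEOUT_LINE.match(line)
  return nil unless m

  {
    waited_ms: m[:waited_ms]&.to_i, # nil when the line has no wait time
    ran_ms: m[:ran_ms].to_i,
    count: m[:count].to_i,
    limit: m[:limit].to_i,
    pid: m[:pid].to_i
  }
end
```

Counting how often `count` gets close to `limit` per `pid` would show how near a process is to being terminated.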

It looks like it's working as intended: we have not seen the problem since the configuration change.
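The "n/3 timeouts allowed before SIGTERM" counter in the log lines above can be pictured with a toy model. This is a sketch, not rack-timeout's code: `TimeoutCounter` is a hypothetical class, the SIGTERM message text is illustrative, and it assumes the process is terminated on the timeout that reaches the limit, after which the process manager (Puma here) would start a replacement with a fresh connection pool. None of that is confirmed in this thread.

```ruby
# Toy per-process timeout counter. Hypothetical, not rack-timeout's code.
class TimeoutCounter
  def initialize(limit:, pid:)
    @limit = limit
    @pid = pid
    @count = 0
  end

  # Records one timeout and returns [action, message].
  # Assumption: the process is terminated once the count reaches the limit.
  def record_timeout
    @count += 1
    if @count >= @limit
      [:sigterm, "sending SIGTERM to process #{@pid}"]
    else
      [:continue, "#{@count}/#{@limit} timeouts allowed before SIGTERM for process #{@pid}"]
    end
  end
end
```

With `limit: 3`, the first two timeouts only log a message like the ones seen on uk_prod and fr_prod, and the third one asks for the process to be terminated.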