jantman / biweeklybudget

Responsive Flask/SQLAlchemy personal finance app, specifically for biweekly budgeting.
http://biweeklybudget.readthedocs.io/en/latest/
GNU Affero General Public License v3.0
103 stars 19 forks source link

Errors running in ECS with minimal mariadb #181

Closed jantman closed 6 years ago

jantman commented 6 years ago

I'm running this in ECS on a tiny t2.micro instance. I have a mariadb:10.3.2 container running on the same instance for the database.

I've started seeing a lot of 500s...

pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT budgets.id AS budgets_id, budgets.is_periodic AS budgets_is_periodic, budgets.name AS budgets_name, budgets.description AS budgets_description, budgets.starting_balance AS budgets_starting_balance, budgets.current_balance AS budgets_current_balance, budgets.is_active AS budgets_is_active, budgets.is_income AS budgets_is_income, budgets.omit_from_graphs AS budgets_omit_from_graphs \nFROM budgets'] (Background on this error at: http://sqlalche.me/e/e3q8)

However, the mariadb logs are showing messages like:

[Warning] Aborted connection 175 to db: 'XXX' user: 'XXX' host: 'XXX' (Got timeout reading communication packets)

MariaDB is incrementing Aborted_clients but not Aborted_connects and is not incrementing any of the Connection_errors_* variables. MariaDB has its timeouts set to the defaults - 5s for connect_timeout and 28800 for wait_timeout and interactive_timeout.

MySQL docs on communication errors are here .

The SQLAlchemy error message points us to the Pooling docs on handling disconnects.

As far as I can tell, the problem is definitely because of stale connections; it seems to only happen after periods of inactivity in the application. If the app hasn't had any requests for some amount of time (let's say a few minutes, to a few hours) this seems to happen. However, if I repeatedly browse through different views in rapid succession, there's a definite point where the 500s stop happening... which seems to indicate stale connections in the pool.

From the docs, it seems like this should be fixed by adding pool_pre_ping=True to our create_engine() call. If desired, this could be controlled by an environment variable.

jantman commented 6 years ago

Released today in 1.0.0