pitt-crc / lmod_tracking

Database ingestion for Lmod usage logs
https://crc-pages.pitt.edu/lmod_tracking/
GNU General Public License v3.0
0 stars 0 forks source link

Ingestion fails for large log files #26

Closed djperrefort closed 10 months ago

djperrefort commented 11 months ago

A few log files are failing to ingest due to a connection error:

2023-08-18 20:23:38,823 - INFO: Ingesting /var/log/lmod.log-20230702
2023-08-18 20:23:38,823 - INFO: Fetching database connection details
2023-08-18 20:23:38,851 - INFO: Parsing log data
2023-08-18 20:23:41,625 - INFO: Ingesting data into database table lmod_tracking.log_data
2023-08-19 04:36:54,377 - ERROR: (mysql.connector.errors.OperationalError) 2013 (HY000): Lost connection to MySQL server during query
 [SQL: INSERT INTO log_data (logname, time, host, user, module, path, `package`, version) VALUES (...) (Background on this error at: https://sqlalche.me/e/20/e3q8)
2023-08-19 04:36:56,520 - INFO: Ingesting /var/log/lmod.log-20230703
2023-08-19 04:36:56,520 - INFO: Fetching database connection details
2023-08-19 04:36:56,558 - INFO: Parsing log data
2023-08-19 04:37:02,153 - INFO: Ingesting data into database table lmod_tracking.log_data
2023-08-20 16:06:08,114 - ERROR: (mysql.connector.errors.OperationalError) 2013 (HY000): Lost connection to MySQL server during query
[SQL: INSERT INTO log_data (logname, time, host, user, module, path, `package`, version) VALUES (...) (Background on this error at: https://sqlalche.me/e/20/e3q8)

The link provided by the error message has this to say:

OperationalError

Exception raised for errors that are related to the database’s operation and not necessarily under the control of the programmer, e.g. an unexpected disconnect occurs, the data source name is not found, a transaction could not be processed, a memory allocation error occurred during processing, etc.

This error is a DBAPI Error and originates from the database driver (DBAPI), not SQLAlchemy itself.

The OperationalError is the most common (but not the only) error class used by drivers in the context of the database connection being dropped, or not being able to connect to the database. For tips on how to deal with this, see the section Dealing with Disconnects.

Further documentation talks about pre ingestion health checks and refreshing stale connection pools - neither of which should be relevant here.

The files being ingested are the two largest log files and error after several hours.

root root  29M Jul  2 03:39 lmod.log-20230702
root root  55M Jul  3 03:27 lmod.log-20230703

At first I thought this might be a timeout error, but the first file error after 6 hours and the second after 15.5. I expect a simple timeout error to be more consistent.

The next largest log files are 11M which ingest without issue:

root root  11M Jun 20 03:45 lmod.log-20230620
root root  11M Jun 21 03:31 lmod.log-20230621

In general, these ingestion times are problematically long. Speeding up the ingestion (not sure how) might alleviate the symptoms, but I'd like to understand the core underlying problem.