[dataspace] Production Environment raises PostgreSQL connectivity errors when restarting

pulibrary / dspace-development

DSpace infrastructure and development resources for the Princeton University Library.

https://dspace-development.readthedocs.io/en/latest/

1 stars 0 forks source link

[dataspace] Production Environment raises PostgreSQL connectivity errors when restarting #829

Closed jrgriffiniii closed 5 days ago

jrgriffiniii commented 2 months ago

The following is logged when restarting or rebuilding the DataSpace production deployment:

Caused by: javax.servlet.ServletException: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
        at org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:917)
        at org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:846)
        at org.apache.jsp.error._404_jsp._jspService(_404_jsp.java:291)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:71)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:477)
        ... 23 more

jrgriffiniii commented 2 months ago

This is no longer a high priority, as the DataSpace production deployment was restored. However, I am still uncertain as to what triggered the restoration of PostgreSQL connections.

jrgriffiniii commented 2 months ago

This error arose once again following the server maintenance procedures undertaken on a weekly basis. For this incident, there are reportedly difficulties related to the Google Cloud Platform service provider, and this has been blocking me from accessing the production PostgreSQL server which is required by the production deployment of DataSpace.

jrgriffiniii commented 2 months ago

I could not connect to the database server from the production DataSpace server, however, Alicia has identified the server on the GCP with: https://console.cloud.google.com/logs/query;query=resource.type%3D%22cloudsql_database%22%0Aresource.labels.database_id%3D%22pul-gcdc:pul-prod-db%22%0Alog_name%3D%22projects%2Fpul-gcdc%2Flogs%2Fcloudsql.googleapis.com%252Fpostgres.log%22;cursorTimestamp=2024-06-26T14:43:03.433424Z;duration=PT1H?q=search&referrer=search&project=pul-gcdc

jrgriffiniii commented 2 months ago

The following is being logged from the DataSpace prod host:

{
  "textPayload": "2024-06-26 14:59:28.453 UTC [812056]: [1-1] db=dataspace_prod_db,user=dataspace_prod_db_user FATAL:  password authentication failed for user \"dataspace_prod_db_user\"",
  "insertId": "s=00ecd4b5b5f14233aeaab022178f6666;i=23a817;b=2eedae0a03d44e5c81bede0663c852aa;m=e2174a7a63;t=61bcc42cfac6f;x=b70bf5e216e07c98-0@a3",
  "resource": {
    "type": "cloudsql_database",
    "labels": {
      "database_id": "pul-gcdc:pul-prod-db",
      "project_id": "pul-gcdc",
      "region": "us-east4"
    }
  },
  "timestamp": "2024-06-26T14:59:28.453584Z",
  "severity": "ALERT",
  "labels": {
    "INSTANCE_UID": "11-c6b91f46-57e1-4f05-bece-999102cc27a8",
    "LOG_BUCKET_NUM": "56",
    "SOURCE_ID": "30373532306631343535316233303537e3b0c442"
  },
  "logName": "projects/pul-gcdc/logs/cloudsql.googleapis.com%2Fpostgres.log",
  "receiveTimestamp": "2024-06-26T14:59:46.102111106Z"
}

jrgriffiniii commented 2 months ago

A consistent password update needed to be applied to connect to the GCP database server.

jrgriffiniii commented 2 months ago

Clearing the idle connections over psql was necessary to restore service for DataSpace.

jrgriffiniii commented 2 months ago

Following tomorrow morning it shall be determined whether or not this should be reopened.

jrgriffiniii commented 1 month ago

Following a meeting this morning led by @kayiwa and @VickieKarasic, I am going to be added to the list of users who receive Monit alerts in order to try and troubleshoot the Postgres sessions which remain idle in the connection pool.

kelynch commented 1 month ago

Adding this to the upcoming sprint in the event that it comes up again during the migration and needs to be worked.

jrgriffiniii commented 1 week ago

This once again occurred during the maintenance window for this morning. I am addressing this right now.

jrgriffiniii commented 1 week ago

Service was restored by restarting DSpace with and invocation of dsrestart, followed by the termination of several idle PostgreSQL connections:

dataspace_prod_db=> select pid,state,datname from pg_stat_activity where datname = 'dataspace_prod_db';
   pid   | state  |      datname
---------+--------+-------------------
 1356835 | idle   | dataspace_prod_db
 1357105 | active | dataspace_prod_db
 1356836 | idle   | dataspace_prod_db
 1356837 | idle   | dataspace_prod_db
 1357145 | idle   | dataspace_prod_db
 1357188 | idle   | dataspace_prod_db
 1357112 | active | dataspace_prod_db
(7 rows)

dataspace_prod_db=> select pg_terminate_backend(1357188);
 pg_terminate_backend
----------------------
 t
(1 row)