puppetlabs / puppetdb

Centralized Puppet Storage
http://docs.puppetlabs.com/puppetdb
Apache License 2.0
301 stars 226 forks source link

Garbage Collector is very agressive and kills some queries without waiting. #3977

Open sharewax opened 6 months ago

sharewax commented 6 months ago

Describe the Bug

I observe in logs of puppetdb that GC kills some connections to postgres. We have main node located in EU with settings (currently it runs GC every day, was 1 hour and for reproduce more frequient I can set it to less value):

node-ttl = 13d
node-purge-ttl = 14d
node-purge-gc-batch-limit = 50
gc-interval = 1440

and more nodes puppetdb which the most far located in Singapore with RTT 183 ms with settings to the same database in EU

gc-interval = 0
migrate = false
maximum-pool-size = 30
conn-max-age = 27

So the issue - I observe drops of connection to postgres by GC even without some respectful waiting during phase Starting sweep of stale reports and resource events:

69848-2024-05-09T05:53:43.280Z INFO  [p.p.c.services] Starting sweep of stale nodes (threshold: 13 days)
69850-2024-05-09T05:53:43.348Z INFO  [p.p.c.services] Finished sweep of stale nodes (threshold: 13 days)
69851-2024-05-09T05:53:43.353Z INFO  [p.p.c.services] Starting purge deactivated and expired nodes (threshold: 14 days)
69854-2024-05-09T05:53:43.867Z INFO  [p.p.c.services] Finished purge deactivated and expired nodes (threshold: 14 days)
69855-2024-05-09T05:53:43.889Z INFO  [p.p.c.services] Starting sweep of stale reports (threshold: 14 days) and resource events (threshold: 14 days)
69857:2024-05-09T05:53:44.112Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [21453]
69858:2024-05-09T05:53:44.319Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [23047]
69859:2024-05-09T05:53:44.523Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [16534]
69860-2024-05-09T05:53:44.530Z INFO  [p.p.c.services] Finished sweep of stale reports (threshold: 14 days) and resource events (threshold: 14 days)
69861-2024-05-09T05:53:44.534Z INFO  [p.p.c.services] Starting gc packages
69862-2024-05-09T05:53:44.538Z INFO  [p.p.c.services] Finished gc packages
69863-2024-05-09T05:53:44.540Z INFO  [p.p.c.services] Starting sweep of stale catalog data

So it kills 3 connection in time less than half of second (300 ms according log). this connections are actively sent something from remote location (grep by pid is below):

{"timestamp":"2024-05-09 05:53:43.982 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2438,"ps":"BIND","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"LOG","message":"duration: 0.085 ms  bind S_15: SELECT producer_timestamp FROM factsets WHERE certname = $1 ORDER BY producer_timestamp DESC LIMIT $2","detail":"parameters: $1 = 'sg2-sdb-a11.be.core.pw', $2 = '1'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":-7343572058853576852}
{"timestamp":"2024-05-09 05:53:43.982 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2439,"ps":"SELECT","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"LOG","message":"duration: 0.096 ms  execute S_15: SELECT producer_timestamp FROM factsets WHERE certname = $1 ORDER BY producer_timestamp DESC LIMIT $2","detail":"parameters: $1 = 'sg2-sdb-a11.be.core.pw', $2 = '1'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:53:44.106 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2440,"ps":"idle in transaction","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:51:55.377 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1215,"ps":"BIND","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"LOG","message":"duration: 0.097 ms  bind <unnamed>: insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","detail":"parameters: $1 = 'sg2-sl-b60.be.core.pw', $2 = '2024-05-09 05:51:48.896+00', $3 = '2024-05-09 05:51:48.896+00'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:51:56.377 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1216,"ps":"INSERT waiting","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"LOG","message":"process 23047 still waiting for ShareLock on transaction 115885248 after 1000.154 ms","detail":"Process holding the lock: 16534. Wait queue: 23047.","context":"while locking tuple (33,96) in relation \"certnames\"","statement":"insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":8056160586392606418}
{"timestamp":"2024-05-09 05:53:44.319 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1217,"ps":"INSERT waiting","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","context":"while locking tuple (33,96) in relation \"certnames\"","statement":"insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":8056160586392606418}
{"timestamp":"2024-05-09 05:53:44.378 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4582,"ps":"BIND","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"LOG","message":"duration: 0.034 ms  bind S_30: INSERT INTO fact_paths ( path, value_type_id, path_array, name, depth ) VALUES ( $1, $2, $3, $4, $5 ) ON CONFLICT do nothing\nRETURNING *","detail":"parameters: $1 = 'processors#~models#~4', $2 = '0', $3 = '{processors,models,4}', $4 = 'processors', $5 = '2'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:53:44.378 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4583,"ps":"INSERT","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"LOG","message":"duration: 0.101 ms  execute S_30: INSERT INTO fact_paths ( path, value_type_id, path_array, name, depth ) VALUES ( $1, $2, $3, $4, $5 ) ON CONFLICT do nothing\nRETURNING *","detail":"parameters: $1 = 'processors#~models#~4', $2 = '0', $3 = '{processors,models,4}', $4 = 'processors', $5 = '2'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":4553857851137822237}
{"timestamp":"2024-05-09 05:53:44.522 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4584,"ps":"idle in transaction","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":4553857851137822237}

Expected Behavior

I suppose that GC should wait for something and drops queries only of something really terrible happens. Maybe be more patient or expand in logs why this connections to postgres should be terminated during GC. As for my side GC shouldn't be destructive.

Steps to Reproduce

I can reproduce in with every run GC with 2 puppetdb which are located quite far away (for example add by tc some delay between nodes)

Environment

sharewax commented 6 months ago

Some metrics from puppetdb: Singapure:

# curl -s -G -H 'Accept: application/json'  --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem --cert /etc/puppetlabs/puppet/ssl/certs/$HOSTNAME.pem --key /etc/puppetlabs/puppet/ssl/private_keys/$HOSTNAME.pem https://$HOSTNAME:8081/metrics/v2/read/puppetlabs.puppetdb.database:name=PDBWritePool.pool.Wait | jq .
{
  "request": {
    "mbean": "puppetlabs.puppetdb.database:name=PDBWritePool.pool.Wait",
    "type": "read"
  },
  "value": {
    "75thPercentile": 183.023337,
    "Mean": 124.2630601327565,
    "StdDev": 85.63176999191273,
    "98thPercentile": 185.45835,
    "RateUnit": "events/second",
    "95thPercentile": 184.71381,
    "99thPercentile": 187.21545799999998,
    "Max": 190.855177,
    "Count": 220904,
    "FiveMinuteRate": 0.8528680707338815,
    "50thPercentile": 182.96111299999998,
    "MeanRate": 0.7860066475101025,
    "Min": 0.0010249999999999999,
    "OneMinuteRate": 0.9908548663105827,
    "DurationUnit": "milliseconds",
    "999thPercentile": 187.21545799999998,
    "FifteenMinuteRate": 0.7954050920023567
  },
  "timestamp": 1715284731,
  "status": 200
}

EU (local):

# curl -s -G -H 'Accept: application/json'  --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem --cert /etc/puppetlabs/puppet/ssl/certs/$HOSTNAME.pem --key /etc/puppetlabs/puppet/ssl/private_keys/$HOSTNAME.pem https://$HOSTNAME:8081/metrics/v2/read/puppetlabs.puppetdb.database:name=PDBWritePool.pool.Wait | jq .
{
  "request": {
    "mbean": "puppetlabs.puppetdb.database:name=PDBWritePool.pool.Wait",
    "type": "read"
  },
  "value": {
    "75thPercentile": 0.29408,
    "Mean": 0.13560441251423003,
    "StdDev": 0.24515121048637722,
    "98thPercentile": 0.431164,
    "RateUnit": "events/second",
    "95thPercentile": 0.39854,
    "99thPercentile": 0.451405,
    "Max": 8.078914,
    "Count": 457080,
    "FiveMinuteRate": 3.2667841655103405,
    "50thPercentile": 0.0018239999999999999,
    "MeanRate": 3.332225393092325,
    "Min": 0.00048199999999999995,
    "OneMinuteRate": 3.1529805478178483,
    "DurationUnit": "milliseconds",
    "999thPercentile": 2.589778,
    "FifteenMinuteRate": 3.3267313956147135
  },
  "timestamp": 1715284792,
  "status": 200
}
austb commented 6 months ago

What Postgres version are you using?

sharewax commented 6 months ago

15.6

austb commented 6 months ago

Ok, that's a new enough version that you are getting the best behavior we have (PG 14 or greater).

Some background: GC of reports is very fast now that we have partitioned tables because we can drop whole tables at a time, but the database locks it acquires to do that are more blocking than deleting a row. The result of this is that once PuppetDB's GC has started to wait for a lock, we can block both commands and queries that come in afterwards until both the in-flight queries and the GC are complete. Since we don't know how long all in-flight user queries will take, we could be blocking future commands and queries indefinitely. To solve that we have chosen to rely on this "bulldozer" that allows us to get the heavily locking GC work done as quickly as possible.

Both queries and command processing automatically retry failed PG operations, so just because the error shows up in the PG log does not mean that an attempt to query actually failed. If the user query fails, we would consider that a bug, and you should also see the error logged in PuppetDB's logs then.

sharewax commented 6 months ago

Yes, on Singapure puppetdb I see the next things

zcat puppetdb-2024-05-09.0.log.gz | rg FATAL -A 200
org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
    at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
    at clojure.java.jdbc$execute_query_with_params.invokeStatic(jdbc.clj:1090)
    at clojure.java.jdbc$execute_query_with_params.invoke(jdbc.clj:1084)
    at clojure.java.jdbc$db_query_with_resultset_STAR_.invokeStatic(jdbc.clj:1106)
    at clojure.java.jdbc$db_query_with_resultset_STAR_.invoke(jdbc.clj:1093)
    at clojure.java.jdbc$db_query_with_resultset.invokeStatic(jdbc.clj:1140)
    at clojure.java.jdbc$db_query_with_resultset.invoke(jdbc.clj:1119)
    at clojure.java.jdbc$db_query_with_resultset.invokeStatic(jdbc.clj:1127)
    at clojure.java.jdbc$db_query_with_resultset.invoke(jdbc.clj:1119)
    at puppetlabs.puppetdb.jdbc$query_with_resultset.invokeStatic(jdbc.clj:236)
    at puppetlabs.puppetdb.jdbc$query_with_resultset.invoke(jdbc.clj:227)
    at puppetlabs.puppetdb.scf.storage$fn__27708$certname_factset_metadata__27713$fn__27714.invoke(storage.clj:1172)
    at puppetlabs.puppetdb.scf.storage$fn__27708$certname_factset_metadata__27713.invoke(storage.clj:1165)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787$fn__27791$fn__27793.invoke(storage.clj:1268)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:860)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787$fn__27791.invoke(storage.clj:1266)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787.invoke(storage.clj:1260)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115$fn__28119$fn__28121.invoke(storage.clj:1965)
    at puppetlabs.puppetdb.scf.storage.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115$fn__28119.invoke(storage.clj:1962)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115.invoke(storage.clj:1955)
    at puppetlabs.puppetdb.command$do_replace_facts.invokeStatic(command.clj:401)
    at puppetlabs.puppetdb.command$do_replace_facts.invoke(command.clj:399)
    at puppetlabs.puppetdb.command$exec_replace_facts$fn__32666.invoke(command.clj:409)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104$fn__22117.invoke(jdbc.clj:553)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:807)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104.invoke(jdbc.clj:549)
    at puppetlabs.puppetdb.jdbc$retry_sql$attempt__22084.invoke(jdbc.clj:497)
    at puppetlabs.puppetdb.jdbc$retry_sql.invokeStatic(jdbc.clj:507)
    at puppetlabs.puppetdb.jdbc$retry_sql.invoke(jdbc.clj:488)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invokeStatic(jdbc.clj:546)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invoke(jdbc.clj:541)
    at puppetlabs.puppetdb.command$exec_replace_facts.invokeStatic(command.clj:406)
    at puppetlabs.puppetdb.command$exec_replace_facts.invoke(command.clj:403)
    at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:525)
    at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:510)
    at puppetlabs.puppetdb.command$attempt_exec_command$fn__32843.invoke(command.clj:611)
    at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32832.invoke(command.clj:591)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:589)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:587)
    at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:609)
    at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:605)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899$fn__32900.invoke(command.clj:732)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899.invoke(command.clj:729)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:26)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:15)
    at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:726)
    at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:712)
    at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:819)
    at puppetlabs.puppetdb.command$process_message.invoke(command.clj:775)
    at puppetlabs.puppetdb.command$message_handler$fn__32933.invoke(command.clj:853)
    at puppetlabs.puppetdb.threadpool$dochan$fn__32514$fn__32515.invoke(threadpool.clj:115)
    at puppetlabs.puppetdb.threadpool$gated_execute$fn__32475.invoke(threadpool.clj:68)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2024-05-09T05:53:44.412Z WARN  [c.z.h.p.ProxyConnection] PDBWritePool - Connection org.postgresql.jdbc.PgConnection@2ecb6f29 marked as broken because of SQLSTATE(57P01), ErrorCode(0)
java.sql.BatchUpdateException: Batch entry 0 insert into certnames (certname) values ('sg2-sl-b60.be.core.pw')  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < '2024-05-09 05:51:48.896+00' or certnames.expired < '2024-05-09 05:51:48.896+00') was aborted: FATAL: terminating connection due to administrator command
  Where: while locking tuple (33,96) in relation "certnames"  Call getNextException to see other errors in the batch.
    at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:165)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2367)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:560)
    at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:893)
    at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:916)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1684)
    at com.zaxxer.hikari.pool.ProxyStatement.executeBatch(ProxyStatement.java:128)
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeBatch(HikariProxyPreparedStatement.java)
    at clojure.java.jdbc$execute_batch.invokeStatic(jdbc.clj:598)
    at clojure.java.jdbc$execute_batch.invoke(jdbc.clj:591)
    at clojure.java.jdbc$db_do_execute_prepared_statement$fn__20458.invoke(jdbc.clj:1057)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:860)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:789)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at clojure.java.jdbc$db_do_execute_prepared_statement.invokeStatic(jdbc.clj:1056)
    at clojure.java.jdbc$db_do_execute_prepared_statement.invoke(jdbc.clj:1042)
    at clojure.java.jdbc$db_do_prepared.invokeStatic(jdbc.clj:1080)
    at clojure.java.jdbc$db_do_prepared.invoke(jdbc.clj:1060)
    at puppetlabs.puppetdb.jdbc$do_prepared.invokeStatic(jdbc.clj:72)
    at puppetlabs.puppetdb.jdbc$do_prepared.doInvoke(jdbc.clj:67)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at puppetlabs.puppetdb.scf.storage$fn__28023$maybe_activate_node_BANG___28028$fn__28029.invoke(storage.clj:1898)
    at puppetlabs.puppetdb.scf.storage$fn__28023$maybe_activate_node_BANG___28028.invoke(storage.clj:1890)
    at puppetlabs.puppetdb.command$do_replace_catalog.invokeStatic(command.clj:344)
    at puppetlabs.puppetdb.command$do_replace_catalog.invoke(command.clj:343)
    at puppetlabs.puppetdb.command$exec_replace_catalog$fn__32626.invoke(command.clj:353)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104$fn__22117.invoke(jdbc.clj:553)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:807)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104.invoke(jdbc.clj:549)
    at puppetlabs.puppetdb.jdbc$retry_sql$attempt__22084.invoke(jdbc.clj:497)
    at puppetlabs.puppetdb.jdbc$retry_sql.invokeStatic(jdbc.clj:507)
    at puppetlabs.puppetdb.jdbc$retry_sql.invoke(jdbc.clj:488)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invokeStatic(jdbc.clj:546)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invoke(jdbc.clj:541)
    at puppetlabs.puppetdb.command$exec_replace_catalog.invokeStatic(command.clj:350)
    at puppetlabs.puppetdb.command$exec_replace_catalog.invoke(command.clj:347)
    at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:525)
    at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:510)
    at puppetlabs.puppetdb.command$attempt_exec_command$fn__32843.invoke(command.clj:611)
    at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32832.invoke(command.clj:591)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:589)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:587)
    at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:609)
    at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:605)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899$fn__32900.invoke(command.clj:732)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899.invoke(command.clj:729)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:26)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:15)
    at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:726)
    at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:712)
    at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:819)
    at puppetlabs.puppetdb.command$process_message.invoke(command.clj:775)
    at puppetlabs.puppetdb.command$message_handler$fn__32933.invoke(command.clj:853)
    at puppetlabs.puppetdb.threadpool$dochan$fn__32514$fn__32515.invoke(threadpool.clj:115)
    at puppetlabs.puppetdb.threadpool$gated_execute$fn__32475.invoke(threadpool.clj:68)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
  Where: while locking tuple (33,96) in relation "certnames"
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
    ... 67 common frames omitted
2024-05-09T05:53:44.617Z WARN  [c.z.h.p.ProxyConnection] PDBWritePool - Connection org.postgresql.jdbc.PgConnection@78c2b0c0 marked as broken because of SQLSTATE(57P01), ErrorCode(0)
org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
    at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:152)
    at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
    at clojure.java.jdbc$db_do_execute_prepared_return_keys$exec_and_return_keys__20429.invoke(jdbc.clj:979)
    at clojure.java.jdbc$db_do_execute_prepared_return_keys.invokeStatic(jdbc.clj:1007)
    at clojure.java.jdbc$db_do_execute_prepared_return_keys.invoke(jdbc.clj:963)
    at clojure.java.jdbc$db_do_prepared_return_keys.invokeStatic(jdbc.clj:1038)
    at clojure.java.jdbc$db_do_prepared_return_keys.invoke(jdbc.clj:1015)
    at puppetlabs.puppetdb.jdbc$multi_insert_helper$fn__21900.invoke(jdbc.clj:161)
    at clojure.core$map$fn__5935.invoke(core.clj:2772)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:51)
    at clojure.lang.Cons.next(Cons.java:39)
    at clojure.lang.RT.next(RT.java:713)
    at clojure.core$next__5451.invokeStatic(core.clj:64)
    at clojure.core$dorun.invokeStatic(core.clj:3143)
    at clojure.core$doall.invokeStatic(core.clj:3149)
    at clojure.core$doall.invoke(core.clj:3149)
    at puppetlabs.puppetdb.jdbc$multi_insert_helper.invokeStatic(jdbc.clj:161)
    at puppetlabs.puppetdb.jdbc$multi_insert_helper.invoke(jdbc.clj:155)
    at puppetlabs.puppetdb.jdbc$insert_helper$fn__21903.invoke(jdbc.clj:170)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:860)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:789)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.jdbc$insert_helper.invokeStatic(jdbc.clj:170)
    at puppetlabs.puppetdb.jdbc$insert_helper.invoke(jdbc.clj:164)
    at puppetlabs.puppetdb.jdbc$insert_rows_BANG_.invokeStatic(jdbc.clj:189)
    at puppetlabs.puppetdb.jdbc$insert_rows_BANG_.invoke(jdbc.clj:173)
    at puppetlabs.puppetdb.jdbc$insert_multi_BANG_.invokeStatic(jdbc.clj:207)
    at puppetlabs.puppetdb.jdbc$insert_multi_BANG_.invoke(jdbc.clj:197)
    at puppetlabs.puppetdb.scf.storage$realize_paths$fn__27738.invoke(storage.clj:1216)
    at clojure.core$map$fn__5935.invoke(core.clj:2772)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:51)
    at clojure.lang.RT.seq(RT.java:535)
    at clojure.core$seq__5467.invokeStatic(core.clj:139)
    at clojure.core$dorun.invokeStatic(core.clj:3134)
    at clojure.core$dorun.invoke(core.clj:3134)
    at puppetlabs.puppetdb.scf.storage$realize_paths.invokeStatic(storage.clj:1212)
    at puppetlabs.puppetdb.scf.storage$realize_paths.invoke(storage.clj:1203)
    at puppetlabs.puppetdb.scf.storage$realize_paths.invokeStatic(storage.clj:1207)
    at puppetlabs.puppetdb.scf.storage$realize_paths.invoke(storage.clj:1203)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787$fn__27791$fn__27793.invoke(storage.clj:1310)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:860)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787$fn__27791.invoke(storage.clj:1266)
    at puppetlabs.puppetdb.scf.storage$fn__27782$update_facts_BANG___27787.invoke(storage.clj:1260)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115$fn__28119$fn__28121.invoke(storage.clj:1965)
    at puppetlabs.puppetdb.scf.storage.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115$fn__28119.invoke(storage.clj:1962)
    at puppetlabs.puppetdb.scf.storage$fn__28110$replace_facts_BANG___28115.invoke(storage.clj:1955)
    at puppetlabs.puppetdb.command$do_replace_facts.invokeStatic(command.clj:401)
    at puppetlabs.puppetdb.command$do_replace_facts.invoke(command.clj:399)
    at puppetlabs.puppetdb.command$exec_replace_facts$fn__32666.invoke(command.clj:409)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104$fn__22117.invoke(jdbc.clj:553)
    at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:807)
    at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:776)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection$fn__22104.invoke(jdbc.clj:549)
    at puppetlabs.puppetdb.jdbc$retry_sql$attempt__22084.invoke(jdbc.clj:497)
    at puppetlabs.puppetdb.jdbc$retry_sql.invokeStatic(jdbc.clj:507)
    at puppetlabs.puppetdb.jdbc$retry_sql.invoke(jdbc.clj:488)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invokeStatic(jdbc.clj:546)
    at puppetlabs.puppetdb.jdbc$retry_with_monitored_connection.invoke(jdbc.clj:541)
    at puppetlabs.puppetdb.command$exec_replace_facts.invokeStatic(command.clj:406)
    at puppetlabs.puppetdb.command$exec_replace_facts.invoke(command.clj:403)
    at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:525)
    at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:510)
    at puppetlabs.puppetdb.command$attempt_exec_command$fn__32843.invoke(command.clj:611)
    at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32832.invoke(command.clj:591)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:589)
    at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:587)
    at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:609)
    at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:605)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899$fn__32900.invoke(command.clj:732)
    at puppetlabs.puppetdb.command$process_cmd$fn__32899.invoke(command.clj:729)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317$fn__26318.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__26316$fn__26317.invoke(metrics.clj:23)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:26)
    at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:15)
    at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:726)
    at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:712)
    at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:819)
    at puppetlabs.puppetdb.command$process_message.invoke(command.clj:775)
    at puppetlabs.puppetdb.command$message_handler$fn__32933.invoke(command.clj:853)
    at puppetlabs.puppetdb.threadpool$dochan$fn__32514$fn__32515.invoke(threadpool.clj:115)
    at puppetlabs.puppetdb.threadpool$gated_execute$fn__32475.invoke(threadpool.clj:68)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

We also have an issue about if a lot of resources are uploaded to puppetdb (discussion on Slack), but currently I solved it by reducing frequeintly of GC (once a day instead once a hour) and switched puppet agent to puppetserver in Luxemburg (apply catalog has been changed from 90 seconds to 110 but it solves the issue with uploading catalog into puppetdb).

I can provide more logs from postgres or puppetdb in this case. In DB we saw some Locks

[root@ed-sl-b166 pg_log]# rg 'ShareLock|ExclusiveLock' postgresql-2024-05-10_000000.json -c
2536
[root@ed-sl-b166 pg_log]# rg 'ShareLock|ExclusiveLock' postgresql-2024-05-09_000000.json -c
1607
austb commented 6 months ago

Ah sorry, forgot about that log spew from HikariCP. Those are expected warnings from our connection pool because to terminate the query consistently we have to close the entire connection.

The error that would indicate a command failed would be

Exceeded max attempts ({2}) for {3} {4}

I would need to double check our retry semantics to be 100% sure, but I think it would also be problematic if you saw

Retrying after attempt {2} for {3}, due to: {4} {5}

because that would mean our "fast retry" failed as well.

sharewax commented 6 months ago

Ah, so if I have a WARN - it will be retried even in logs I see FATAL. So main is to have ERROR and in this case it will be. So, if I turn back my hosts with a big amount of resources back to puppetserver and puppetdb with high RTT - I can reproduce the issue, because now I didn't see errors, but in previous days it was like https://gist.github.com/sharewax/171b16848dedf65f30b64862634f2fdc

like host with big amount of resouces in catalog can't update during time and GC kills every hour this query.