Closed: tatzlwurm2 closed this issue 9 months ago.
So this bug did not happen in 2.5.0 but started with 2.5.2? Did you try 2.5.1 too? Can you show the configuration of your distributed hypertable(s)? What triggers this behaviour? How can it be reproduced?
It started in 2.5.0, and I tried 2.5.2 to see if it would fix it. It appears to be a bunch of deletes that triggers this issue. The deletes look like they are completing normally, yet the system crashes. I haven't been able to reproduce this on my own.
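The listing below appears to be output from the timescaledb_information.hypertables view; a query along these lines (an assumption, it is not shown in the report) reproduces it:

SELECT * FROM timescaledb_information.hypertables;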
hypertable_schema | hypertable_name | owner | num_dimensions | num_chunks | compression_enabled | is_distributed | replication_factor | data_nodes | tablespaces
-------------------+-----------------+-------+----------------+------------+---------------------+----------------+--------------------+-----------------------------------------------+-------------
cds_s | incidentsdata | cds_u | 2 | 469 | f | t | 2 | {cds_node_1,cds_node_2,cds_node_3,cds_node_4} |
cds_s | rtpmdata | cds_u | 2 | 469 | f | t | 2 | {cds_node_1,cds_node_2,cds_node_3,cds_node_4} |
cds_s | statsdata | cds_u | 2 | 743 | f | t | 2 | {cds_node_1,cds_node_2,cds_node_3,cds_node_4} |
cds_s | otherdata | cds_u | 2 | 87 | f | t | 2 | {cds_node_1,cds_node_2,cds_node_3,cds_node_4} |
(4 rows)
cdsdb=# \d cds_s.statsdata
Table "cds_s.statsdata"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+-------------------
environment_name | character varying(64) | | not null |
data_timestamp | timestamp without time zone | | not null |
host_name | character varying(64) | | not null |
process_name | character varying(32) | | not null |
random_uniq_ifier | integer | | not null |
update_timestamp | timestamp without time zone | | | CURRENT_TIMESTAMP
data_value | bytea | | |
data_format | text | | | 'ber'::text
Indexes:
"statsdata_pkey" PRIMARY KEY, btree (environment_name, data_timestamp, host_name, process_name, random_uniq_ifier), tablespace "cdsx_md01"
"statsdata_data_timestamp_idx" btree (data_timestamp DESC), tablespace "cdsx_md01"
"statsdata_environment_name_data_timestamp_idx" btree (environment_name, data_timestamp DESC), tablespace "cdsx_md01"
Check constraints:
"stats_data_format_check" CHECK (data_format = ANY (ARRAY['ber'::text, 'json'::text]))
Triggers:
ts_insert_blocker BEFORE INSERT ON statsdata FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Number of child tables: 743 (Use \d+ to list them.)
Tablespace: "cdsd_md01"
Given this table definition:
SELECT create_distributed_hypertable('statsdata', 'data_timestamp','environment_name',chunk_time_interval => INTERVAL '1 day');
The memory leak appears to be triggered by these types of statements:
2022-02-28 01:11:32 EST [2094]: [1208-1] db=cdsdb,user=cds_u LOG: 00000: duration: 417.064 ms bind S_2: delete from incidentsdata where environment_name=$1 and data_timestamp >= $2 and data_timestamp <= $3
2022-02-28 01:11:32 EST [2094]: [1209-1] db=cdsdb,user=cds_u DETAIL: parameters: $1 = 'XXX_prod_audit-statistics-webapp', $2 = '1969-12-31 19:00:00', $3 = '2021-11-29 23:59:59.999'
2022-02-28 01:11:32 EST [2094]: [1210-1] db=cdsdb,user=cds_u LOCATION: exec_bind_message, postgres.c:2025
2022-02-28 01:11:34 EST [2094]: [1211-1] db=cdsdb,user=cds_u LOG: 00000: duration: 900.280 ms bind S_3: delete from statsdata where environment_name=$1 and data_timestamp >= $2 and data_timestamp <= $3
2022-02-28 01:11:34 EST [2094]: [1212-1] db=cdsdb,user=cds_u DETAIL: parameters: $1 = 'XXX_prod_audit-statistics-webapp', $2 = '1969-12-31 19:00:00', $3 = '2021-11-29 23:59:59.999'
....
2022-02-28 04:19:06 EST [112706]: [240-1] db=,user= LOG: 00000: server process (PID 2094) was terminated by signal 9: Killed
From the OS log:
[Mon Feb 28 04:13:51 2022] Killed process 2094 (postgres), UID 57887, total-vm:19299836kB, anon-rss:14388556kB, file-rss:0kB, shmem-rss:17200kB
The response times look fine, but these statements appear to be the cause of the leak. It doesn't make sense that there is such a big time gap between the last delete and the OOM killer terminating that process. Is there a better way to issue deletes based on the above?
So versions before 2.5.0 did not show this behaviour? Which version did you use before 2.5.0? There is indeed a better way to delete data in hypertables, drop_chunks, but it has some constraints, so you might have to fall back on a normal DELETE depending on your situation. https://docs.timescale.com/api/latest/hypertable/drop_chunks/
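For reference, a minimal sketch of what the drop_chunks approach could look like for the statsdata table in this report (the cutoff values below are illustrative assumptions, not values from this thread). Note that drop_chunks removes whole chunks based on the time dimension only, so it cannot filter on environment_name:

-- Drop every chunk of cds_s.statsdata whose data is entirely older than the cutoff.
-- Whole chunks only; no per-row or per-environment filtering.
SELECT drop_chunks('cds_s.statsdata', older_than => INTERVAL '90 days');

-- Or with an explicit timestamp cutoff:
SELECT drop_chunks('cds_s.statsdata', older_than => TIMESTAMP '2021-11-30 00:00:00');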
2.5.0 is the version we started with, as we are new to TimescaleDB. I did look into drop_chunks earlier today, but we need to delete based on one of the dimensions; i.e., the table's definition is like this:
SELECT create_distributed_hypertable('statsdata', 'data_timestamp','environment_name',chunk_time_interval => INTERVAL '1 day');
The delete that causes the issue is this one:
2022-03-01 04:19:17 EST [112706]: [258-1] db=,user= DETAIL: Failed process was running: delete from statsdata where environment_name=$1 and data_timestamp >= $2 and data_timestamp <= $3
So if we could set the policy on environment_name and timestamp, then the memory issue most likely would not be there.
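Until such a policy or a fix exists, one possible workaround, sketched below under the assumption (consistent with the maintainer note later in this thread that the memory grows until the query ends) that the accumulated memory is released when each statement finishes, is to split the large DELETE into smaller time windows. The bounds and environment name come from the logged parameters above; the 7-day window size is an illustrative assumption:

DO $$
DECLARE
    -- Bounds taken from the logged DELETE parameters; the 7-day step is an assumption.
    win_start timestamp := '1969-12-31 19:00:00';
    win_end   timestamp := '2021-11-30 00:00:00';
    step      interval  := '7 days';
BEGIN
    WHILE win_start < win_end LOOP
        -- Each iteration deletes at most one window's worth of rows.
        DELETE FROM cds_s.statsdata
        WHERE environment_name = 'XXX_prod_audit-statistics-webapp'
          AND data_timestamp >= win_start
          AND data_timestamp <  LEAST(win_start + step, win_end);
        win_start := win_start + step;
    END LOOP;
END $$;

Issuing each window as a separate client-side statement and committing between batches would behave similarly while also keeping individual transactions small.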
If it helps, the leak appears to be related to the size of the distributed hypertable. I had to purge some data where the leak was occurring, due to corruption last Friday, and the leak hasn't occurred since in that environment. I also haven't observed this behavior in any of our smaller environments; it only occurs with large data sets.
Any updates? Still happening:
[Thu Feb 10 05:07:07 2022] Killed process 19759 (postgres), UID 57887, total-vm:19661448kB, anon-rss:14380492kB, file-rss:0kB, shmem-rss:17456kB
[Fri Feb 11 04:16:54 2022] Killed process 92706 (postgres), UID 57887, total-vm:19726820kB, anon-rss:14376796kB, file-rss:0kB, shmem-rss:14560kB
[Sat Feb 12 04:16:39 2022] Killed process 62885 (postgres), UID 57887, total-vm:19850512kB, anon-rss:14514244kB, file-rss:0kB, shmem-rss:14740kB
[Sun Feb 13 04:17:01 2022] Killed process 21477 (postgres), UID 57887, total-vm:19838228kB, anon-rss:14529932kB, file-rss:0kB, shmem-rss:14380kB
[Mon Feb 14 04:26:28 2022] Killed process 130575 (postgres), UID 57887, total-vm:20147584kB, anon-rss:14831064kB, file-rss:0kB, shmem-rss:15000kB
[Tue Feb 15 04:16:27 2022] Killed process 95862 (postgres), UID 57887, total-vm:19567520kB, anon-rss:14220152kB, file-rss:0kB, shmem-rss:18860kB
[Wed Feb 16 04:26:08 2022] Killed process 58079 (postgres), UID 57887, total-vm:20150216kB, anon-rss:14829152kB, file-rss:0kB, shmem-rss:25160kB
[Thu Feb 17 04:20:20 2022] Killed process 20071 (postgres), UID 57887, total-vm:19717384kB, anon-rss:14373256kB, file-rss:0kB, shmem-rss:24752kB
[Fri Feb 18 04:16:22 2022] Killed process 106579 (postgres), UID 57887, total-vm:19388164kB, anon-rss:14055676kB, file-rss:0kB, shmem-rss:18712kB
[Fri Feb 18 19:16:12 2022] Killed process 69438 (postgres), UID 57887, total-vm:19529264kB, anon-rss:14243780kB, file-rss:24kB, shmem-rss:14332kB
[Sat Feb 19 04:16:06 2022] Killed process 75229 (postgres), UID 57887, total-vm:19361956kB, anon-rss:13983328kB, file-rss:0kB, shmem-rss:21944kB
[Sun Feb 20 04:16:07 2022] Killed process 29523 (postgres), UID 57887, total-vm:19531512kB, anon-rss:14259448kB, file-rss:0kB, shmem-rss:20396kB
[Mon Feb 21 04:25:46 2022] Killed process 837 (postgres), UID 57887, total-vm:20051852kB, anon-rss:14740816kB, file-rss:0kB, shmem-rss:25692kB
[Tue Feb 22 04:15:49 2022] Killed process 86831 (postgres), UID 57887, total-vm:19338472kB, anon-rss:14277928kB, file-rss:0kB, shmem-rss:21348kB
[Wed Feb 23 04:15:41 2022] Killed process 57688 (postgres), UID 57887, total-vm:19421312kB, anon-rss:14094148kB, file-rss:24kB, shmem-rss:19220kB
[Thu Feb 24 04:19:04 2022] Killed process 19800 (postgres), UID 57887, total-vm:19506544kB, anon-rss:14420172kB, file-rss:0kB, shmem-rss:5172kB
[Fri Feb 25 04:12:16 2022] Killed process 110640 (postgres), UID 57887, total-vm:19338936kB, anon-rss:14268400kB, file-rss:0kB, shmem-rss:10268kB
[Sat Feb 26 04:15:14 2022] Killed process 78107 (postgres), UID 57887, total-vm:19540912kB, anon-rss:14439564kB, file-rss:0kB, shmem-rss:15644kB
[Sun Feb 27 04:11:12 2022] Killed process 39510 (postgres), UID 57887, total-vm:19151292kB, anon-rss:14152452kB, file-rss:0kB, shmem-rss:5084kB
[Mon Feb 28 04:13:51 2022] Killed process 2094 (postgres), UID 57887, total-vm:19299836kB, anon-rss:14388556kB, file-rss:0kB, shmem-rss:17200kB
[Tue Mar 1 04:13:56 2022] Killed process 92115 (postgres), UID 57887, total-vm:19144768kB, anon-rss:14331376kB, file-rss:0kB, shmem-rss:14368kB
[Wed Mar 2 04:12:36 2022] Killed process 58847 (postgres), UID 57887, total-vm:19529328kB, anon-rss:14842608kB, file-rss:0kB, shmem-rss:13052kB
[Thu Mar 3 04:16:31 2022] Killed process 27628 (postgres), UID 57887, total-vm:19712380kB, anon-rss:14869248kB, file-rss:0kB, shmem-rss:9924kB
[Fri Mar 4 04:16:25 2022] Killed process 127062 (postgres), UID 57887, total-vm:19813068kB, anon-rss:14882476kB, file-rss:0kB, shmem-rss:7396kB
Potentially related to: https://github.com/timescale/timescaledb/issues/3862
Note for maintainers: the distributed tables do leak some memory into the ExecutorState context, so it's going to grow until the query ends. See fdw_exec_foreign_update_or_delete -> async_request_set_wait_any_result, async_request_send_prepared_stmt_with_params. Arguably they could allocate this memory on the per-tuple context, or free it. I guess it's a bug.
In 2.13.0 we announced the deprecation of multi-node, and it will be completely removed in the upcoming 2.14.0.
What type of bug is this?
Crash
What subsystems and features are affected?
Access node
What happened?
We have an application talking to PostgreSQL 13.5 (on Red Hat Enterprise Linux Server release 7.9 (Maipo)) with the following extensions:
We have this configuration deployed to dev/nit/nat and to production R1, R2, and R3. In R1, R2, and R3 we are now experiencing the following:
R1
R2
R3
The last line in each of the outputs above is from after I upgraded the extension from 2.5.0 to 2.5.2.
TimescaleDB version affected
2.5.2
PostgreSQL version used
13.5
What operating system did you use?
Red Hat Enterprise Linux Server release 7.9 (Maipo)
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?