timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: OOM killer invoked on simple delete query #4248

Closed. BilalS10 closed this issue 2 years ago.

BilalS10 commented 2 years ago

What type of bug is this?

Crash

What subsystems and features are affected?

Other

What happened?

The OOM killer is invoked whenever I run consecutive delete queries on TimescaleDB. On further investigation I found that two postgres processes are consuming around 8 GB of memory, so the kernel kills one of them (see the kernel log entries for processes "24835" and "24876"). I have attached the kernel logs as well as the TimescaleDB Docker logs. The delete query I am running removes around 50k points spread across 10 years.

AWS instance configuration: 4 vCPUs, 16 GB RAM, 512 GB disk

TimescaleDB configuration: chunk size -> 1 day

PostgreSQL configuration:

shared_buffers = 2GB
work_mem = 10485kB
maintenance_work_mem = 1GB
effective_io_concurrency = 256
max_worker_processes = 13
max_parallel_workers_per_gather = 1
max_parallel_workers = 2
max_wal_size = 1GB
min_wal_size = 80MB
effective_cache_size = 6GB
autovacuum_max_workers = 10
autovacuum_naptime = 10
max_locks_per_transaction = 1024

TimescaleDB version affected

2.5.2

PostgreSQL version used

13.5

What operating system did you use?

Debian GNU/Linux 10 (buster)

What installation method did you use?

Docker

What platform did you run on?

Amazon Web Services (AWS)

Relevant log output and stack trace

Kernel Logs :

Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290876.896288] Tasks state (memory values in pages):
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290876.901568] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.182273] [   2905]   101  2905     9830      813    73728        0             0 nginx
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.190550] [  24835]    70 24835  1825441  1762470 14548992        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.199584] [  24876]    70 24876  1514675  1456875 12034048        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.209052] [  24887]     0 24887     1932       63    57344        0             0 cron
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.218177] [  24888]    70 24888    65420     1817   180224        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.226878] [  24889]    70 24889    65895     2047   188416        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.235222] [  24891]    70 24891    65430     1832   176128        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.243376] [  24892]    70 24892    65367     1696   184320        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.251879] [  24893]    70 24893    65367     1541   167936        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.260383] [  24894]     0 24894      596       31    49152        0             0 dhclient-script
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.326065] [  24895]    70 24895    65360     1212   163840        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.335250] [  24896]    70 24896    65274     1136   159744        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.344474] [  24897]     0 24897     1469       17    49152        0             0 nginx
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.353459] [  24899]     0 24899     1101       11    40960        0             0 certbot
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.362530] [  24900]    70 24900    65258     1080   159744        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.371780] [  24901]     0 24901     2217       33    53248        0             0 sshd
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.380819] [  24902]     0 24902      541       16    40960        0             0 run-parts
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.389888] [  24903]    70 24903    65125      340    94208        0             0 postgres
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.398187] Out of memory: Kill process 24835 (postgres) score 433 or sacrifice child
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.406349] Killed process 24835 (postgres) total-vm:7301764kB, anon-rss:6898160kB, file-rss:0kB, shmem-rss:151720kB
Apr 12 07:34:00 ip-172-31-16-144 kernel: [2290878.747646] oom_reaper: reaped process 24835 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:151720kB

TimescaleDB logs:

timescaledb           | 2022-04-11 20:04:31.570 UTC [4594] HINT:  The most up-to-date version is 2.6.0, the installed version is 2.5.2.
timescaledb           | 2022-04-12 07:06:33.496 UTC [23] LOG:  checkpoints are occurring too frequently (4 seconds apart)
timescaledb           | 2022-04-12 07:06:33.496 UTC [23] HINT:  Consider increasing the configuration parameter "max_wal_size".
timescaledb           | 2022-04-12 07:34:00.671 UTC [1] LOG:  server process (PID 12777) was terminated by signal 9: Killed
timescaledb           | 2022-04-12 07:34:00.671 UTC [1] DETAIL:  Failed process was running: delete from sensor_data where component_id = 125 and ((sensor_id = 77 and sensor_position LIKE '0') OR (sensor_id = 76 and sensor_position LIKE '0')) and time between '2021-01-01 00:00:00.000' and '2031-10-28 18:40:00.000'
timescaledb           | 2022-04-12 07:34:00.768 UTC [1] LOG:  terminating any other active server processes
timescaledb           | 2022-04-12 07:34:00.769 UTC [12793] WARNING:  terminating connection because of crash of another server process
timescaledb           | 2022-04-12 07:34:00.769 UTC [12793] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

How can we reproduce the bug?

Run consecutive delete queries against the database. (I face this issue after around 10-15 queries.)
horzsolt commented 2 years ago

Hi @BilalS10, thanks for describing this issue, which is quite a tricky one. In order to reproduce it we would like to do exactly what you did, so we would need the following:

BilalS10 commented 2 years ago

Sure @horzsolt here are the details

  1. hypertable definition ->

     CREATE TABLE "sensor_data" (
         time             TIMESTAMP WITHOUT TIME ZONE NOT NULL,
         component_id     INTEGER,
         sensor_id        INTEGER,
         sensor_position  TEXT,
         value            DOUBLE PRECISION
     );
     SELECT create_hypertable('sensor_data', 'time', chunk_time_interval => 86400000000);
     CREATE INDEX ON "sensor_data" (component_id, sensor_id, sensor_position, time DESC);

  2. CAGG definition -> No CAGG used.

  3. delete statement -> delete from sensor_data where component_id = 125 and ((sensor_id = 77 and sensor_position LIKE '0') OR (sensor_id = 76 and sensor_position LIKE '0')) and time between '2021-01-01 00:00:00.000' and '2031-10-28 18:40:00.000'

  4. explain output of delete statement -> I have attached the explain output file here: explain-delete-statement.csv

horzsolt commented 2 years ago

Thank you @BilalS10 for sending these details. From the explain output it can be seen that many chunks were involved in the deletion, which may explain the high memory usage.

Can you please try deleting the data using smaller time ranges in the delete statements, for example 1 year instead of 10 years? We would like to check whether this can serve as a workaround until we have investigated the original issue further. Thank you.
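
For illustration, a minimal sketch of such a batched delete as a PL/pgSQL DO block. The predicate values and overall time range are taken from the failing statement in the issue description; the 1-month slice size is an assumption and would need tuning:

```sql
-- Hypothetical batching sketch: delete the same predicate in 1-month slices
-- instead of one 10-year statement.
DO $$
DECLARE
    range_start timestamp := '2021-01-01 00:00:00';
    range_end   timestamp := '2031-10-28 18:40:00';
    step        interval  := interval '1 month';
    cur         timestamp := range_start;
BEGIN
    WHILE cur < range_end LOOP
        DELETE FROM sensor_data
        WHERE component_id = 125
          AND ((sensor_id = 77 AND sensor_position LIKE '0')
            OR (sensor_id = 76 AND sensor_position LIKE '0'))
          AND time >= cur
          AND time <  LEAST(cur + step, range_end);
        RAISE NOTICE 'deleted slice starting at %', cur;
        COMMIT;  -- allowed in a DO block on PG11+ when it runs outside an explicit transaction
        cur := cur + step;
    END LOOP;
END
$$;
```

Committing after each slice keeps the number of chunks scanned and locked per transaction small, which is what this workaround is meant to verify.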

BilalS10 commented 2 years ago

Got it. I'll try this and report back with the findings.

NunoFilipeSantos commented 2 years ago

Hi @BilalS10 ! Do you have any new findings?

BilalS10 commented 2 years ago

Hi @NunoFilipeSantos

I have implemented a solution as discussed above (deleting in smaller batches). I am still testing it on our development server to see whether it works reliably. The initial result does look good when I try a 1-year time range. I will share more details once I have tested it completely with different test cases.

BilalS10 commented 2 years ago

Hi @NunoFilipeSantos

I implemented the solution on one of my test instances and hit the same issue again: the OOM killer was invoked during the delete query, which was deleting data for a range of 4 months. Here are the details:

timescaledb | 2022-05-09 20:04:59.545 UTC [32163] HINT: The most up-to-date version is 2.6.1, the installed version is 2.5.2.
timescaledb | 2022-05-10 19:16:58.195 UTC [1] LOG: server process (PID 16302) was terminated by signal 9: Killed
timescaledb | 2022-05-10 19:16:58.195 UTC [1] DETAIL: Failed process was running: delete from sensor_data where component_id = 6 and ((sensor_id = 35 and sensor_position LIKE '0') OR (sensor_id = 36 and sensor_position LIKE '0') OR (sensor_id = 37 and sensor_position LIKE '0') OR (sensor_id = 38 and sensor_position LIKE '0') OR (sensor_id = 39 and sensor_position LIKE '0') OR (sensor_id = 40 and sensor_position LIKE '0') OR (sensor_id = 41 and sensor_position LIKE '0') OR (sensor_id = 60 and sensor_position LIKE '0') OR (sensor_id = 61 and sensor_position LIKE '0') OR (sensor_id = 62 and sensor_position LIKE '0') OR (sensor_id = 63 and sensor_position LIKE '0') OR (sensor_id = 42 and sensor_position LIKE '0') OR (sensor_id = 42 and sensor_position LIKE '1') OR (sensor_id = 44 and sensor_position LIKE '0') OR (sensor_id = 44 and sensor_position LIKE '1') OR (sensor_id = 44 and sensor_position LIKE '2') OR (sensor_id = 44 and sensor_position LIKE '3')) and time between '2021-07-01 00:00:00.000' and '2021-10-31 23:59:00.000'

The number of rows in the database matching the same predicate is 5931314.

So even though I am deleting in batches, the delete query still fails.

Do you think the number of rows is also playing a part here?

BilalS10 commented 2 years ago

Update: I tried reducing the batch size to months and then weeks, but it still crashes. The only difference was that with 1-week batches the first batch at least went through, and it crashed on the second batch. I have also upgraded the database to the latest version (2.6.1) and the issue persists.

BilalS10 commented 2 years ago

Hi @horzsolt , were you able to investigate this at your end?

BilalS10 commented 2 years ago

Is this issue related? https://github.com/timescale/timescaledb/issues/2486

horzsolt commented 2 years ago

Hi @BilalS10, to be honest I'm struggling to reproduce your case, unfortunately. I used the same setup and settings as you, generated the same amount of data, executed the same delete statements, and didn't see any issues. It would be great if I could test with the same or similar data in the sensor_data table as you have. Do you have any pointers or ideas on how I can generate data very similar to yours? Thank you.

BilalS10 commented 2 years ago

Hi @horzsolt, I believe the amount of data in the database affects the query performance. We have around 150 GB of data in that table, and maybe that is the reason for the high consumption. Try creating many random unique IDs (combinations of component_id, sensor_id, sensor_position), insert data for them, and then perform the delete operation. That should be similar to what we have in our database.
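
A hedged sketch of one way to generate roughly similar data with generate_series, reusing the schema from the hypertable definition above; the ID ranges, one-minute spacing, and time span are assumptions and would need to be scaled up to approach the ~150 GB table size:

```sql
-- Hypothetical generator: many distinct (component_id, sensor_id, sensor_position)
-- combinations with one reading per minute over a year. Scale the ranges and the
-- time span up to approximate the production data volume.
INSERT INTO sensor_data (time, component_id, sensor_id, sensor_position, value)
SELECT t,
       c.component_id,
       s.sensor_id,
       (s.sensor_id % 4)::text AS sensor_position,  -- positions '0'..'3', matching the LIKE '0' filters
       random() * 100          AS value
FROM generate_series(TIMESTAMP '2021-01-01', TIMESTAMP '2021-12-31', INTERVAL '1 minute') AS t,
     generate_series(1, 10) AS c(component_id),
     generate_series(1, 60) AS s(sensor_id);
```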

BilalS10 commented 2 years ago

Also a quick update: I increased the AWS instance size to 8 vCPUs and 32 GB RAM. Along with this, I used timescaledb-tune to tune memory for 4 CPUs and 16 GB RAM (previously I had configured it for 2 vCPUs and 8 GB).

This time the query went through for a 1-year time range, and I observed that the maximum memory consumption was around 15-16 GB. Does that explain why the instance crashed when the available RAM was only 16 GB? However, my concern is: should a delete query even need this much memory?

horzsolt commented 2 years ago

Yes, the amount of available physical memory explains why the OOM killer was triggered. However, the memory consumption of this operation seems far too high. When we ran the same delete operation trying to reproduce the issue, the memory consumption was minimal.

Can you please let us know how many rows your delete statement actually deleted? Also, can you please try replacing LIKE with = in your delete statement and see if there is any difference? LIKE is a pattern-matching operator and therefore more complex than a simple comparison, and we have seen in other cases that this can lead to differences in execution.
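
For reference, the 10-year statement from the issue description rewritten with plain equality instead of LIKE:

```sql
DELETE FROM sensor_data
WHERE component_id = 125
  AND ((sensor_id = 77 AND sensor_position = '0')
    OR (sensor_id = 76 AND sensor_position = '0'))
  AND time BETWEEN '2021-01-01 00:00:00.000' AND '2031-10-28 18:40:00.000';
```

Since the sensor_position patterns contain no wildcard characters, LIKE '0' and = '0' match exactly the same rows; only the operator, and thus how the planner handles the predicate, changes.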

BilalS10 commented 2 years ago
  1. Number of rows deleted -> 4328337 (for the 4-month query)
  2. Sure, I will change the "LIKE" to "=" and rerun the query.

akuzm commented 2 years ago

This is on 2.6.1, right? I thought it should have these memory problems fixed...

At first glance, a work_mem of 10 GB might be a little too large if you have 16 GB of RAM total, also spend some GBs on buffers, and have some concurrent connections. If you reduce it to, say, 1 GB, does it OOM less?

To get more insight into memory usage, you could attach gdb to the backend running the offending query and type call MemoryContextStats(TopMemoryContext). This should print some info about memory usage to the server log. It should look like this:

TopMemoryContext: 83080 total in 6 blocks; 7008 free (18 chunks); 76072 used
  Custom Scan Methods: 16384 total in 2 blocks; 6656 free (3 chunks); 9728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6888 free (0 chunks); 1304 used
  ....

This info might help us debug the issue. If you capture it, please also run bt in gdb and include the result as well.
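
A rough sketch of that procedure, assuming the database runs in a container named timescaledb (as in the log prefix above) and that gdb is installed in it; the container name, user, and PID lookup query text are placeholders:

```shell
# 1. Find the PID of the backend running the delete.
docker exec -it timescaledb psql -U postgres -c \
  "SELECT pid, query FROM pg_stat_activity WHERE query ILIKE 'delete from sensor_data%';"

# 2. Attach gdb to that backend (replace <pid> with the value from step 1).
docker exec -it timescaledb gdb -p <pid>

# 3. Inside gdb:
#    (gdb) call MemoryContextStats(TopMemoryContext)   # prints memory-context usage to the server log
#    (gdb) bt                                          # backtrace of the backend
#    (gdb) detach
#    (gdb) quit
```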

svenklemm commented 2 years ago

Can you try upgrading to PG14? I could reproduce similar behaviour on PG13, but the memory allocations seem to be inside PostgreSQL code.

BilalS10 commented 2 years ago

@akuzm the work_mem is 10 MB, not GB. I'll try to get more memory insight using the method you mentioned and report back.

@svenklemm understood. I'll install PG14 with TimescaleDB 2.6.1 on a test instance and see whether it solves the issue.

Thank you @akuzm @svenklemm for your inputs

justinpryzby commented 2 years ago

See postgres commit 86dc90056, which updated this part of the docs for v14: https://www.postgresql.org/docs/13/ddl-partitioning.html

"It is important to consider the overhead of partitioning during query planning and execution. The query planner is generally able to handle partition hierarchies with up to a few thousand partitions fairly well, provided that typical queries allow the query planner to prune all but a small number of partitions. Planning times become longer and memory consumption becomes higher when more partitions remain after the planner performs partition pruning. This is particularly true for the UPDATE and DELETE commands. Another reason to be concerned about having a large number of partitions is that the server's memory consumption may grow significantly over time, especially if many sessions touch large numbers of partitions. That's because each partition requires its metadata to be loaded into the local memory of each session that touches it."

BilalS10 commented 2 years ago

@justinpryzby does that mean that, for TimescaleDB, having a larger number of chunks will affect performance?

justinpryzby commented 2 years ago

Yes, I think so. Note that the docs I referenced refer to native "declarative" partitioning, which TimescaleDB doesn't use, so I think the warning in the PostgreSQL v13 docs would still apply. Have you checked the behavior with PG14?
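
One way to see how many chunks a given delete touches is TimescaleDB's show_chunks function; a small sketch using the 4-month window from the failing query above (with 1-day chunks that window should cover roughly 120 chunks):

```sql
-- Approximate number of chunks covered by the deleted time range.
SELECT count(*)
FROM show_chunks('sensor_data',
                 older_than => TIMESTAMP '2021-10-31 23:59:00',
                 newer_than => TIMESTAMP '2021-07-01 00:00:00');

-- Total number of chunks in the hypertable.
SELECT count(*) FROM show_chunks('sensor_data');
```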

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. You can remove the stale label or comment. Otherwise, this issue will be closed in 30 days. Thank you!

github-actions[bot] commented 2 years ago

We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it. Thank you!