Closed: BilalS10 closed this issue 2 years ago
Hi @BilalS10, thanks for describing this issue, which is quite a tricky one. To be able to reproduce it we would like to do exactly what you did, so we would need the following:
Sure @horzsolt, here are the details.
hypertable definition ->
CREATE TABLE "sensor_data"(
    time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    component_id INTEGER,
    sensor_id INTEGER,
    sensor_position TEXT,
    value DOUBLE PRECISION
);
SELECT create_hypertable('sensor_data', 'time', chunk_time_interval => 86400000000);
CREATE INDEX ON "sensor_data"(component_id, sensor_id, sensor_position, time DESC);
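Note that the `chunk_time_interval` above is passed as a raw integer; for a TIMESTAMP-partitioned hypertable TimescaleDB interprets an integer interval as microseconds, so 86400000000 corresponds to one-day chunks (matching the "chunk size -> 1 day" configuration below). A quick sanity check of that arithmetic:

```python
# chunk_time_interval for TIMESTAMP hypertables is interpreted in microseconds
chunk_interval_us = 86_400_000_000

seconds = chunk_interval_us // 1_000_000  # microseconds -> seconds
hours = seconds // 3600                   # seconds -> hours

print(f"{seconds} s = {hours} h per chunk")  # → 86400 s = 24 h per chunk
```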
CAGG definition -> No CAGG used.
delete statement ->
delete from sensor_data where component_id = 125 and ((sensor_id = 77 and sensor_position LIKE '0') OR (sensor_id = 76 and sensor_position LIKE '0')) and time between '2021-01-01 00:00:00.000' and '2031-10-28 18:40:00.000'
explain output of delete statement -> I have attached the explain output file here: explain-delete-statement.csv
Thank you @BilalS10 for sending these details. The explain output shows that many chunks were involved in the deletion, which can explain the high memory usage.
Can you please try to do the data removal using smaller time periods in the delete statements, for example 1 year instead of 10 years? We would like to check whether this can serve as a workaround until we have investigated the original issue further. Thank you!
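One way to implement this smaller-period workaround is to issue one DELETE per time window rather than a single statement spanning a decade. A minimal sketch, assuming the `sensor_data` table and time range from this thread; the `year_windows` helper is hypothetical, not part of TimescaleDB:

```python
from datetime import datetime

def year_windows(start: datetime, end: datetime):
    """Split [start, end) into per-year windows for batched deletes."""
    windows = []
    lo = start
    while lo < end:
        hi = min(datetime(lo.year + 1, 1, 1), end)
        windows.append((lo, hi))
        lo = hi
    return windows

# One parameterized DELETE per window, each run (and committed) separately,
# so no single statement has to touch ~10 years of chunks at once.
SQL = ("DELETE FROM sensor_data "
       "WHERE component_id = %s "  # plus the original sensor predicates
       "AND time >= %s AND time < %s")

for lo, hi in year_windows(datetime(2021, 1, 1), datetime(2031, 10, 28)):
    print(lo, "->", hi)  # e.g. cursor.execute(SQL, (125, lo, hi))
```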
Got it. I'll try this and report back with the findings.
Hi @BilalS10 ! Do you have any new findings?
Hi @NunoFilipeSantos
I have implemented a solution as discussed above (deleting in chunks). Still testing it on our development server to see if it is working fine. The initial result does look good though when I tried with a 1 year time range. Will share more details once I test it completely with different test cases.
Hi @NunoFilipeSantos
I implemented the solution on one of my test instances and faced the same issue again. The OOM killer was invoked during the delete query, which was removing data for a range of 4 months. Here are the details:
timescaledb | 2022-05-09 20:04:59.545 UTC [32163] HINT: The most up-to-date version is 2.6.1, the installed version is 2.5.2. timescaledb | 2022-05-10 19:16:58.195 UTC [1] LOG: server process (PID 16302) was terminated by signal 9: Killed timescaledb | 2022-05-10 19:16:58.195 UTC [1] DETAIL: Failed process was running: delete from sensor_data where component_id = 6 and ((sensor_id = 35 and sensor_position LIKE '0') OR (sensor_id = 36 and sensor_position LIKE '0') OR (sensor_id = 37 and sensor_position LIKE '0') OR (sensor_id = 38 and sensor_position LIKE '0') OR (sensor_id = 39 and sensor_position LIKE '0') OR (sensor_id = 40 and sensor_position LIKE '0') OR (sensor_id = 41 and sensor_position LIKE '0') OR (sensor_id = 60 and sensor_position LIKE '0') OR (sensor_id = 61 and sensor_position LIKE '0') OR (sensor_id = 62 and sensor_position LIKE '0') OR (sensor_id = 63 and sensor_position LIKE '0') OR (sensor_id = 42 and sensor_position LIKE '0') OR (sensor_id = 42 and sensor_position LIKE '1') OR (sensor_id = 44 and sensor_position LIKE '0') OR (sensor_id = 44 and sensor_position LIKE '1') OR (sensor_id = 44 and sensor_position LIKE '2') OR (sensor_id = 44 and sensor_position LIKE '3')) and time between '2021-07-01 00:00:00.000' and '2021-10-31 23:59:00.000'
The number of rows in the database matching that query is 5,931,314.
So now, even though I am deleting in batches, the delete query fails.
Do you think the number of rows is also playing a part here?
Update: I tried reducing the batch size to months and weeks, but it still crashes. The only difference was that with 1-week batches at least the first batch went through, and it crashed on the 2nd batch. I have also upgraded my database to the latest version (2.6.1) and the issue persists.
Hi @horzsolt , were you able to investigate this at your end?
is this issue related -> https://github.com/timescale/timescaledb/issues/2486
Hi @BilalS10, TBH I'm struggling to reproduce your case, unfortunately. I used the same setup and settings as you, generated the same amount of data, and executed the same delete statements, and I didn't see any issues. It would be great if I could test with the same or similar data in the sensor_data table as you. Do you have any pointers or ideas on how I can generate data very similar to yours? Thank you.
Hi @horzsolt, I believe the amount of data in the database affects the query performance. We have around 150 GB of data in that table, and maybe that's the reason for the high consumption. Try creating many random unique IDs (combinations of component_id, sensor_id, sensor_position) and then inserting data, then perform the delete operation. That should be similar to what we have in our database.
Also a quick update: I increased the AWS instance size to 8 vCPUs and 32 GB RAM. Along with this, I used timescaledb-tune to perform memory tuning for 4 CPUs and 16 GB RAM (before, I had configured it for 2 vCPUs and 8 GB).
This time the query went through for a 1-year time range, and I observed a maximum memory consumption of around 15-16 GB. Does that explain why the instance crashed when the available RAM was only 16 GB? However, my concern is: should a delete query even need this much?
Yes, the amount of available physical memory explains why the OOM killer was triggered. However, the memory consumption of this operation seems to be too much. When we did the same delete operation trying to repro the issue, the memory consumption was minimal.
Can you please let us know how many rows your delete statement actually deleted?
Also, can you please try replacing LIKE with = in your delete statement and see if there is any difference? LIKE is a pattern-matching operator, therefore more complex than a simple comparison, and we've seen in some other cases that this can make a difference in execution.
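Since each sensor_position pattern in the delete statement above is a plain literal with no wildcards, the LIKE pairs can be rewritten as equality checks, or more compactly as a row-value IN list. A hypothetical string-building sketch (real code should use parameterized queries rather than string concatenation):

```python
# A subset of the (sensor_id, sensor_position) pairs from the delete statement
pairs = [(35, '0'), (36, '0'), (42, '0'), (42, '1'),
         (44, '0'), (44, '1'), (44, '2'), (44, '3')]

# Equivalent to OR-ing "(sensor_id = X AND sensor_position = 'Y')" terms,
# but using simple equality instead of LIKE pattern matching.
in_list = ", ".join(f"({sid}, '{pos}')" for sid, pos in pairs)
predicate = f"(sensor_id, sensor_position) IN ({in_list})"

print(predicate)
```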
This is on 2.6.1, right? I thought it should have these memory problems fixed...
At first glance, a work_mem of 10 GB might be a little too large if you have 16 GB RAM total, also spend some GBs on buffers, and have some concurrent connections. If you reduce it to, say, 1 GB, does it OOM less?
To get more insight into memory usage, you could attach gdb to the backend running the offending query and type `call MemoryContextStats(TopMemoryContext)`. This should print some info about memory usage to the server log. It should look like this:
TopMemoryContext: 83080 total in 6 blocks; 7008 free (18 chunks); 76072 used
Custom Scan Methods: 16384 total in 2 blocks; 6656 free (3 chunks); 9728 used
RowDescriptionContext: 8192 total in 1 blocks; 6888 free (0 chunks); 1304 used
....
This info might help us debug this issue. If you capture it, please also run `bt` in gdb and include the result as well.
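When the `MemoryContextStats` dump spans hundreds of contexts, a small throwaway script (purely illustrative, not part of any tool) can total the numbers from the server log lines shown above:

```python
import re

# Matches lines like:
# "TopMemoryContext: 83080 total in 6 blocks; 7008 free (18 chunks); 76072 used"
LINE = re.compile(r"(\d+) total in \d+ blocks?; (\d+) free \(\d+ chunks?\); (\d+) used")

log = """\
TopMemoryContext: 83080 total in 6 blocks; 7008 free (18 chunks); 76072 used
Custom Scan Methods: 16384 total in 2 blocks; 6656 free (3 chunks); 9728 used
RowDescriptionContext: 8192 total in 1 blocks; 6888 free (0 chunks); 1304 used
"""

total = used = 0
for m in LINE.finditer(log):
    total += int(m.group(1))
    used += int(m.group(3))

print(f"total={total} bytes, used={used} bytes")
```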
Can you try to upgrade to PG14? I could reproduce a similar behaviour on PG13, but the memory allocations seem to be inside the Postgres code.
@akuzm the work_mem is 10 MB, not GB. I'll try getting more memory insight using the method you mentioned and report back.
@svenklemm understood. I'll install PG14/TimescaleDB 2.6.1 on a test instance and see if it solves the issue.
Thank you @akuzm @svenklemm for your inputs
See Postgres commit 86dc90056, which updated this part of the docs for v14: https://www.postgresql.org/docs/13/ddl-partitioning.html

> It is important to consider the overhead of partitioning during query planning and execution. The query planner is generally able to handle partition hierarchies with up to a few thousand partitions fairly well, provided that typical queries allow the query planner to prune all but a small number of partitions. Planning times become longer and memory consumption becomes higher when more partitions remain after the planner performs partition pruning. This is particularly true for the UPDATE and DELETE commands. Another reason to be concerned about having a large number of partitions is that the server's memory consumption may grow significantly over time, especially if many sessions touch large numbers of partitions. That's because each partition requires its metadata to be loaded into the local memory of each session that touches it.
@justinpryzby does that mean that, for TimescaleDB, having a larger number of chunks will affect performance?
Yes, I think so. Note that the docs I referenced refer to native "declarative" partitions, which TimescaleDB doesn't use, so I think the warning in the PostgreSQL v13 docs would still apply. Have you checked the behavior with PG14?
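For context, with the 1-day `chunk_time_interval` used in this thread, the original 10-year delete range touches on the order of a few thousand chunks, which is exactly the regime the quoted docs warn about. A rough estimate:

```python
from datetime import date

# Time range from the original delete statement, one 1-day chunk per day
start = date(2021, 1, 1)
end = date(2031, 10, 28)

chunks = (end - start).days + 1  # inclusive of both endpoint days
print(f"~{chunks} daily chunks touched by the 10-year delete")
```

This is only an upper bound on the chunks the planner must consider; chunks with no matching rows still add planning overhead unless they are pruned.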
This issue has been automatically marked as stale due to lack of activity. You can remove the stale label or comment. Otherwise, this issue will be closed in 30 days. Thank you!
We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it. Thank you!
What type of bug is this?
Crash
What subsystems and features are affected?
Other
What happened?
The OOM killer is invoked whenever I run consecutive delete queries on TimescaleDB. On further investigation I found that 2 Postgres processes were consuming around 8 GB of memory, and hence the kernel killed one of these processes (check the kernel log for processes "24835" & "24876"). I have attached the kernel logs as well as the TimescaleDB Docker logs herewith. Also, the delete query I am running deletes around 50k points across 10 years.
AWS instance configuration: vCPUs -> 4, RAM -> 16 GB, disk space -> 512 GB
TSDB configuration: chunk size -> 1 day
PostgreSQL configuration:
shared_buffers = 2GB
work_mem = 10485kB
maintenance_work_mem = 1GB
effective_io_concurrency = 256
max_worker_processes = 13
max_parallel_workers_per_gather = 1
max_parallel_workers = 2
max_wal_size = 1GB
min_wal_size = 80MB
effective_cache_size = 6GB
autovacuum_max_workers = 10
autovacuum_naptime = 10
max_locks_per_transaction = 1024
TimescaleDB version affected
2.5.2
PostgreSQL version used
13.5
What operating system did you use?
Debian GNU/Linux 10 (buster)
What installation method did you use?
Docker
What platform did you run on?
Amazon Web Services (AWS)
Relevant log output and stack trace
How can we reproduce the bug?