@traxanos thanks for the bug report. We may need to gather some additional information from you; let me discuss with the team and circle back with you.
Hi, I have tried to create a partial dump, but with the partial dump the OOM isn't reproducible. And the original dump contains sensitive data and is very, very large ...
OK, no worries. There may be some Timescale/PG logs we can look at that should not contain any sensitive info. Stay tuned as we consider next steps!
Have you discussed the problem with your team? Any new info on how I can support you?
Do you have any new information for us?
I also have this bug and I may be able to find a reproducible scenario. Do you want a full DB dump or something less?
Same situation with plain Postgres, without TimescaleDB.
Sending one request with ~900 lines like:
update partitioned_table set a=1 where id=3;
update partitioned_table set a=2 where id=4;
...
kills the DB. Even EXPLAIN ANALYZE triggers it, not just the actual UPDATE.
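A minimal repro sketch of that setup (hypothetical names, untested): a table with many partitions plus a multi-statement request of UPDATEs filtered on a non-partition column, so the planner cannot prune any partition.

-- Hypothetical repro sketch, untested.
CREATE TABLE partitioned_table (id int, a int, ts int) PARTITION BY RANGE (ts);
DO $$
BEGIN
    FOR i IN 0..399 LOOP  -- ~400 daily partitions, matching counts reported below
        EXECUTE format(
            'CREATE TABLE partitioned_table_p%s PARTITION OF partitioned_table FOR VALUES FROM (%s) TO (%s)',
            i, i * 86400, (i + 1) * 86400);
    END LOOP;
END $$;
-- Then send ~900 statements like the ones above in a single request:
--   update partitioned_table set a=1 where id=3;
--   update partitioned_table set a=2 where id=4;
--   ...
-- Because the WHERE clause is on id, not the partition key ts, each UPDATE
-- has to consider every partition.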
Experiencing this problem on AWS RDS Postgres 11.
Hello, we transformed the TREND tables into traditional tables instead of hypertables, and the issue is gone for now. We are waiting for a proper fix before moving them back into hypertables. Zabbix 4.4, PostgreSQL 11.6.
Hi, I migrated this weekend from a Zabbix MySQL setup to TimescaleDB and I am experiencing the same problem, but with the trends_uint table.
During the migration I dropped the primary key and unfortunately ran parts of the migration a second time.
While trying to remove the duplicates from the trends_uint table using:
DELETE FROM public.trends_uint T1
USING public.trends_uint T2
WHERE T1.ctid < T2.ctid
AND T1.itemid = T2.itemid
and T1.clock = T2.clock;
the table got locked and the INSERT statements coming from the Zabbix server process got queued.
After canceling the above statement, the Zabbix server tries to INSERT the held-up entries, the memory gets used up, and I get an OOM kill.
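In hindsight, a batched variant might have avoided holding the lock for so long: delete a limited set of duplicated (itemid, clock) keys per pass so the queued INSERTs can drain in between. A sketch, untested, that keeps the highest-ctid row per duplicated key:

-- Sketch, untested: repeat the statement until it reports DELETE 0.
WITH dup_keys AS (
    SELECT itemid, clock
    FROM public.trends_uint
    GROUP BY itemid, clock
    HAVING count(*) > 1
    LIMIT 1000
)
DELETE FROM public.trends_uint t
USING dup_keys d, public.trends_uint keep
WHERE t.itemid = d.itemid AND t.clock = d.clock
  AND keep.itemid = t.itemid AND keep.clock = t.clock
  AND t.ctid < keep.ctid;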
2020-03-15 23:14:57.129 UTC [604] WARNING: there is no transaction in progress
2020-03-15 23:16:12.615 UTC [1] LOG: server process (PID 578) was terminated by signal 9
2020-03-15 23:16:12.615 UTC [1] DETAIL: Failed process was running: update trends_uint set num=29,value_min=1217753088,value_avg=1277251019,value_max=1336111104 where itemid=200200000030428 and clock=1584309600;
update trends_uint set num=119,value_min=4,value_avg=4,value_max=4 where itemid=200200000398365 and clock=1584309600;
update trends_uint set num=120,value_min=1,value_avg=1,value_max=1 where itemid=200200000579962 and clock=1584309600;
update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000579994 and clock=1584309600;
update trends_uint set num=120,value_min=1,value_avg=1,value_max=1 where itemid=200200000580026 and clock=1584309600;
update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580078 and clock=1584309600;
update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580079 and clock=1584309600;
update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580090 and clock=1584309600;
update trends_uint set num=120,value_min=1,value_avg=
2020-03-15 23:16:12.616 UTC [1] LOG: terminating any other active server processes
2020-03-15 23:16:12.616 UTC [626] WARNING: terminating connection because of crash of another server process
2020-03-15 23:16:12.616 UTC [626] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[16239.266165] postgres invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[16239.266167] postgres cpuset=ef71009861c9d05e29679ead151d91e2901a77bf753508702dcc773c3dccf247 mems_allowed=0
[16239.266181] CPU: 1 PID: 20854 Comm: postgres Tainted: P OE 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
[16239.266182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[16239.266183] Call Trace:
[16239.266194] dump_stack+0x5c/0x80
[16239.266199] dump_header+0x6b/0x283
[16239.266209] ? do_try_to_free_pages+0x2ec/0x370
[16239.266213] oom_kill_process.cold.30+0xb/0x1cf
[16239.266215] ? oom_badness+0xe9/0x140
[16239.266218] out_of_memory+0x1a5/0x430
[16239.266220] __alloc_pages_slowpath+0xbd8/0xcb0
[16239.266224] __alloc_pages_nodemask+0x28b/0x2b0
[16239.266227] filemap_fault+0x3bd/0x780
[16239.266235] ? filemap_map_pages+0x139/0x3a0
[16239.266239] __do_fault+0x36/0x130
[16239.266254] __handle_mm_fault+0xe6c/0x1270
[16239.266261] handle_mm_fault+0xd6/0x200
[16239.266264] __do_page_fault+0x249/0x4f0
[16239.266287] ? ksys_lseek+0x62/0xb0
[16239.266291] ? async_page_fault+0x8/0x30
[16239.266292] async_page_fault+0x1e/0x30
[16239.266331] RIP: 0033:0x55d620552610
[16239.266337] Code: Bad RIP value.
[16239.266338] RSP: 002b:00007fffb4b95ba8 EFLAGS: 00010202
[16239.266340] RAX: 00007fd146dd8890 RBX: 00007fd147d99740 RCX: 0000000000000000
[16239.266341] RDX: 00007fd146dd88a8 RSI: 0000000000000001 RDI: 00007fd146dd8890
[16239.266342] RBP: 0000000000000001 R08: 00007fd3710fb4f0 R09: 00007fd3710fb4d0
[16239.266343] R10: 00007fd147d9a368 R11: 0000000000000246 R12: 00007fd370ccbdb8
[16239.266344] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fd370ccb688
[16239.266346] Mem-Info:
[16239.266351] active_anon:5393045 inactive_anon:388189 isolated_anon:32
active_file:103 inactive_file:0 isolated_file:0
unevictable:683 dirty:0 writeback:0 unstable:0
slab_reclaimable:15928 slab_unreclaimable:78860
mapped:141496 shmem:200195 pagetables:21681 bounce:0
free:41906 free_pcp:0 free_cma:0
[16239.266354] Node 0 active_anon:21572180kB inactive_anon:1552756kB active_file:412kB inactive_file:0kB unevictable:2732kB isolated(anon):128kB isolated(file):0kB mapped:565984kB dirty:0kB writeback:0kB shmem:800780kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 40960kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[16239.266355] Node 0 DMA free:15908kB min:40kB low:52kB high:64kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266360] lowmem_reserve[]: 0 2917 24034 24034 24034
[16239.266364] Node 0 DMA32 free:92472kB min:8196kB low:11180kB high:14164kB active_anon:2902112kB inactive_anon:104kB active_file:64kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129196kB managed:3043408kB mlocked:0kB kernel_stack:160kB pagetables:6708kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266369] lowmem_reserve[]: 0 0 21117 21117 21117
[16239.266373] Node 0 Normal free:59244kB min:59340kB low:80964kB high:102588kB active_anon:18668588kB inactive_anon:1551400kB active_file:0kB inactive_file:604kB unevictable:2732kB writepending:0kB present:22020096kB managed:21624148kB mlocked:0kB kernel_stack:7792kB pagetables:80016kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266384] lowmem_reserve[]: 0 0 0 0 0
[16239.266388] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[16239.266405] Node 0 DMA32: 301*4kB (UME) 295*8kB (UME) 325*16kB (UME) 223*32kB (UME) 67*64kB (UME) 39*128kB (UME) 40*256kB (UME) 39*512kB (UM) 37*1024kB (UME) 0*2048kB 0*4096kB = 93276kB
[16239.266421] Node 0 Normal: 79*4kB (UME) 41*8kB (UME) 97*16kB (UM) 18*32kB (M) 55*64kB (M) 390*128kB (M) 19*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61076kB
[16239.266443] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[16239.266444] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[16239.266445] 212081 total pagecache pages
[16239.266450] 10932 pages in swap cache
[16239.266452] Swap cache stats: add 19760750, delete 19753628, find 1100161/1419572
[16239.266452] Free swap = 0kB
[16239.266453] Total swap = 16776188kB
[16239.266454] 6291321 pages RAM
[16239.266455] 0 pages HighMem/MovableOnly
[16239.266455] 120455 pages reserved
[16239.266456] 0 pages hwpoisoned
[16239.266457] Tasks state (memory values in pages):
[16239.266458] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[16239.266468] [ 273] 0 273 767 1 45056 26 0 udhcpc
[16239.266470] [ 307] 0 307 767 1 40960 26 0 udhcpc
[16239.266473] [ 597] 0 597 6010 0 69632 355 -1000 systemd-udevd
[16239.266481] [ 3200] 0 3200 20332 235 192512 1340 0 systemd-journal
[16239.266483] [ 3237] 0 3237 2372 1 53248 324 0 dhclient
[16239.266486] [ 3275] 145 3275 23215 0 81920 221 0 systemd-timesyn
[16239.266488] [ 3358] 0 3358 4858 1 69632 243 0 systemd-logind
[16239.266491] [ 3359] 0 3359 1608 23 53248 36 0 qemu-ga
[16239.266493] [ 3361] 0 3361 1677 20 49152 31 0 nscd
[16239.266495] [ 3363] 103 3363 2994 72 65536 109 -900 dbus-daemon
[16239.266497] [ 3364] 0 3364 73955 12 90112 469 0 rsyslogd
[16239.266499] [ 3368] 0 3368 39562 2 69632 128 0 zed
[16239.266501] [ 3373] 0 3373 581 1 45056 34 0 acpid
[16239.266503] [ 3374] 0 3374 2122 14 57344 59 0 cron
[16239.266505] [ 3385] 0 3385 481148 1453 389120 6733 0 containerd
[16239.266507] [ 3394] 0 3394 1351 0 53248 44 0 atd
[16239.266509] [ 3395] 109 3395 4680 28 73728 257 0 dnsmasq
[16239.266511] [ 3452] 114 3452 97198 0 122880 1298 -1000 nslcd
[16239.266513] [ 3570] 0 3570 597 0 45056 23 0 sshguard
[16239.266515] [ 3571] 0 3571 597 1 45056 23 0 sshguard
[16239.266517] [ 3572] 0 3572 979 2 45056 27 0 sshg-parser
[16239.266519] [ 3573] 0 3573 17046 64 159744 186 0 journalctl
[16239.266521] [ 3574] 0 3574 2727 0 40960 110 0 sshg-blocker
[16239.266523] [ 3575] 0 3575 597 1 45056 23 0 sshguard
[16239.266525] [ 3576] 0 3576 597 1 45056 25 0 sshg-fw-nft-set
[16239.266528] [ 3684] 22502 3684 6623 13 94208 284 0 zabbix_agentd
[16239.266530] [ 3721] 0 3721 3942 23 73728 193 -1000 sshd
[16239.266536] [ 3761] 22502 3761 6623 153 94208 269 0 zabbix_agentd
[16239.266539] [ 3762] 22502 3762 6657 12 90112 317 0 zabbix_agentd
[16239.266540] [ 3763] 22502 3763 6657 32 90112 297 0 zabbix_agentd
[16239.266542] [ 3764] 22502 3764 6657 25 90112 305 0 zabbix_agentd
[16239.266544] [ 3766] 22502 3766 6657 25 90112 304 0 zabbix_agentd
[16239.266547] [ 3767] 22502 3767 6657 23 90112 306 0 zabbix_agentd
[16239.266548] [ 3768] 22502 3768 6657 2 90112 329 0 zabbix_agentd
[16239.266550] [ 3769] 22502 3769 6657 16 90112 314 0 zabbix_agentd
[16239.266554] [ 3770] 22502 3770 6657 40 90112 289 0 zabbix_agentd
[16239.266556] [ 3771] 22502 3771 6657 11 90112 319 0 zabbix_agentd
[16239.266558] [ 3772] 22502 3772 6657 8 90112 321 0 zabbix_agentd
[16239.266560] [ 3773] 22502 3773 6763 208 94208 342 0 zabbix_agentd
[16239.266562] [ 4065] 0 4065 10867 20 77824 190 0 master
[16239.266565] [ 4109] 108 4109 11635 10 90112 192 0 qmgr
[16239.266567] [ 4722] 0 4722 641 0 45056 29 0 agetty
[16239.266570] [ 5974] 0 5974 5256 1 90112 316 0 sshd
[16239.266572] [ 5988] 20224 5988 5740 0 77824 448 0 systemd
[16239.266574] [ 5990] 20224 5990 44419 1 110592 897 0 (sd-pam)
[16239.266576] [ 6010] 20224 6010 5256 0 86016 335 0 sshd
[16239.266579] [ 6011] 20224 6011 2667 1 65536 828 0 bash
[16239.266581] [ 6145] 0 6145 3953 0 69632 208 0 sudo
[16239.266583] [ 6151] 0 6151 2666 1 57344 920 0 bash
[16239.266586] [ 11530] 0 11530 2144 0 53248 66 0 screen
[16239.266588] [ 11532] 0 11532 2221 0 49152 175 0 screen
[16239.266591] [ 11534] 0 11534 2481 0 57344 715 0 bash
[16239.266593] [ 21703] 0 21703 5256 1 77824 318 0 sshd
[16239.266595] [ 21709] 20224 21709 5256 0 73728 349 0 sshd
[16239.266597] [ 21710] 20224 21710 2667 1 61440 828 0 bash
[16239.266599] [ 21881] 0 21881 3953 0 73728 208 0 sudo
[16239.266601] [ 21885] 0 21885 2765 1 65536 1200 0 bash
[16239.266604] [ 25673] 108 25673 11109 0 73728 205 0 pickup
[16239.266605] [ 416] 108 416 11110 0 81920 206 0 showq
[16239.266609] [ 11054] 0 11054 438013 471 483328 11944 -500 dockerd
[16239.266613] [ 12923] 0 12923 358596 0 376832 9974 0 docker
[16239.266616] [ 12956] 0 12956 358660 0 372736 8946 0 docker
[16239.266619] [ 13485] 0 13485 137180 0 114688 1371 -500 docker-proxy
[16239.266623] [ 13492] 0 13492 26924 116 69632 914 -999 containerd-shim
[16239.266626] [ 13506] 0 13506 26924 154 69632 861 -999 containerd-shim
[16239.266629] [ 13528] 5050 13528 6237 188 94208 3999 0 gunicorn
[16239.266631] [ 13534] 70 13534 1598453 35329 397312 206 0 postgres
[16239.266634] [ 13715] 0 13715 714 0 45056 62 0 master
[16239.266636] [ 13716] 100 13716 712 0 45056 43 0 pickup
[16239.266645] [ 13717] 100 13717 724 0 45056 41 0 qmgr
[16239.266647] [ 13774] 5050 13774 26796 536 258048 19965 0 gunicorn
[16239.266649] [ 15467] 0 15467 340227 0 368640 7119 0 docker
[16239.266651] [ 20613] 70 20613 1598453 17 94208 236 0 postgres
[16239.266652] [ 20614] 70 20614 1598489 12089 212992 221 0 postgres
[16239.266654] [ 20615] 70 20615 1598453 4165 126976 226 0 postgres
[16239.266656] [ 20616] 70 20616 1598564 220 147456 266 0 postgres
[16239.266658] [ 20617] 70 20617 11518 68 77824 214 0 postgres
[16239.266660] [ 20618] 70 20618 1598534 184 143360 273 0 postgres
[16239.266662] [ 20619] 70 20619 1598532 0 118784 306 0 postgres
[16239.266664] [ 20620] 70 20620 1598986 944 212992 540 0 postgres
[16239.266666] [ 20621] 70 20621 1598986 1102 212992 536 0 postgres
[16239.266668] [ 20637] 70 20637 4222263 1347732 21577728 1282963 0 postgres
[16239.266670] [ 20691] 70 20691 4060362 1379043 20291584 1095924 0 postgres
[16239.266671] [ 20805] 70 20805 3875951 1461398 18739200 823175 0 postgres
[16239.266673] [ 20854] 70 20854 3869835 1415478 18767872 868652 0 postgres
[16239.266676] [ 20931] 70 20931 1607228 44795 1392640 193 0 postgres
[16239.266680] [ 20993] 70 20993 1599540 96876 1466368 98 0 postgres
[16239.266682] Out of memory: Kill process 20637 (postgres) score 254 or sacrifice child
[16239.268315] Killed process 20637 (postgres) total-vm:16889052kB, anon-rss:5357136kB, file-rss:0kB, shmem-rss:33792kB
[16239.798933] oom_reaper: reaped process 20637 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:33792kB
postgresql.conf has been configured using timescaledb-tune, with everything set to the recommended values.
No idea where to look further from here.
Hi there ... we have the exact same problem here (just with a much smaller DB). Also with Zabbix. We converted the tables back to native Postgres for now.
zabbix=> \dx
List of installed extensions
Name | Version | Schema | Description
--------------------+---------+------------+-------------------------------------------------------------------
timescaledb | 1.3.2 | public | Enables scalable inserts and complex queries for time-series data
(5 rows)
zabbix=> SELECT version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 11.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)
zabbix=>
I investigated this deeply. It all seems related to a PostgreSQL issue: on clean Postgres 10/11 with partitioning, if trends are also partitioned by day, you can hit such an OOM crash once you reach 350-400 partitions. So a solution, at least for Zabbix, could be to use bigger partitions, by week or month; then you never face this.
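In TimescaleDB terms, that would mean a larger chunk interval for new chunks. Assuming the hypertable is partitioned on Zabbix's integer clock column (epoch seconds), something like the following should make new chunks roughly monthly (it does not rewrite existing chunks):

-- 2592000 seconds = 30 days; affects newly created chunks only.
SELECT set_chunk_time_interval('trends_uint', 2592000);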
Interesting. We had 32 days for history and 365 for trends. I don't know how many chunks we had for trends, but even with history we would have had 5.
I may test that theory once I get back to work after the current situation. You may lose some performance optimizations from the built-in Timescale functions.
Hmm... Will post some thoughts over at the Zabbix issue too.
Same here: CentOS 7 (latest updates), Zabbix 4.4.7 from repo.zabbix.com, postgresql11-server-11.7-1PGDG.rhel7.x86_64, timescaledb_11-1.6.1-1.rhel7.x86_64
The Zabbix team is working on a workaround -> https://support.zabbix.com/browse/ZBX-16347 They of course can't fix it, as the problem seems to be in Timescale or even in Postgres itself, but they plan to highlight the problem in the docs and change the partition timespans for the trend tables.
@bboule are you from the Timescale team? Have you had any insights on this problem, maybe for Postgres itself, or from other reports?
According to the Zabbix community, the issue is confirmed on CentOS, Ubuntu, and Debian; probably there are simply no others using this particular feature yet. Therefore, I assume it's not an issue with a specific build...
Touching lots of partitions/chunks during UPDATEs can lead to OOM. PostgreSQL 12 should improve the situation, as it prunes subtables during deletes and updates. Note that TimescaleDB doesn't specifically optimize updates right now.
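One way to see the effect is to EXPLAIN an UPDATE against a plain partitioned table and count how many partitions show up in the plan. An illustrative sketch with made-up names:

-- Illustrative sketch on vanilla PostgreSQL.
CREATE TABLE metrics (id int, a int, ts int) PARTITION BY RANGE (ts);
CREATE TABLE metrics_p0 PARTITION OF metrics FOR VALUES FROM (0) TO (100);
CREATE TABLE metrics_p1 PARTITION OF metrics FOR VALUES FROM (100) TO (200);
-- With a filter on the partition key, newer versions can prune
-- non-matching partitions from the plan; with a filter on id alone
-- (as Zabbix issues), every partition must be planned.
EXPLAIN UPDATE metrics SET a = 1 WHERE ts = 150;
EXPLAIN UPDATE metrics SET a = 1 WHERE id = 3;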
Closing since this issue doesn't seem specific to TimescaleDB.
Hello! I do not think this issue should be closed. We have the same problem with DELETE statements on PostgreSQL 13.3 and TimescaleDB 2.3.1. I wrote the details of our issue here: https://github.com/timescale/timescaledb/issues/1085
Relevant system information:
PostgreSQL version (output of postgres --version): 11.6
TimescaleDB version (output of \dx in psql): 1.4.1 up to 1.5.1

Describe the bug
I use Zabbix with TimescaleDB. For writing trends, the Zabbix server uses multiple UPDATE statements in one transaction. In this case PostgreSQL runs into an OOM. I started with 1.4.1 and manually updated up to 1.5.1.

To Reproduce
Send many UPDATE lines in one transaction and the OOM occurs.

Expected behavior
The server should run without crashing.

Actual behavior
The server's memory usage grows to 50GB and it runs into the OOM killer.

Additional context
You can find the Zabbix bug report here: https://support.zabbix.com/browse/ZBX-16347