timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

OOM during UPDATE in a transaction. #1569

Closed traxanos closed 3 years ago

traxanos commented 4 years ago

Relevant system information:

Describe the bug I use Zabbix with TimescaleDB. For writing trends, the Zabbix server uses multiple UPDATE statements in a transaction. In this case PostgreSQL runs into an OOM. I started with 1.4.1 and have manually updated up to 1.5.1.

To Reproduce When you send many UPDATE statements during a single transaction, the OOM occurs.

Expected behavior The server should run without crashing.

Actual behavior The server grows to 50 GB of memory and runs into an OOM.

Additional context You can find the Zabbix bug report here: https://support.zabbix.com/browse/ZBX-16347
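For reference, a minimal sketch of the pattern that triggers this, assuming a simplified version of the Zabbix trends schema (column list, values, and chunk interval here are illustrative, not the exact Zabbix DDL):

-- Simplified, hypothetical trends hypertable; Zabbix stores clock as a Unix-epoch
-- integer, so the chunk interval is given in seconds (86400 = daily chunks).
CREATE TABLE trends (
    itemid    bigint  NOT NULL,
    clock     integer NOT NULL,
    num       integer NOT NULL,
    value_min numeric NOT NULL,
    value_avg numeric NOT NULL,
    value_max numeric NOT NULL,
    PRIMARY KEY (itemid, clock)
);
SELECT create_hypertable('trends', 'clock', chunk_time_interval => 86400);

-- The Zabbix server flushes trends roughly like this: one transaction containing
-- many single-row UPDATEs, each for a different (itemid, clock) pair.
BEGIN;
UPDATE trends SET num = 120, value_min = 0, value_avg = 1, value_max = 2
    WHERE itemid = 1001 AND clock = 1584309600;
UPDATE trends SET num = 119, value_min = 4, value_avg = 4, value_max = 4
    WHERE itemid = 1002 AND clock = 1584309600;
-- ... many more ...
COMMIT;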

bboule commented 4 years ago

@traxanos thanks for the bug report. We may need to gather some additional information from you; let me discuss with the team and circle back with you.

traxanos commented 4 years ago

Hi, I have tried to create a partial dump, but the OOM isn't reproducible with it. The original dump contains sensitive data and is very, very large ...

bboule commented 4 years ago

OK, no worries. There may be some TimescaleDB/PostgreSQL logs we can look at that should not contain any sensitive info. Stay tuned as we consider next steps!

traxanos commented 4 years ago

Have you discussed the problem with your team? Any new info on how we can support you?

traxanos commented 4 years ago

Do you have any new information for us?

overhacked commented 4 years ago

I also have this bug and I may be able to find a reproducible scenario. Do you want a full DB dump or something less?

konstantin-kornienko commented 4 years ago

Same situation with clean Postgres, without timescale.

Sending one request with ~900 lines like:

update partitioned_table set a=1 where id=3;
update partitioned_table set a=2 where id=4;
...

kills the DB. Even EXPLAIN ANALYZE does it, not the actual update.

Experiencing this problem on AWS RDS PostgreSQL 11.
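For anyone trying to reproduce this without TimescaleDB, a rough sketch on plain PostgreSQL, assuming a declaratively partitioned table with a few hundred partitions (table name, column names, and partition count are illustrative):

-- Hypothetical reproduction: a table range-partitioned into ~400 daily ranges.
CREATE TABLE partitioned_table (
    id    bigint  NOT NULL,
    clock integer NOT NULL,
    a     integer
) PARTITION BY RANGE (clock);

DO $$
BEGIN
    FOR i IN 0..399 LOOP
        EXECUTE format(
            'CREATE TABLE partitioned_table_p%s PARTITION OF partitioned_table
                 FOR VALUES FROM (%s) TO (%s)',
            i, i * 86400, (i + 1) * 86400);
    END LOOP;
END $$;

-- Then send one request with hundreds of UPDATEs like the ones above; note that
-- they filter on id rather than on the partition key, so no partition can be pruned:
-- update partitioned_table set a = 1 where id = 3;
-- update partitioned_table set a = 2 where id = 4;
-- ...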

eertul commented 4 years ago

Hello, we have transformed the trends tables into traditional tables instead of hypertables and the issue is gone for now. We are waiting for a proper fix before moving them back to hypertables. Zabbix 4.4, PostgreSQL 11.6.
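For anyone wanting to try the same workaround, one way to turn a hypertable back into a plain table is roughly the following (only a sketch with an illustrative table name; stop writers first and test on a copy):

-- Sketch: copy the hypertable's rows into a regular table, then swap the names.
BEGIN;
CREATE TABLE trends_plain (LIKE trends INCLUDING ALL);
INSERT INTO trends_plain SELECT * FROM trends;
ALTER TABLE trends RENAME TO trends_hypertable_old;
ALTER TABLE trends_plain RENAME TO trends;
COMMIT;
-- Once the data is verified, the old hypertable can be dropped:
-- DROP TABLE trends_hypertable_old;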

stumbaumr commented 4 years ago

Hi, I migrated this weekend from a Zabbix MySQL setup to TimescaleDB and I am experiencing the same problem, but with the trends_uint table.

During the migration I dropped the primary key and unfortunately ran parts of the migration a second time.

While trying to remove the duplicates from the trends_uint table by using

DELETE FROM public.trends_uint T1
    USING public.trends_uint T2
WHERE   T1.ctid < T2.ctid
    AND T1.itemid  = T2.itemid
    and T1.clock = T2.clock;

the table got locked and the INSERT statements coming from the zabbix server process got queued.

After canceling the above statement, the Zabbix server tries to INSERT the held-up entries, then the memory gets used up and I get an OOM kill.

2020-03-15 23:14:57.129 UTC [604] WARNING:  there is no transaction in progress
2020-03-15 23:16:12.615 UTC [1] LOG:  server process (PID 578) was terminated by signal 9
2020-03-15 23:16:12.615 UTC [1] DETAIL:  Failed process was running: update trends_uint set num=29,value_min=1217753088,value_avg=1277251019,value_max=1336111104 where itemid=200200000030428 and clock=1584309600;
    update trends_uint set num=119,value_min=4,value_avg=4,value_max=4 where itemid=200200000398365 and clock=1584309600;
    update trends_uint set num=120,value_min=1,value_avg=1,value_max=1 where itemid=200200000579962 and clock=1584309600;
    update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000579994 and clock=1584309600;
    update trends_uint set num=120,value_min=1,value_avg=1,value_max=1 where itemid=200200000580026 and clock=1584309600;
    update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580078 and clock=1584309600;
    update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580079 and clock=1584309600;
    update trends_uint set num=120,value_min=0,value_avg=0,value_max=0 where itemid=200200000580090 and clock=1584309600;
    update trends_uint set num=120,value_min=1,value_avg=
2020-03-15 23:16:12.616 UTC [1] LOG:  terminating any other active server processes
2020-03-15 23:16:12.616 UTC [626] WARNING:  terminating connection because of crash of another server process
2020-03-15 23:16:12.616 UTC [626] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[16239.266165] postgres invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[16239.266167] postgres cpuset=ef71009861c9d05e29679ead151d91e2901a77bf753508702dcc773c3dccf247 mems_allowed=0
[16239.266181] CPU: 1 PID: 20854 Comm: postgres Tainted: P           OE     4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
[16239.266182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[16239.266183] Call Trace:
[16239.266194]  dump_stack+0x5c/0x80
[16239.266199]  dump_header+0x6b/0x283
[16239.266209]  ? do_try_to_free_pages+0x2ec/0x370
[16239.266213]  oom_kill_process.cold.30+0xb/0x1cf
[16239.266215]  ? oom_badness+0xe9/0x140
[16239.266218]  out_of_memory+0x1a5/0x430
[16239.266220]  __alloc_pages_slowpath+0xbd8/0xcb0
[16239.266224]  __alloc_pages_nodemask+0x28b/0x2b0
[16239.266227]  filemap_fault+0x3bd/0x780
[16239.266235]  ? filemap_map_pages+0x139/0x3a0
[16239.266239]  __do_fault+0x36/0x130
[16239.266254]  __handle_mm_fault+0xe6c/0x1270
[16239.266261]  handle_mm_fault+0xd6/0x200
[16239.266264]  __do_page_fault+0x249/0x4f0
[16239.266287]  ? ksys_lseek+0x62/0xb0
[16239.266291]  ? async_page_fault+0x8/0x30
[16239.266292]  async_page_fault+0x1e/0x30
[16239.266331] RIP: 0033:0x55d620552610
[16239.266337] Code: Bad RIP value.
[16239.266338] RSP: 002b:00007fffb4b95ba8 EFLAGS: 00010202
[16239.266340] RAX: 00007fd146dd8890 RBX: 00007fd147d99740 RCX: 0000000000000000
[16239.266341] RDX: 00007fd146dd88a8 RSI: 0000000000000001 RDI: 00007fd146dd8890
[16239.266342] RBP: 0000000000000001 R08: 00007fd3710fb4f0 R09: 00007fd3710fb4d0
[16239.266343] R10: 00007fd147d9a368 R11: 0000000000000246 R12: 00007fd370ccbdb8
[16239.266344] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fd370ccb688
[16239.266346] Mem-Info:
[16239.266351] active_anon:5393045 inactive_anon:388189 isolated_anon:32
                active_file:103 inactive_file:0 isolated_file:0
                unevictable:683 dirty:0 writeback:0 unstable:0
                slab_reclaimable:15928 slab_unreclaimable:78860
                mapped:141496 shmem:200195 pagetables:21681 bounce:0
                free:41906 free_pcp:0 free_cma:0
[16239.266354] Node 0 active_anon:21572180kB inactive_anon:1552756kB active_file:412kB inactive_file:0kB unevictable:2732kB isolated(anon):128kB isolated(file):0kB mapped:565984kB dirty:0kB writeback:0kB shmem:800780kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 40960kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[16239.266355] Node 0 DMA free:15908kB min:40kB low:52kB high:64kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266360] lowmem_reserve[]: 0 2917 24034 24034 24034
[16239.266364] Node 0 DMA32 free:92472kB min:8196kB low:11180kB high:14164kB active_anon:2902112kB inactive_anon:104kB active_file:64kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129196kB managed:3043408kB mlocked:0kB kernel_stack:160kB pagetables:6708kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266369] lowmem_reserve[]: 0 0 21117 21117 21117
[16239.266373] Node 0 Normal free:59244kB min:59340kB low:80964kB high:102588kB active_anon:18668588kB inactive_anon:1551400kB active_file:0kB inactive_file:604kB unevictable:2732kB writepending:0kB present:22020096kB managed:21624148kB mlocked:0kB kernel_stack:7792kB pagetables:80016kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[16239.266384] lowmem_reserve[]: 0 0 0 0 0
[16239.266388] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[16239.266405] Node 0 DMA32: 301*4kB (UME) 295*8kB (UME) 325*16kB (UME) 223*32kB (UME) 67*64kB (UME) 39*128kB (UME) 40*256kB (UME) 39*512kB (UM) 37*1024kB (UME) 0*2048kB 0*4096kB = 93276kB
[16239.266421] Node 0 Normal: 79*4kB (UME) 41*8kB (UME) 97*16kB (UM) 18*32kB (M) 55*64kB (M) 390*128kB (M) 19*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61076kB
[16239.266443] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[16239.266444] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[16239.266445] 212081 total pagecache pages
[16239.266450] 10932 pages in swap cache
[16239.266452] Swap cache stats: add 19760750, delete 19753628, find 1100161/1419572
[16239.266452] Free swap  = 0kB
[16239.266453] Total swap = 16776188kB
[16239.266454] 6291321 pages RAM
[16239.266455] 0 pages HighMem/MovableOnly
[16239.266455] 120455 pages reserved
[16239.266456] 0 pages hwpoisoned
[16239.266457] Tasks state (memory values in pages):
[16239.266458] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[16239.266468] [    273]     0   273      767        1    45056       26             0 udhcpc
[16239.266470] [    307]     0   307      767        1    40960       26             0 udhcpc
[16239.266473] [    597]     0   597     6010        0    69632      355         -1000 systemd-udevd
[16239.266481] [   3200]     0  3200    20332      235   192512     1340             0 systemd-journal
[16239.266483] [   3237]     0  3237     2372        1    53248      324             0 dhclient
[16239.266486] [   3275]   145  3275    23215        0    81920      221             0 systemd-timesyn
[16239.266488] [   3358]     0  3358     4858        1    69632      243             0 systemd-logind
[16239.266491] [   3359]     0  3359     1608       23    53248       36             0 qemu-ga
[16239.266493] [   3361]     0  3361     1677       20    49152       31             0 nscd
[16239.266495] [   3363]   103  3363     2994       72    65536      109          -900 dbus-daemon
[16239.266497] [   3364]     0  3364    73955       12    90112      469             0 rsyslogd
[16239.266499] [   3368]     0  3368    39562        2    69632      128             0 zed
[16239.266501] [   3373]     0  3373      581        1    45056       34             0 acpid
[16239.266503] [   3374]     0  3374     2122       14    57344       59             0 cron
[16239.266505] [   3385]     0  3385   481148     1453   389120     6733             0 containerd
[16239.266507] [   3394]     0  3394     1351        0    53248       44             0 atd
[16239.266509] [   3395]   109  3395     4680       28    73728      257             0 dnsmasq
[16239.266511] [   3452]   114  3452    97198        0   122880     1298         -1000 nslcd
[16239.266513] [   3570]     0  3570      597        0    45056       23             0 sshguard
[16239.266515] [   3571]     0  3571      597        1    45056       23             0 sshguard
[16239.266517] [   3572]     0  3572      979        2    45056       27             0 sshg-parser
[16239.266519] [   3573]     0  3573    17046       64   159744      186             0 journalctl
[16239.266521] [   3574]     0  3574     2727        0    40960      110             0 sshg-blocker
[16239.266523] [   3575]     0  3575      597        1    45056       23             0 sshguard
[16239.266525] [   3576]     0  3576      597        1    45056       25             0 sshg-fw-nft-set
[16239.266528] [   3684] 22502  3684     6623       13    94208      284             0 zabbix_agentd
[16239.266530] [   3721]     0  3721     3942       23    73728      193         -1000 sshd
[16239.266536] [   3761] 22502  3761     6623      153    94208      269             0 zabbix_agentd
[16239.266539] [   3762] 22502  3762     6657       12    90112      317             0 zabbix_agentd
[16239.266540] [   3763] 22502  3763     6657       32    90112      297             0 zabbix_agentd
[16239.266542] [   3764] 22502  3764     6657       25    90112      305             0 zabbix_agentd
[16239.266544] [   3766] 22502  3766     6657       25    90112      304             0 zabbix_agentd
[16239.266547] [   3767] 22502  3767     6657       23    90112      306             0 zabbix_agentd
[16239.266548] [   3768] 22502  3768     6657        2    90112      329             0 zabbix_agentd
[16239.266550] [   3769] 22502  3769     6657       16    90112      314             0 zabbix_agentd
[16239.266554] [   3770] 22502  3770     6657       40    90112      289             0 zabbix_agentd
[16239.266556] [   3771] 22502  3771     6657       11    90112      319             0 zabbix_agentd
[16239.266558] [   3772] 22502  3772     6657        8    90112      321             0 zabbix_agentd
[16239.266560] [   3773] 22502  3773     6763      208    94208      342             0 zabbix_agentd
[16239.266562] [   4065]     0  4065    10867       20    77824      190             0 master
[16239.266565] [   4109]   108  4109    11635       10    90112      192             0 qmgr
[16239.266567] [   4722]     0  4722      641        0    45056       29             0 agetty
[16239.266570] [   5974]     0  5974     5256        1    90112      316             0 sshd
[16239.266572] [   5988] 20224  5988     5740        0    77824      448             0 systemd
[16239.266574] [   5990] 20224  5990    44419        1   110592      897             0 (sd-pam)
[16239.266576] [   6010] 20224  6010     5256        0    86016      335             0 sshd
[16239.266579] [   6011] 20224  6011     2667        1    65536      828             0 bash
[16239.266581] [   6145]     0  6145     3953        0    69632      208             0 sudo
[16239.266583] [   6151]     0  6151     2666        1    57344      920             0 bash
[16239.266586] [  11530]     0 11530     2144        0    53248       66             0 screen
[16239.266588] [  11532]     0 11532     2221        0    49152      175             0 screen
[16239.266591] [  11534]     0 11534     2481        0    57344      715             0 bash
[16239.266593] [  21703]     0 21703     5256        1    77824      318             0 sshd
[16239.266595] [  21709] 20224 21709     5256        0    73728      349             0 sshd
[16239.266597] [  21710] 20224 21710     2667        1    61440      828             0 bash
[16239.266599] [  21881]     0 21881     3953        0    73728      208             0 sudo
[16239.266601] [  21885]     0 21885     2765        1    65536     1200             0 bash
[16239.266604] [  25673]   108 25673    11109        0    73728      205             0 pickup
[16239.266605] [    416]   108   416    11110        0    81920      206             0 showq
[16239.266609] [  11054]     0 11054   438013      471   483328    11944          -500 dockerd
[16239.266613] [  12923]     0 12923   358596        0   376832     9974             0 docker
[16239.266616] [  12956]     0 12956   358660        0   372736     8946             0 docker
[16239.266619] [  13485]     0 13485   137180        0   114688     1371          -500 docker-proxy
[16239.266623] [  13492]     0 13492    26924      116    69632      914          -999 containerd-shim
[16239.266626] [  13506]     0 13506    26924      154    69632      861          -999 containerd-shim
[16239.266629] [  13528]  5050 13528     6237      188    94208     3999             0 gunicorn
[16239.266631] [  13534]    70 13534  1598453    35329   397312      206             0 postgres
[16239.266634] [  13715]     0 13715      714        0    45056       62             0 master
[16239.266636] [  13716]   100 13716      712        0    45056       43             0 pickup
[16239.266645] [  13717]   100 13717      724        0    45056       41             0 qmgr
[16239.266647] [  13774]  5050 13774    26796      536   258048    19965             0 gunicorn
[16239.266649] [  15467]     0 15467   340227        0   368640     7119             0 docker
[16239.266651] [  20613]    70 20613  1598453       17    94208      236             0 postgres
[16239.266652] [  20614]    70 20614  1598489    12089   212992      221             0 postgres
[16239.266654] [  20615]    70 20615  1598453     4165   126976      226             0 postgres
[16239.266656] [  20616]    70 20616  1598564      220   147456      266             0 postgres
[16239.266658] [  20617]    70 20617    11518       68    77824      214             0 postgres
[16239.266660] [  20618]    70 20618  1598534      184   143360      273             0 postgres
[16239.266662] [  20619]    70 20619  1598532        0   118784      306             0 postgres
[16239.266664] [  20620]    70 20620  1598986      944   212992      540             0 postgres
[16239.266666] [  20621]    70 20621  1598986     1102   212992      536             0 postgres
[16239.266668] [  20637]    70 20637  4222263  1347732 21577728  1282963             0 postgres
[16239.266670] [  20691]    70 20691  4060362  1379043 20291584  1095924             0 postgres
[16239.266671] [  20805]    70 20805  3875951  1461398 18739200   823175             0 postgres
[16239.266673] [  20854]    70 20854  3869835  1415478 18767872   868652             0 postgres
[16239.266676] [  20931]    70 20931  1607228    44795  1392640      193             0 postgres
[16239.266680] [  20993]    70 20993  1599540    96876  1466368       98             0 postgres
[16239.266682] Out of memory: Kill process 20637 (postgres) score 254 or sacrifice child
[16239.268315] Killed process 20637 (postgres) total-vm:16889052kB, anon-rss:5357136kB, file-rss:0kB, shmem-rss:33792kB
[16239.798933] oom_reaper: reaped process 20637 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:33792kB

postgresql.conf has been configured using timescaledb-tune and everything is set to the recommended values.

No idea where to look further from here.

rockaut commented 4 years ago

Hey there ... we have the exact same problem here (just with a much smaller DB), also with Zabbix. We converted the tables back to PostgreSQL-native for now.

zabbix=> \dx
                                         List of installed extensions
        Name        | Version |   Schema   |                            Description
--------------------+---------+------------+-------------------------------------------------------------------
 timescaledb        | 1.3.2   | public     | Enables scalable inserts and complex queries for time-series data
(5 rows)

zabbix=> SELECT version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 11.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)

zabbix=>
neogan74 commented 4 years ago

I investigated this deeply. It seems it is all related to a PostgreSQL issue: on clean PostgreSQL 10/11 with partitioning, if trends are also partitioned by day, you can face such an OOM crash once you reach 350-400 partitions. So a solution, at least for Zabbix, could be to use bigger partitions (by week or month); then you never face this.
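For the Zabbix trends hypertables, that would look roughly like this (a sketch; clock is a Unix-epoch integer, so the interval is given in seconds, and 2592000 is about 30 days):

-- When creating the hypertable, use monthly chunks instead of daily ones:
SELECT create_hypertable('trends_uint', 'clock', chunk_time_interval => 2592000);

-- Or, for an existing hypertable (only chunks created afterwards are affected):
SELECT set_chunk_time_interval('trends_uint', 2592000);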

rockaut commented 4 years ago

Interesting. We had 32 days for history and 365 for trends. I don't know how many chunks we had for trends, but even with history we would have had 5
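If it helps, the number of existing chunks for a hypertable can be checked with TimescaleDB's show_chunks function, e.g.:

SELECT count(*) FROM show_chunks('trends_uint');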

I may test that theory once I get back to work after the current situation. You may lose some performance optimization from the built-in TimescaleDB functions.

Hmm... will post some thoughts over at the Zabbix issue too.

GioMac commented 4 years ago

Same here. CentOS 7 (latest updates), Zabbix 4.4.7 from repo.zabbix.com, postgresql11-server-11.7-1PGDG.rhel7.x86_64, timescaledb_11-1.6.1-1.rhel7.x86_64.

rockaut commented 4 years ago

The Zabbix team is working on a workaround -> https://support.zabbix.com/browse/ZBX-16347. They of course can't fix it, as the problem seems to be in TimescaleDB or even in PostgreSQL itself, but they seem to be highlighting the problem in the docs and changing the partition timespans for the trends tables.

@bboule are you from the Timescale team? Have you had any insights on this problem, maybe for PostgreSQL itself, or seen other reports too?

GioMac commented 4 years ago

According to the Zabbix community, the issue is confirmed on CentOS, Ubuntu and Debian distributions. Probably there are no others using this particular feature yet; therefore, I assume it's not an issue with a specific build...

erimatnor commented 3 years ago

Touching lots of partitions/chunks during UPDATEs can lead to OOM. PostgreSQL 12 should improve the situation, as it prunes sub-tables during DELETEs and UPDATEs. Note that TimescaleDB doesn't specifically optimize UPDATEs right now.
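One way to narrow this down on a given setup is to EXPLAIN a single offending statement (without ANALYZE, so nothing is executed) and watch the backend's memory; a sketch using one of the statements from the log above:

EXPLAIN (COSTS OFF)
UPDATE trends_uint
   SET num = 120, value_min = 0, value_avg = 0, value_max = 0
 WHERE itemid = 200200000580090 AND clock = 1584309600;
-- The statement is only planned here, not executed. If memory already grows when
-- many chunks exist, the cost is in planning across the partitions, which matches
-- the explanation above.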

Closing since this issue doesn't seem specific to TimescaleDB.

hostops commented 3 years ago

Hello! I do not think this issue should be closed. We have the same problem with DELETE statements on PostgreSQL 13.3 and TimescaleDB 2.3.1. I wrote up the details of our issue here: https://github.com/timescale/timescaledb/issues/1085