Open vladimirtiukhtin opened 3 months ago
I'm going to assume this is on Linux and thus signal 7 is a SIGBUS.
If you have a coredump you can try
$ gdb /path/to/unitd /path/to/coredump
(gdb) bt
and show the output.
If you don't have a coredump, make sure the shell that starts unitd has ulimit -c unlimited set.
If you have the systemd-coredump service enabled, then try
$ coredumpctl gdb
(gdb) bt
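A minimal sketch of that workflow inside a container. The paths are illustrative, not from this issue, and note that kernel.core_pattern is a node-wide kernel setting, so on Kubernetes it must be changed on the node or from a privileged context, not from an ordinary pod:

```shell
# Allow core files in the shell that launches unitd
ulimit -c unlimited

# Cores are written according to the node-wide kernel.core_pattern;
# point it somewhere writable (requires privileges; node-wide effect)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After a crash, open the core (hypothetical filename) and print a backtrace
gdb /usr/sbin/unitd /tmp/core.unitd.12345 -ex bt -ex quit
```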
Hi @ac000. I should rename the ticket: I am getting these issues on both later and earlier versions of Unit. I run Unit in Kubernetes, so there is no systemd. But I enabled debug mode, and here is the output. Any advice would be much appreciated.
2024/05/15 10:46:33.756 [debug] 70#98 free(7FC39800E880)
2024/05/15 10:46:33.756 [debug] 70#98 free(7FC398004890)
2024/05/15 10:46:33.756 [debug] 70#98 *11 timer found minimum: 1065213840±50:1065033840
2024/05/15 10:46:33.756 [debug] 70#98 epoll_wait(26) timeout:180000
2024/05/15 10:46:33.820 [debug] 71#71 epoll_wait(3): 1
2024/05/15 10:46:33.820 [debug] 71#71 epoll: fd:4 ev:0001 d:555B29A669C8 rd:0 wr:0
2024/05/15 10:46:33.820 [debug] 71#71 timer expire minimum: 1151429004:1065033904
2024/05/15 10:46:33.820 [debug] 71#71 work queue: fast
2024/05/15 10:46:33.820 [debug] 71#71 signalfd handler
2024/05/15 10:46:33.820 [debug] 71#71 read signalfd(4): 128
2024/05/15 10:46:33.820 [debug] 71#71 signalfd(4) signo:17
2024/05/15 10:46:33.820 [debug] 71#71 proto sigchld handler signo:17 (SIGCHLD)
2024/05/15 10:46:33.820 [debug] 71#71 waitpid(): 43267
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7C380)
2024/05/15 10:46:33.820 [debug] 71#71 process (isolated 43267) removed
2024/05/15 10:46:33.820 [alert] 71#71 app process 43267 exited on signal 7
2024/05/15 10:46:33.820 [debug] 71#71 malloc(136): 555B29A7C380
2024/05/15 10:46:33.820 [debug] 71#71 posix_memalign(128, 1024): 555B29A84780
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A66650 retain: 2
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A730C0) enter
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A730C0) exit
2024/05/15 10:46:33.820 [debug] 71#71 using plain mode
2024/05/15 10:46:33.820 [debug] 71#71 sendbuf: 0, 555B29A847F8, 4
2024/05/15 10:46:33.820 [debug] 71#71 sendmsg(9, -1, -1, 2): 20
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A66650 retain: 3
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A7B160) enter
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A7B160) exit
2024/05/15 10:46:33.820 [debug] 71#71 using plain mode
2024/05/15 10:46:33.820 [debug] 71#71 sendbuf: 0, 555B29A84878, 4
2024/05/15 10:46:33.820 [debug] 71#71 sendmsg(13, -1, -1, 2): 20
2024/05/15 10:46:33.820 [debug] 71#71 port 555B29A7A880 43267:0 close, type 5
2024/05/15 10:46:33.820 [debug] 71#71 close(12)
2024/05/15 10:46:33.820 [debug] 1#1 epoll_wait(3): 1
2024/05/15 10:46:33.820 [debug] 71#71 port 555B29A7A880 43267:0 release, type 5
2024/05/15 10:46:33.820 [debug] 1#1 epoll: fd:8 ev:0001 d:555B29A73010 rd:5 wr:0
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A79070 release: 0
2024/05/15 10:46:33.820 [debug] 1#1 timer expire minimum: 1151429004:1065033904
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_destroy(555B29A7A930)
2024/05/15 10:46:33.820 [debug] 1#1 work queue: fast
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7A880)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A71EF0)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A84300)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A79070)
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A659A8) enter
2024/05/15 10:46:33.820 [debug] 71#71 process 43267 removed
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A659A8) exit
2024/05/15 10:46:33.820 [debug] 1#1 recvmsg(8, 2, 32): 20
2024/05/15 10:46:33.820 [debug] 71#71 free(0)
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_destroy(555B29A7CB68)
2024/05/15 10:46:33.820 [debug] 1#1 port 8: message type:21 fds:-1,-1
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7C280)
2024/05/15 10:46:33.820 [debug] 1#1 port remove pid 43267 handler
Could you provide a few more details about your environment? What OS? Bare metal, VM, or container? Unit config, etc.
I have 3 independent Kubernetes clusters, and the issue is reproducible on each of them. The interesting part is that the app may run for a week or two before it begins to fail with signal 7 and stops working.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/environment: development
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: terraform
    app.kubernetes.io/name: api
    app.kubernetes.io/part-of: XXX
  name: api
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/config-hash: 2b16a68b939f7bc0a961d2f017ba6e16
        app.kubernetes.io/environment: development
        app.kubernetes.io/instance: main
        app.kubernetes.io/managed-by: terraform
        app.kubernetes.io/name: api
        app.kubernetes.io/part-of: XXX
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - api
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - unitd-debug
        - --no-daemon
        - --control
        - unix:/var/run/control.unit.sock
        env: []
        image: api:5b1d704c
        imagePullPolicy: IfNotPresent
        name: api
        ports:
        - containerPort: 8001
          name: api
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: api
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /docker-entrypoint.d
          mountPropagation: None
          name: unit-config
          readOnly: true
        - mountPath: /var/lib/api/prometheus
          mountPropagation: None
          name: prometheus
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: api-unit-config
          optional: false
        name: unit-config
      - emptyDir:
          medium: Memory
          sizeLimit: 32Mi
        name: prometheus
---
apiVersion: v1
data:
  unit.json: '{"applications":{"main":{"callable":"app","home":"/app/.venv","limits":{"requests":100,"timeout":300},"module":"main","path":"/app","processes":{"idle_timeout":60,"max":4,"spare":2},"protocol":"asgi","type":"python3"},"metrics":{"callable":"internal","home":"/app/.venv","module":"internal","path":"/app","processes":1,"protocol":"asgi","type":"python3"}},"listeners":{"*:8001":{"pass":"applications/main"},"*:8002":{"pass":"applications/metrics"}}}'
immutable: false
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/environment: development
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: terraform
    app.kubernetes.io/name: api
    app.kubernetes.io/part-of: XXX
  name: api-unit-config
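For readability, the unit.json embedded in that ConfigMap, pretty-printed (content unchanged):

```json
{
  "applications": {
    "main": {
      "callable": "app",
      "home": "/app/.venv",
      "limits": { "requests": 100, "timeout": 300 },
      "module": "main",
      "path": "/app",
      "processes": { "idle_timeout": 60, "max": 4, "spare": 2 },
      "protocol": "asgi",
      "type": "python3"
    },
    "metrics": {
      "callable": "internal",
      "home": "/app/.venv",
      "module": "internal",
      "path": "/app",
      "processes": 1,
      "protocol": "asgi",
      "type": "python3"
    }
  },
  "listeners": {
    "*:8001": { "pass": "applications/main" },
    "*:8002": { "pass": "applications/metrics" }
  }
}
```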
Thanks!
Given that it's getting a SIGBUS, I'm still inclined to suspect it's being OOM-killed (Out of Memory) rather than an outright crash due to a SIGSEGV or SIGABRT, but this is not a hard and fast rule...
I see you have two Python applications; is it always the same application that has the issue?
I don't really know anything about Kubernetes, but are you able to view dmesg(1) after such an issue occurs?
You'd be looking for lines that look something like
[845092.010611] unitd[28179]: segfault at 0 ip 000000000042eb62 sp 00007fc0e77ff980 error 4 in unitd[407000+3b000] likely on CPU 0 (core 0, socket 0)
[845092.010645] Code: f7 d8 19 ed 48 89 df e8 3c 85 fd ff 89 e8 48 83 c4 08 5b 5d c3 bd 00 00 00 00 eb e8 bd ff ff ff ff eb e9 53 48 89 fb 48 8b 07 <48> 8b 30 48 8d 3d 15 50 01 00 b8 00 00 00 00 e8 8a 87 fd ff 48 8b
Or for an oom-kill you would see something along the lines of the following (you really can't miss it, and this is after it's been trimmed down...)
kernel: dnf invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
kernel: CPU: 1 PID: 17809 Comm: dnf Not tainted 6.3.5-200.fc38.x86_64 #1
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
kernel: Call Trace:
kernel: <TASK>
kernel: dump_stack_lvl+0x47/0x60
kernel: dump_header+0x4a/0x260
kernel: oom_kill_process+0xf9/0x190
kernel: out_of_memory+0x1d2/0x570
kernel: __alloc_pages_slowpath.constprop.0+0xcbc/0xe10
kernel: __alloc_pages+0x224/0x250
kernel: folio_alloc+0x1b/0x50
kernel: __filemap_get_folio+0x15e/0x430
kernel: filemap_fault+0x169/0x950
kernel: __do_fault+0x30/0x150
kernel: do_fault+0x1d1/0x430
kernel: __handle_mm_fault+0x653/0xf70
kernel: handle_mm_fault+0x11e/0x310
kernel: do_user_addr_fault+0x1be/0x720
kernel: exc_page_fault+0x7c/0x180
kernel: asm_exc_page_fault+0x26/0x30
kernel: RIP: 0033:0x7fa9473bbed0
kernel: Code: Unable to access opcode bytes at 0x7fa9473bbea6.
kernel: RSP: 002b:00007ffdc30f0a28 EFLAGS: 00010246
kernel: RAX: 000000000020000c RBX: 000000000020004c RCX: 0000000000000001
kernel: RDX: 00007fa946cbbd00 RSI: 0000000000000000 RDI: 000000000020004c
kernel: RBP: 00007ffdc30f0a40 R08: 0000000000000000 R09: 0000000561b93334
kernel: R10: 00007ffdc3163080 R11: 0000000003554222 R12: 0000561b933343d0
kernel: R13: 00007fa9477de1c0 R14: 0000561b8fa760d0 R15: 0000561b8fa84688
kernel: </TASK>
kernel: Mem-Info:
kernel: active_anon:82801 inactive_anon:33087 isolated_anon:0
active_file:135 inactive_file:8 isolated_file:0
unevictable:0 dirty:0 writeback:0
slab_reclaimable:8301 slab_unreclaimable:11341
mapped:251 shmem:5 pagetables:5864
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:12963 free_pcp:25 free_cma:0
kernel: Node 0 active_anon:331204kB inactive_anon:132348kB active_file:540kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1004kB dirty:0kB writeback:0kB shmem:20kB shmem_thp: 0kB shmem_pmdmappe>
kernel: Node 0 DMA free:4284kB boost:0kB min:732kB low:912kB high:1092kB reserved_highatomic:0KB active_anon:3236kB inactive_anon:1884kB active_file:0kB inactive_file:92kB unevictable:0kB writepending:0kB present:15992kB managed>
kernel: lowmem_reserve[]: 0 907 907 907 907
kernel: Node 0 DMA32 free:47568kB boost:0kB min:44320kB low:55400kB high:66480kB reserved_highatomic:4096KB active_anon:327968kB inactive_anon:130464kB active_file:528kB inactive_file:480kB unevictable:0kB writepending:0kB prese>
kernel: lowmem_reserve[]: 0 0 0 0 0
kernel: Node 0 DMA: 22*4kB (UME) 30*8kB (UME) 41*16kB (UME) 24*32kB (UE) 11*64kB (UME) 3*128kB (ME) 0*256kB 1*512kB (M) 1*1024kB (E) 0*2048kB 0*4096kB = 4376kB
kernel: Node 0 DMA32: 266*4kB (UME) 58*8kB (UM) 100*16kB (UME) 24*32kB (UM) 6*64kB (UM) 24*128kB (M) 42*256kB (UM) 9*512kB (UM) 6*1024kB (UM) 9*2048kB (M) 0*4096kB = 47288kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
kernel: 446 total pagecache pages
kernel: 8 pages in swap cache
kernel: Free swap = 120kB
kernel: Total swap = 969724kB
kernel: 262011 pages RAM
kernel: 0 pages HighMem/MovableOnly
kernel: 19547 pages reserved
kernel: 0 pages cma reserved
kernel: 0 pages hwpoisoned
kernel: Tasks state (memory values in pages):
kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kernel: [ 397] 0 397 11997 44 94208 288 -250 systemd-journal
kernel: [ 412] 0 412 7708 32 86016 512 -1000 systemd-udevd
kernel: [ 455] 999 455 3796 96 65536 192 -900 systemd-oomd
kernel: [ 458] 193 458 4886 128 73728 448 0 systemd-resolve
kernel: [ 459] 0 459 3720 32 65536 192 0 systemd-userdbd
kernel: [ 466] 0 466 4457 12 57344 192 -1000 auditd
kernel: [ 486] 0 486 83119 128 143360 608 0 NetworkManager
kernel: [ 489] 0 489 19792 32 53248 32 0 irqbalance
kernel: [ 490] 0 490 654 32 40960 32 0 mcelog
kernel: [ 491] 998 491 76726 64 86016 192 0 polkitd
kernel: [ 492] 0 492 76959 50 372736 352 0 rsyslogd
kernel: [ 493] 0 493 3860 32 73728 256 0 systemd-logind
kernel: [ 494] 81 494 2326 4 57344 192 -900 dbus-broker-lau
kernel: [ 502] 994 502 21689 3 69632 192 0 chronyd
kernel: [ 506] 81 506 1316 9 57344 192 -900 dbus-broker
kernel: [ 515] 0 515 3522 64 69632 288 -1000 sshd
kernel: [ 521] 0 521 79022 32 110592 448 0 ModemManager
kernel: [ 527] 0 527 13689 3 81920 224 0 gssproxy
kernel: [ 535] 0 535 815 0 45056 32 0 atd
kernel: [ 536] 0 536 2098 0 53248 192 0 crond
kernel: [ 539] 0 539 719 32 49152 32 0 agetty
kernel: [ 603] 0 603 4012 32 69632 384 0 sshd
kernel: [ 606] 1000 606 4512 32 77824 448 100 systemd
kernel: [ 607] 1000 607 26347 34 94208 896 100 (sd-pam)
kernel: [ 614] 1000 614 4096 7 69632 480 0 sshd
kernel: [ 615] 1000 615 2286 0 57344 608 0 bash
kernel: [ 899] 0 899 4021 32 73728 384 0 sshd
kernel: [ 901] 1000 901 4077 34 77824 480 0 sshd
kernel: [ 902] 1000 902 2283 32 49152 608 0 bash
kernel: [ 947] 0 947 4022 32 69632 384 0 sshd
kernel: [ 952] 1000 952 4096 0 69632 512 0 sshd
kernel: [ 953] 1000 953 2999 0 61440 928 0 bash
kernel: [ 3610] 0 3610 4022 64 69632 384 0 sshd
kernel: [ 3612] 1000 3612 4096 45 69632 448 0 sshd
kernel: [ 3613] 1000 3613 2250 32 61440 544 0 bash
kernel: [ 4162] 0 4162 4022 64 69632 384 0 sshd
kernel: [ 4164] 1000 4164 4096 64 69632 448 0 sshd
kernel: [ 4165] 1000 4165 2250 0 57344 608 0 bash
kernel: [ 6093] 1000 6093 1247 32 49152 32 0 unitd
kernel: [ 6095] 1000 6095 1086 34 49152 32 0 unitd
kernel: [ 6096] 1000 6096 44999 98 86016 64 0 unitd
kernel: [ 6097] 1000 6097 50868 99 126976 992 0 unitd
kernel: [ 6098] 1000 6098 57560 81 143360 1088 0 unitd
kernel: [ 15599] 1000 15599 3884 32 73728 256 0 su
kernel: [ 15697] 0 15697 3813 32 65536 192 0 systemd-userwor
kernel: [ 15908] 0 15908 3813 0 65536 224 0 systemd-userwor
kernel: [ 15909] 0 15909 3813 32 69632 192 0 systemd-userwor
kernel: [ 17808] 1000 17808 4526 32 73728 288 0 sudo
kernel: [ 17809] 0 17809 103114 16972 446464 15772 0 dnf
kernel: [ 17810] 1000 17810 3488 160 65536 448 0 make
kernel: [ 17826] 1000 17826 1221 0 49152 32 0 cc
kernel: [ 17832] 1000 17832 13133 576 135168 3168 0 cc1
kernel: [ 17833] 1000 17833 1221 32 49152 32 0 cc
kernel: [ 17837] 1000 17837 1659 32 53248 384 0 as
kernel: [ 17839] 1000 17839 13156 608 135168 2976 0 cc1
kernel: [ 17844] 1000 17844 1221 32 45056 32 0 cc
kernel: [ 17845] 1000 17845 1221 32 53248 32 0 cc
kernel: [ 17847] 1000 17847 13157 416 139264 3328 0 cc1
kernel: [ 17851] 1000 17851 1659 32 49152 352 0 as
kernel: [ 17853] 1000 17853 1659 32 49152 384 0 as
kernel: [ 17856] 1000 17856 1221 32 40960 32 0 cc
kernel: [ 17857] 1000 17857 12944 576 139264 2880 0 cc1
kernel: [ 17859] 1000 17859 1659 32 53248 352 0 as
kernel: [ 17860] 1000 17860 1221 32 49152 32 0 cc
kernel: [ 17862] 1000 17862 1221 32 40960 32 0 cc
kernel: [ 17863] 1000 17863 13138 1120 143360 2432 0 cc1
kernel: [ 17864] 1000 17864 1221 32 49152 32 0 cc
kernel: [ 17865] 1000 17865 1659 64 49152 352 0 as
kernel: [ 17866] 1000 17866 12881 1312 135168 2144 0 cc1
kernel: [ 17867] 1000 17867 1221 32 49152 32 0 cc
kernel: [ 17868] 1000 17868 1659 32 49152 384 0 as
kernel: [ 17869] 1000 17869 12975 832 139264 2752 0 cc1
kernel: [ 17871] 1000 17871 13125 576 143360 2912 0 cc1
kernel: [ 17872] 1000 17872 1221 32 40960 32 0 cc
kernel: [ 17873] 1000 17873 1659 32 57344 352 0 as
kernel: [ 17874] 1000 17874 12964 768 139264 2816 0 cc1
kernel: [ 17876] 1000 17876 1659 32 49152 352 0 as
kernel: [ 17878] 1000 17878 1221 0 45056 32 0 cc
kernel: [ 17879] 1000 17879 1659 32 49152 352 0 as
kernel: [ 17880] 1000 17880 13113 608 135168 2784 0 cc1
kernel: [ 17883] 1000 17883 1221 32 49152 32 0 cc
kernel: [ 17884] 1000 17884 1659 32 49152 384 0 as
kernel: [ 17885] 1000 17885 13120 512 139264 3136 0 cc1
kernel: [ 17887] 1000 17887 1221 32 40960 32 0 cc
kernel: [ 17888] 1000 17888 1659 32 53248 384 0 as
kernel: [ 17889] 1000 17889 1221 32 45056 32 0 cc
kernel: [ 17890] 1000 17890 12761 1536 131072 1568 0 cc1
kernel: [ 17891] 1000 17891 12963 511 139264 2784 0 cc1
kernel: [ 17892] 1000 17892 1221 32 49152 32 0 cc
kernel: [ 17893] 1000 17893 1659 32 49152 352 0 as
kernel: [ 17894] 1000 17894 1659 32 49152 352 0 as
kernel: [ 17895] 1000 17895 1221 0 45056 32 0 cc
kernel: [ 17896] 1000 17896 12739 1280 135168 1792 0 cc1
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/init.scope,task=(sd-pam),pid=607,uid=1000
kernel: Out of memory: Killed process 607 ((sd-pam)) total-vm:105388kB, anon-rss:8kB, file-rss:128kB, shmem-rss:0kB, UID:1000 pgtables:92kB oom_score_adj:100
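The two signatures above can be scanned for in one pass. This is a hypothetical grep, assuming you can read the kernel log on the node where the pod ran:

```shell
# Scan the kernel ring buffer for crash or OOM-killer signatures
dmesg -T 2>/dev/null \
  | grep -Ei 'segfault|sigbus|bus error|out of memory|oom-kill' \
  | tail -n 20

# Or, where journald collects kernel messages:
journalctl -k --no-pager \
  | grep -Ei 'segfault|out of memory|oom-kill' \
  | tail -n 20
```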
Thanks for your prompt responses, @ac000. I went through dmesg on all my Kubernetes nodes; there aren't any memory issues. I also checked Prometheus for the period, and we never went above the allowed 2 GiB of memory.
But let me increase the memory limits and run it for a while; I will update you.
I am afraid that even with memory allocated well above actual usage, I am still seeing signal 7. Unit cannot spin up workers and the app goes down.
After upgrading to 1.32 I began to see the above. How do I debug this?
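For reference, the shell's kill builtin maps a signal number to its name; on Linux x86-64, signal 7 is SIGBUS (signal numbering differs on some architectures, e.g. Alpha and SPARC):

```shell
# POSIX: kill -l NUM prints the signal name without the SIG prefix
kill -l 7    # on Linux x86-64 this prints: BUS
```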