Open vladimirtiukhtin opened 3 months ago
I'm going to assume this is on Linux and thus signal 7 is a SIGBUS.
If you have a coredump you can try
$ gdb /path/to/unitd /path/to/coredump
(gdb) bt
and show the output.
If you don't have a coredump, make sure the shell that starts unitd has ulimit -c unlimited set.
If you have the systemd-coredump service enabled, then try
$ coredumpctl gdb
(gdb) bt
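A minimal sketch of that workflow inside a container. The paths are illustrative, not from this issue, and note that kernel.core_pattern is a node-wide kernel setting, so on Kubernetes it must be changed on the node or from a privileged context, not from an ordinary pod:

```shell
# Allow core files in the shell that launches unitd
ulimit -c unlimited

# Cores are written according to the node-wide kernel.core_pattern;
# point it somewhere writable (requires privileges; node-wide effect)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After a crash, open the core (hypothetical filename) and print a backtrace
gdb /usr/sbin/unitd /tmp/core.unitd.12345 -ex bt -ex quit
```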
Hi @ac000. I should rename the ticket: I am getting these issues on both later and earlier versions of Unit. I run Unit in Kubernetes, so there is no systemd. But I enabled debug mode, and here is the output. Any advice would be much appreciated.
2024/05/15 10:46:33.756 [debug] 70#98 free(7FC39800E880)
2024/05/15 10:46:33.756 [debug] 70#98 free(7FC398004890)
2024/05/15 10:46:33.756 [debug] 70#98 *11 timer found minimum: 1065213840±50:1065033840
2024/05/15 10:46:33.756 [debug] 70#98 epoll_wait(26) timeout:180000
2024/05/15 10:46:33.820 [debug] 71#71 epoll_wait(3): 1
2024/05/15 10:46:33.820 [debug] 71#71 epoll: fd:4 ev:0001 d:555B29A669C8 rd:0 wr:0
2024/05/15 10:46:33.820 [debug] 71#71 timer expire minimum: 1151429004:1065033904
2024/05/15 10:46:33.820 [debug] 71#71 work queue: fast
2024/05/15 10:46:33.820 [debug] 71#71 signalfd handler
2024/05/15 10:46:33.820 [debug] 71#71 read signalfd(4): 128
2024/05/15 10:46:33.820 [debug] 71#71 signalfd(4) signo:17
2024/05/15 10:46:33.820 [debug] 71#71 proto sigchld handler signo:17 (SIGCHLD)
2024/05/15 10:46:33.820 [debug] 71#71 waitpid(): 43267
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7C380)
2024/05/15 10:46:33.820 [debug] 71#71 process (isolated 43267) removed
2024/05/15 10:46:33.820 [alert] 71#71 app process 43267 exited on signal 7
2024/05/15 10:46:33.820 [debug] 71#71 malloc(136): 555B29A7C380
2024/05/15 10:46:33.820 [debug] 71#71 posix_memalign(128, 1024): 555B29A84780
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A66650 retain: 2
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A730C0) enter
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A730C0) exit
2024/05/15 10:46:33.820 [debug] 71#71 using plain mode
2024/05/15 10:46:33.820 [debug] 71#71 sendbuf: 0, 555B29A847F8, 4
2024/05/15 10:46:33.820 [debug] 71#71 sendmsg(9, -1, -1, 2): 20
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A66650 retain: 3
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A7B160) enter
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A7B160) exit
2024/05/15 10:46:33.820 [debug] 71#71 using plain mode
2024/05/15 10:46:33.820 [debug] 71#71 sendbuf: 0, 555B29A84878, 4
2024/05/15 10:46:33.820 [debug] 71#71 sendmsg(13, -1, -1, 2): 20
2024/05/15 10:46:33.820 [debug] 71#71 port 555B29A7A880 43267:0 close, type 5
2024/05/15 10:46:33.820 [debug] 71#71 close(12)
2024/05/15 10:46:33.820 [debug] 1#1 epoll_wait(3): 1
2024/05/15 10:46:33.820 [debug] 71#71 port 555B29A7A880 43267:0 release, type 5
2024/05/15 10:46:33.820 [debug] 1#1 epoll: fd:8 ev:0001 d:555B29A73010 rd:5 wr:0
2024/05/15 10:46:33.820 [debug] 71#71 mp 555B29A79070 release: 0
2024/05/15 10:46:33.820 [debug] 1#1 timer expire minimum: 1151429004:1065033904
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_destroy(555B29A7A930)
2024/05/15 10:46:33.820 [debug] 1#1 work queue: fast
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7A880)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A71EF0)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A84300)
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A79070)
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_lock(555B29A659A8) enter
2024/05/15 10:46:33.820 [debug] 71#71 process 43267 removed
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_unlock(555B29A659A8) exit
2024/05/15 10:46:33.820 [debug] 1#1 recvmsg(8, 2, 32): 20
2024/05/15 10:46:33.820 [debug] 71#71 free(0)
2024/05/15 10:46:33.820 [debug] 71#71 pthread_mutex_destroy(555B29A7CB68)
2024/05/15 10:46:33.820 [debug] 1#1 port 8: message type:21 fds:-1,-1
2024/05/15 10:46:33.820 [debug] 71#71 free(555B29A7C280)
2024/05/15 10:46:33.820 [debug] 1#1 port remove pid 43267 handler
Could you provide a few more details about your environment? What OS? Bare metal, VM, or container? Unit config, etc.
I have 3 independent Kubernetes clusters, and the issue is reproducible on each of them. The interesting part is that the app may run for a week or two before it begins to fail with signal 7 and stops working.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/environment: development
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: terraform
    app.kubernetes.io/name: api
    app.kubernetes.io/part-of: XXX
  name: api
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/config-hash: 2b16a68b939f7bc0a961d2f017ba6e16
        app.kubernetes.io/environment: development
        app.kubernetes.io/instance: main
        app.kubernetes.io/managed-by: terraform
        app.kubernetes.io/name: api
        app.kubernetes.io/part-of: XXX
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - api
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - unitd-debug
        - --no-daemon
        - --control
        - unix:/var/run/control.unit.sock
        env: []
        image: api:5b1d704c
        imagePullPolicy: IfNotPresent
        name: api
        ports:
        - containerPort: 8001
          name: api
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: api
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /docker-entrypoint.d
          mountPropagation: None
          name: unit-config
          readOnly: true
        - mountPath: /var/lib/api/prometheus
          mountPropagation: None
          name: prometheus
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: api-unit-config
          optional: false
        name: unit-config
      - emptyDir:
          medium: Memory
          sizeLimit: 32Mi
        name: prometheus
---
apiVersion: v1
data:
  unit.json: '{"applications":{"main":{"callable":"app","home":"/app/.venv","limits":{"requests":100,"timeout":300},"module":"main","path":"/app","processes":{"idle_timeout":60,"max":4,"spare":2},"protocol":"asgi","type":"python3"},"metrics":{"callable":"internal","home":"/app/.venv","module":"internal","path":"/app","processes":1,"protocol":"asgi","type":"python3"}},"listeners":{"*:8001":{"pass":"applications/main"},"*:8002":{"pass":"applications/metrics"}}}'
immutable: false
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/environment: development
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: terraform
    app.kubernetes.io/name: api
    app.kubernetes.io/part-of: XXX
  name: api-unit-config
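For readability, the unit.json embedded in that ConfigMap, pretty-printed (content unchanged):

```json
{
  "applications": {
    "main": {
      "callable": "app",
      "home": "/app/.venv",
      "limits": { "requests": 100, "timeout": 300 },
      "module": "main",
      "path": "/app",
      "processes": { "idle_timeout": 60, "max": 4, "spare": 2 },
      "protocol": "asgi",
      "type": "python3"
    },
    "metrics": {
      "callable": "internal",
      "home": "/app/.venv",
      "module": "internal",
      "path": "/app",
      "processes": 1,
      "protocol": "asgi",
      "type": "python3"
    }
  },
  "listeners": {
    "*:8001": { "pass": "applications/main" },
    "*:8002": { "pass": "applications/metrics" }
  }
}
```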
Thanks!
Given that it's getting a SIGBUS, I'm still inclined to suspect it's being OOM-killed (Out of Memory) rather than an outright crash due to a SIGSEGV or SIGABRT, but this is not a hard and fast rule...
I see you have two Python applications; is it always the same application that has the issue?
I don't really know anything about Kubernetes, but are you able to view dmesg(1) after such an issue occurs?
You'd be looking for lines that look something like
[845092.010611] unitd[28179]: segfault at 0 ip 000000000042eb62 sp 00007fc0e77ff980 error 4 in unitd[407000+3b000] likely on CPU 0 (core 0, socket 0)
[845092.010645] Code: f7 d8 19 ed 48 89 df e8 3c 85 fd ff 89 e8 48 83 c4 08 5b 5d c3 bd 00 00 00 00 eb e8 bd ff ff ff ff eb e9 53 48 89 fb 48 8b 07 <48> 8b 30 48 8d 3d 15 50 01 00 b8 00 00 00 00 e8 8a 87 fd ff 48 8b
Or for an oom-kill you would see something along the lines of the following (you really can't miss it, and this is after it's been trimmed down...)
kernel: dnf invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
kernel: CPU: 1 PID: 17809 Comm: dnf Not tainted 6.3.5-200.fc38.x86_64 #1
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
kernel: Call Trace:
kernel: <TASK>
kernel: dump_stack_lvl+0x47/0x60
kernel: dump_header+0x4a/0x260
kernel: oom_kill_process+0xf9/0x190
kernel: out_of_memory+0x1d2/0x570
kernel: __alloc_pages_slowpath.constprop.0+0xcbc/0xe10
kernel: __alloc_pages+0x224/0x250
kernel: folio_alloc+0x1b/0x50
kernel: __filemap_get_folio+0x15e/0x430
kernel: filemap_fault+0x169/0x950
kernel: __do_fault+0x30/0x150
kernel: do_fault+0x1d1/0x430
kernel: __handle_mm_fault+0x653/0xf70
kernel: handle_mm_fault+0x11e/0x310
kernel: do_user_addr_fault+0x1be/0x720
kernel: exc_page_fault+0x7c/0x180
kernel: asm_exc_page_fault+0x26/0x30
kernel: RIP: 0033:0x7fa9473bbed0
kernel: Code: Unable to access opcode bytes at 0x7fa9473bbea6.
kernel: RSP: 002b:00007ffdc30f0a28 EFLAGS: 00010246
kernel: RAX: 000000000020000c RBX: 000000000020004c RCX: 0000000000000001
kernel: RDX: 00007fa946cbbd00 RSI: 0000000000000000 RDI: 000000000020004c
kernel: RBP: 00007ffdc30f0a40 R08: 0000000000000000 R09: 0000000561b93334
kernel: R10: 00007ffdc3163080 R11: 0000000003554222 R12: 0000561b933343d0
kernel: R13: 00007fa9477de1c0 R14: 0000561b8fa760d0 R15: 0000561b8fa84688
kernel: </TASK>
kernel: Mem-Info:
kernel: active_anon:82801 inactive_anon:33087 isolated_anon:0
active_file:135 inactive_file:8 isolated_file:0
unevictable:0 dirty:0 writeback:0
slab_reclaimable:8301 slab_unreclaimable:11341
mapped:251 shmem:5 pagetables:5864
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:12963 free_pcp:25 free_cma:0
kernel: Node 0 active_anon:331204kB inactive_anon:132348kB active_file:540kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1004kB dirty:0kB writeback:0kB shmem:20kB shmem_thp: 0kB shmem_pmdmappe>
kernel: Node 0 DMA free:4284kB boost:0kB min:732kB low:912kB high:1092kB reserved_highatomic:0KB active_anon:3236kB inactive_anon:1884kB active_file:0kB inactive_file:92kB unevictable:0kB writepending:0kB present:15992kB managed>
kernel: lowmem_reserve[]: 0 907 907 907 907
kernel: Node 0 DMA32 free:47568kB boost:0kB min:44320kB low:55400kB high:66480kB reserved_highatomic:4096KB active_anon:327968kB inactive_anon:130464kB active_file:528kB inactive_file:480kB unevictable:0kB writepending:0kB prese>
kernel: lowmem_reserve[]: 0 0 0 0 0
kernel: Node 0 DMA: 22*4kB (UME) 30*8kB (UME) 41*16kB (UME) 24*32kB (UE) 11*64kB (UME) 3*128kB (ME) 0*256kB 1*512kB (M) 1*1024kB (E) 0*2048kB 0*4096kB = 4376kB
kernel: Node 0 DMA32: 266*4kB (UME) 58*8kB (UM) 100*16kB (UME) 24*32kB (UM) 6*64kB (UM) 24*128kB (M) 42*256kB (UM) 9*512kB (UM) 6*1024kB (UM) 9*2048kB (M) 0*4096kB = 47288kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
kernel: 446 total pagecache pages
kernel: 8 pages in swap cache
kernel: Free swap = 120kB
kernel: Total swap = 969724kB
kernel: 262011 pages RAM
kernel: 0 pages HighMem/MovableOnly
kernel: 19547 pages reserved
kernel: 0 pages cma reserved
kernel: 0 pages hwpoisoned
kernel: Tasks state (memory values in pages):
kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kernel: [ 397] 0 397 11997 44 94208 288 -250 systemd-journal
kernel: [ 412] 0 412 7708 32 86016 512 -1000 systemd-udevd
kernel: [ 455] 999 455 3796 96 65536 192 -900 systemd-oomd
kernel: [ 458] 193 458 4886 128 73728 448 0 systemd-resolve
kernel: [ 459] 0 459 3720 32 65536 192 0 systemd-userdbd
kernel: [ 466] 0 466 4457 12 57344 192 -1000 auditd
kernel: [ 486] 0 486 83119 128 143360 608 0 NetworkManager
kernel: [ 489] 0 489 19792 32 53248 32 0 irqbalance
kernel: [ 490] 0 490 654 32 40960 32 0 mcelog
kernel: [ 491] 998 491 76726 64 86016 192 0 polkitd
kernel: [ 492] 0 492 76959 50 372736 352 0 rsyslogd
kernel: [ 493] 0 493 3860 32 73728 256 0 systemd-logind
kernel: [ 494] 81 494 2326 4 57344 192 -900 dbus-broker-lau
kernel: [ 502] 994 502 21689 3 69632 192 0 chronyd
kernel: [ 506] 81 506 1316 9 57344 192 -900 dbus-broker
kernel: [ 515] 0 515 3522 64 69632 288 -1000 sshd
kernel: [ 521] 0 521 79022 32 110592 448 0 ModemManager
kernel: [ 527] 0 527 13689 3 81920 224 0 gssproxy
kernel: [ 535] 0 535 815 0 45056 32 0 atd
kernel: [ 536] 0 536 2098 0 53248 192 0 crond
kernel: [ 539] 0 539 719 32 49152 32 0 agetty
kernel: [ 603] 0 603 4012 32 69632 384 0 sshd
kernel: [ 606] 1000 606 4512 32 77824 448 100 systemd
kernel: [ 607] 1000 607 26347 34 94208 896 100 (sd-pam)
kernel: [ 614] 1000 614 4096 7 69632 480 0 sshd
kernel: [ 615] 1000 615 2286 0 57344 608 0 bash
kernel: [ 899] 0 899 4021 32 73728 384 0 sshd
kernel: [ 901] 1000 901 4077 34 77824 480 0 sshd
kernel: [ 902] 1000 902 2283 32 49152 608 0 bash
kernel: [ 947] 0 947 4022 32 69632 384 0 sshd
kernel: [ 952] 1000 952 4096 0 69632 512 0 sshd
kernel: [ 953] 1000 953 2999 0 61440 928 0 bash
kernel: [ 3610] 0 3610 4022 64 69632 384 0 sshd
kernel: [ 3612] 1000 3612 4096 45 69632 448 0 sshd
kernel: [ 3613] 1000 3613 2250 32 61440 544 0 bash
kernel: [ 4162] 0 4162 4022 64 69632 384 0 sshd
kernel: [ 4164] 1000 4164 4096 64 69632 448 0 sshd
kernel: [ 4165] 1000 4165 2250 0 57344 608 0 bash
kernel: [ 6093] 1000 6093 1247 32 49152 32 0 unitd
kernel: [ 6095] 1000 6095 1086 34 49152 32 0 unitd
kernel: [ 6096] 1000 6096 44999 98 86016 64 0 unitd
kernel: [ 6097] 1000 6097 50868 99 126976 992 0 unitd
kernel: [ 6098] 1000 6098 57560 81 143360 1088 0 unitd
kernel: [ 15599] 1000 15599 3884 32 73728 256 0 su
kernel: [ 15697] 0 15697 3813 32 65536 192 0 systemd-userwor
kernel: [ 15908] 0 15908 3813 0 65536 224 0 systemd-userwor
kernel: [ 15909] 0 15909 3813 32 69632 192 0 systemd-userwor
kernel: [ 17808] 1000 17808 4526 32 73728 288 0 sudo
kernel: [ 17809] 0 17809 103114 16972 446464 15772 0 dnf
kernel: [ 17810] 1000 17810 3488 160 65536 448 0 make
kernel: [ 17826] 1000 17826 1221 0 49152 32 0 cc
kernel: [ 17832] 1000 17832 13133 576 135168 3168 0 cc1
kernel: [ 17833] 1000 17833 1221 32 49152 32 0 cc
kernel: [ 17837] 1000 17837 1659 32 53248 384 0 as
kernel: [ 17839] 1000 17839 13156 608 135168 2976 0 cc1
kernel: [ 17844] 1000 17844 1221 32 45056 32 0 cc
kernel: [ 17845] 1000 17845 1221 32 53248 32 0 cc
kernel: [ 17847] 1000 17847 13157 416 139264 3328 0 cc1
kernel: [ 17851] 1000 17851 1659 32 49152 352 0 as
kernel: [ 17853] 1000 17853 1659 32 49152 384 0 as
kernel: [ 17856] 1000 17856 1221 32 40960 32 0 cc
kernel: [ 17857] 1000 17857 12944 576 139264 2880 0 cc1
kernel: [ 17859] 1000 17859 1659 32 53248 352 0 as
kernel: [ 17860] 1000 17860 1221 32 49152 32 0 cc
kernel: [ 17862] 1000 17862 1221 32 40960 32 0 cc
kernel: [ 17863] 1000 17863 13138 1120 143360 2432 0 cc1
kernel: [ 17864] 1000 17864 1221 32 49152 32 0 cc
kernel: [ 17865] 1000 17865 1659 64 49152 352 0 as
kernel: [ 17866] 1000 17866 12881 1312 135168 2144 0 cc1
kernel: [ 17867] 1000 17867 1221 32 49152 32 0 cc
kernel: [ 17868] 1000 17868 1659 32 49152 384 0 as
kernel: [ 17869] 1000 17869 12975 832 139264 2752 0 cc1
kernel: [ 17871] 1000 17871 13125 576 143360 2912 0 cc1
kernel: [ 17872] 1000 17872 1221 32 40960 32 0 cc
kernel: [ 17873] 1000 17873 1659 32 57344 352 0 as
kernel: [ 17874] 1000 17874 12964 768 139264 2816 0 cc1
kernel: [ 17876] 1000 17876 1659 32 49152 352 0 as
kernel: [ 17878] 1000 17878 1221 0 45056 32 0 cc
kernel: [ 17879] 1000 17879 1659 32 49152 352 0 as
kernel: [ 17880] 1000 17880 13113 608 135168 2784 0 cc1
kernel: [ 17883] 1000 17883 1221 32 49152 32 0 cc
kernel: [ 17884] 1000 17884 1659 32 49152 384 0 as
kernel: [ 17885] 1000 17885 13120 512 139264 3136 0 cc1
kernel: [ 17887] 1000 17887 1221 32 40960 32 0 cc
kernel: [ 17888] 1000 17888 1659 32 53248 384 0 as
kernel: [ 17889] 1000 17889 1221 32 45056 32 0 cc
kernel: [ 17890] 1000 17890 12761 1536 131072 1568 0 cc1
kernel: [ 17891] 1000 17891 12963 511 139264 2784 0 cc1
kernel: [ 17892] 1000 17892 1221 32 49152 32 0 cc
kernel: [ 17893] 1000 17893 1659 32 49152 352 0 as
kernel: [ 17894] 1000 17894 1659 32 49152 352 0 as
kernel: [ 17895] 1000 17895 1221 0 45056 32 0 cc
kernel: [ 17896] 1000 17896 12739 1280 135168 1792 0 cc1
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/init.scope,task=(sd-pam),pid=607,uid=1000
kernel: Out of memory: Killed process 607 ((sd-pam)) total-vm:105388kB, anon-rss:8kB, file-rss:128kB, shmem-rss:0kB, UID:1000 pgtables:92kB oom_score_adj:100
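The two signatures above can be scanned for in one pass. This is a hypothetical grep, assuming you can read the kernel log on the node where the pod ran:

```shell
# Scan the kernel ring buffer for crash or OOM-killer signatures
dmesg -T 2>/dev/null \
  | grep -Ei 'segfault|sigbus|bus error|out of memory|oom-kill' \
  | tail -n 20

# Or, where journald collects kernel messages:
journalctl -k --no-pager \
  | grep -Ei 'segfault|out of memory|oom-kill' \
  | tail -n 20
```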
Thanks for your prompt responses, @ac000. I went through dmesg on all my Kubernetes nodes; there aren't any memory issues. I also checked Prometheus for the period, and we never went above the allowed 2 GiB of memory.
But let me increase the memory limits and run it for a while; I will update you.
I am afraid that even with memory allocated well above actual usage, I am still seeing signal 7. Unit cannot spin up workers and the app goes down.
After upgrading to 1.32 I began to see the above. How do I debug this?
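For reference, the shell's kill builtin maps a signal number to its name; on Linux x86-64, signal 7 is SIGBUS (signal numbering differs on some architectures, e.g. Alpha and SPARC):

```shell
# POSIX: kill -l NUM prints the signal name without the SIG prefix
kill -l 7    # on Linux x86-64 this prints: BUS
```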