squat / generic-device-plugin

A Kubernetes device plugin to schedule generic Linux devices
Apache License 2.0

Investigate Increased Memory Usage #17

Open dejanzelic opened 1 year ago

dejanzelic commented 1 year ago

Hello!

Thanks for making this, it's super useful!

I'm running this on a 4-node cluster (3x 2GB Pi 4Bs and 1x 8GB Pi 4B) to expose the sound device and a USB device. I was having a weird issue where everything worked great on 2 of my nodes, but I could only get it to work on the other 2 nodes once (with the default DaemonSet config). The 2 nodes that didn't work would never show any logs.

The issue started when I was trying to use the new USB device feature. But even when I went back to the default config, I still had the same issue.

I read the issue here: https://github.com/squat/generic-device-plugin/issues/11 and decided to also try upping the memory limit. As soon as I did, everything worked!

So it does sound like the new build uses more memory. Here is my current config that's working:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - '{"name": "audio", "groups": [{"count": 10, "paths": [{"path": "/dev/snd"}]}]}'
        - --device
        - '{"name": "zwave", "groups": [{"usb": [{"vendor": "0658", "product": "0200"}]}]}'
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 20Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
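
For anyone reproducing this setup: a pod consumes one of the devices advertised by the DaemonSet above by requesting the corresponding extended resource. Assuming the plugin's default `squat.ai` resource domain, a sketch of a consumer pod for the `audio` device would look like this (the pod name and image are illustrative, not from this thread):

```yaml
# Hypothetical consumer pod: requests one of the "audio" devices
# advertised by the generic-device-plugin DaemonSet above.
apiVersion: v1
kind: Pod
metadata:
  name: audio-consumer
spec:
  containers:
  - name: app
    image: alpine
    command: ["sleep", "infinity"]
    resources:
      limits:
        squat.ai/audio: 1
```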

While I'm here, I figured I should give some additional feedback. On the nodes without the USB device, I get this log message constantly:

{"caller":"usb.go:245","level":"info","msg":"no USB devices found attached to system","resource":"squat.ai/zwave","ts":"2023-04-05T01:31:47.627269533Z"}

It's not a problem, but I don't think this should be an "info"-level message.

squat commented 1 year ago

Nice! Those are both really good pieces of information! If you feel up to it, I'd happily merge a PR that changes that line to a debug level message (no pressure, I can get to it later today). Yes, for the time being we should probably bump the memory requested by the plugin in the default manifest included in the repository. In the medium term we should try to look at why memory usage has increased. I'll do some memory profiling to see if we can get it back down. Thanks again @dejanzelic

dejanzelic commented 1 year ago

Sweet! I'll submit a PR for the log-level change when I'm in front of my computer.

TakahiroW4047 commented 1 year ago

Hello,

Not sure if this is related, but I seem to be encountering a memory leak where the generic-device-plugin process is killed due to "Memory cgroup out of memory" (see the dmesg log below).

I'm currently running Kubernetes 1.27 with the latest release of generic-device-plugin as of today, on a cluster of Raspberry Pi 4B 8GB nodes running Ubuntu Server 22.04 LTS.

[145926.626040] generic-device- invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-997
[145926.626083] CPU: 2 PID: 298580 Comm: generic-device- Tainted: G C E 5.15.0-1033-raspi #36-Ubuntu
[145926.626097] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[145926.626104] Call trace:
[145926.626109]  dump_backtrace+0x0/0x200
[145926.626126]  show_stack+0x20/0x30
[145926.626136]  dump_stack_lvl+0x8c/0xb8
[145926.626150]  dump_stack+0x18/0x34
[145926.626159]  dump_header+0x54/0x21c
[145926.626173]  oom_kill_process+0x22c/0x230
[145926.626184]  out_of_memory+0xf4/0x370
[145926.626193]  mem_cgroup_out_of_memory+0x150/0x184
[145926.626206]  try_charge_memcg+0x5c4/0x670
[145926.626218]  charge_memcg+0x5c/0x100
[145926.626229]  mem_cgroup_charge+0x40/0x8c
[145926.626242]  add_to_page_cache_locked+0x20c/0x3c0
[145926.626255]  add_to_page_cache_lru+0x5c/0x100
[145926.626266]  pagecache_get_page+0x1d0/0x624
[145926.626278]  filemap_fault+0x588/0x830
[145926.626289]  do_fault+0x44/0xe0
[145926.626301]  do_read_fault+0xe4/0x1b0
[145926.626313]  do_fault+0xc0/0x360
[145926.626325]  handle_pte_fault+0x5c/0x1c0
[145926.626336]  handle_mm_fault+0x1d0/0x350
[145926.626348]  handle_mm_fault+0x108/0x294
[145926.626360]  do_page_fault+0x160/0x560
[145926.626369]  do_translation_fault+0x98/0xf0
[145926.626378]  do_mem_abort+0x4c/0xbc
[145926.626387]  el0_ia+0x9c/0x204
[145926.626397]  el0t_64_sync_handler+0x124/0x12c
[145926.626407]  el0t_64_sync+0x1a4/0x1a8
[145926.626417] memory: usage 10160kB, limit 10240kB, failcnt 27595
[145926.626429] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[145926.626440] Memory cgroup stats for /kubepods.slice/kubepods-pod54d889da_b92a_49f8_a602_5ed21cc852e5.slice:
[145926.626505] anon 9568256 file 139264 kernel_stack 196608 pagetables 143360 percpu 22464 sock 0 shmem 0 file_mapped 0 file_dirty 0 file_writeback 0 swapcached 0 anon_thp 0 file_thp 0 shmem_thp 0 inactive_anon 9560064 active_anon 8192 inactive_file 98304 active_file 0 unevictable 0 slab_reclaimable 91808 slab_unreclaimable 123440 slab 215248 workingset_refault_anon 0 workingset_refault_file 29514 workingset_activate_anon 0 workingset_activate_file 92 workingset_restore_anon 0 workingset_restore_file 45 workingset_nodereclaim 0 pgfault 107712 pgmajfault 43 pgrefill 1264 pgscan 825908 pgsteal 29480 pgactivate 863 pgdeactivate 955 pglazyfree 0 pglazyfreed 0 thp_fault_alloc 0 thp_collapse_alloc 0
[145926.626525] Tasks state (memory values in pages):
[145926.626533] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[145926.626550] [ 298513] 65535 298513 198 1 36864 0 -998 pause
[145926.626575] [ 298550] 0 298550 181003 4262 122880 0 -997 generic-device-
[145926.626595] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-e689c4e731d1a743a8861b71ea4d79648cd1cc8b80e1fe83e1faabdb114849ff.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-pod54d889da_b92a_49f8_a602_5ed21cc852e5.slice,task_memcg=/kubepods.slice/kubepods-pod54d889da_b92a_49f8_a602_5ed21cc852e5.slice/cri-containerd-e689c4e731d1a743a8861b71ea4d79648cd1cc8b80e1fe83e1faabdb114849ff.scope,task=generic-device-,pid=298550,uid=0
[145926.626815] Memory cgroup out of memory: Killed process 298550 (generic-device-) total-vm:724012kB, anon-rss:8736kB, file-rss:8312kB, shmem-rss:0kB, UID:0 pgtables:120kB oom_score_adj:-997
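
As a quick sanity check on the numbers in that log: the cgroup limit of 10240 kB corresponds to a 10Mi memory limit, and nearly all of the usage is anonymous memory (heap/stack), so the process genuinely outgrew its limit rather than being charged for reclaimable page cache:

```python
# Figures copied from the "Memory cgroup stats" lines in the dmesg log above.
limit_kb = 10240      # cgroup memory limit (10 Mi)
usage_kb = 10160      # usage at the moment the OOM killer fired
anon_bytes = 9568256  # anonymous (heap/stack) memory, in bytes

print(round(anon_bytes / 1024))  # anon memory in kB
print(limit_kb - usage_kb)       # remaining headroom in kB
```

That leaves only about 80 kB of headroom at the 10Mi limit, consistent with the earlier report that bumping the limit (e.g. to 20Mi) makes the plugin run reliably.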

squat commented 1 year ago

xref: #45