kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/

Creating a GPU container failed using cold_plug_vfio #10470

Open junqiang0718 opened 21 hours ago

junqiang0718 commented 21 hours ago

The creation command is as follows:

root@node1:~# ctr --debug run --rm --runtime "io.containerd.kata.v2" --device /dev/vfio/218 -t "docker.io/nvidia/cuda:12.0.1-base-ubuntu20.04" cuda bash
DEBU[0000] remote introspection plugin filters filters="[type==io.containerd.snapshotter.v1, id==overlayfs]"
ctr: failed to create shim task: Failed to Check if grpc server is working: rpc error: code = DeadlineExceeded desc = timed out connecting to vsock 1521785742:1024: unknown
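For anyone trying to reproduce this, one thing worth ruling out before the VM starts is the state of the VFIO group on the host. The sketch below is only an illustration, not part of Kata Containers: it reads the standard sysfs layout to list the PCI devices in an IOMMU group (218 here, taken from the `--device /dev/vfio/218` argument above) and the driver each one is bound to. The function name `iommu_group_devices` and the `sysfs_root` parameter are hypothetical and exist only for this example.

```python
import os


def iommu_group_devices(group, sysfs_root="/sys"):
    """Map each PCI address in an IOMMU group to its bound driver name.

    Hypothetical helper for illustration. A device with no driver bound
    maps to None. `sysfs_root` is an assumption added so the function can
    be pointed at a fake tree for testing.
    """
    group_dir = os.path.join(
        sysfs_root, "kernel", "iommu_groups", str(group), "devices"
    )
    result = {}
    for bdf in sorted(os.listdir(group_dir)):
        # The "driver" entry is a symlink to the driver directory when bound.
        driver_link = os.path.join(
            sysfs_root, "bus", "pci", "devices", bdf, "driver"
        )
        if os.path.islink(driver_link):
            result[bdf] = os.path.basename(os.readlink(driver_link))
        else:
            result[bdf] = None
    return result
```

Calling `iommu_group_devices(218)` on the host would be expected to list `0000:ce:00.0` if that GPU really is in group 218. For passthrough, the devices in the group generally need to be bound to `vfio-pci` or have no host driver; anything else reported here would be worth checking first.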

The qemu process information is as follows:

root@node1:~# ps -ef | grep qemu
root 1130054 1130042 98 15:49 ? 00:00:37 /opt/kata/bin/qemu-system-x86_64 -name sandbox-cuda,debug-threads=on -uuid 1e9953c3-eb30-461b-8390-6a1b5d5da516 -machine q35,accel=kvm,nvdimm=on -cpu host,pmu=off -qmp unix:fd=3,server=on,wait=off -m 2048M,slots=10,maxmem=774492M -device pci-bridge,bus=pcie.0,id=pci-bridge-0,chassis_nr=1,shpc=off,addr=2,io-reserve=4k,mem-reserve=1m,pref64-reserve=1m -device virtio-serial-pci,disable-modern=false,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/cuda/console.sock,server=on,wait=off -device nvdimm,id=nv0,memdev=mem0,unarmed=on -object memory-backend-file,id=mem0,mem-path=/root/kata-containers-20241027-v3.10.0.img,size=1342177280,readonly=on -device virtio-scsi-pci,id=scsi0,disable-modern=false -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -device pcie-root-port,id=rp0,bus=pcie.0,chassis=0,slot=0,multifunction=off,pref64-reserve=536870912B,mem-reserve=67108864B -device pcie-root-port,id=rp1,bus=pcie.0,chassis=0,slot=1,multifunction=off,pref64-reserve=536870912B,mem-reserve=67108864B -device vfio-pci,host=0000:ce:00.0,x-pci-vendor-id=0x10de,x-pci-device-id=0x2231,bus=rp0 -device vhost-vsock-pci,disable-modern=false,vhostfd=4,id=vsock-3325680279,guest-cid=3325680279 -chardev socket,id=char-e6058abd3f3a9ba9,path=/run/vc/vm/cuda/vhost-fs.sock -device vhost-user-fs-pci,chardev=char-e6058abd3f3a9ba9,tag=kataShared,queue-size=1024 -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -object memory-backend-file,id=dimm1,size=2048M,mem-path=/dev/shm,share=on -numa node,memdev=dimm1 -kernel /root/vmlinuz-5.16.16-nvidia-gpu -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 console=hvc0 console=hvc1 debug systemd.show_status=true systemd.log_level=debug panic=1 nr_cpus=128 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none agent.log=debug agent.debug_console agent.debug_console_vport=1026 agent.log=debug initcall_debug agent.hotplug_timeout=360 -pidfile /run/vc/vm/cuda/pid -smp 1,cores=1,threads=1,sockets=128,maxcpus=128
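Because the QEMU command line is hard to read as a single line, a small parser can help pull out the parts relevant to passthrough, such as the `vfio-pci` device and the vsock `guest-cid`. The following is a minimal sketch using only the Python standard library; `parse_qemu_devices` is a hypothetical helper name, and it assumes that option values contain no embedded commas or quoting beyond what `shlex` handles.

```python
import shlex


def parse_qemu_devices(cmdline):
    """Return one dict per -device argument found in a QEMU command line.

    Hypothetical helper for illustration. Each dict has a 'driver' key
    plus every key=value property; bare properties are stored as True.
    """
    args = shlex.split(cmdline)
    devices = []
    # Pair every argument with its successor so "-device <spec>" is matched.
    for flag, value in zip(args, args[1:]):
        if flag != "-device":
            continue
        driver, _, rest = value.partition(",")
        props = {"driver": driver}
        for item in rest.split(","):
            if not item:
                continue
            key, sep, val = item.partition("=")
            props[key] = val if sep else True
        devices.append(props)
    return devices
```

Applied to the process listing above, this would report the `vfio-pci` entry with `host=0000:ce:00.0` attached to `bus=rp0`, and the `vhost-vsock-pci` entry with `guest-cid=3325680279`. Note that this CID is not the one in the earlier error message (1521785742), so the two outputs may have been captured from different attempts.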

The containerd log is as follows: containerd-1.7.22-v3.log

junqiang0718 commented 21 hours ago

kata.log

junqiang0718 commented 21 hours ago

@zvonkok @Apokleos Could you please help me look into this issue? Thanks.