intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Zombie processes #681

Open pvelesko opened 9 months ago

pvelesko commented 9 months ago

Is there a way to reload/reset the driver to get rid of these dead processes? These are a result of running various Level Zero tests overnight.

image

JablonskiMateusz commented 9 months ago

Hi @pvelesko, what are these processes, and who spawned them?

pvelesko commented 9 months ago

We have a large set of tests that we run on the Level Zero backend. Some of these tests leave behind processes that I can't figure out how to kill. I tried unbinding the dGPU from the i915 driver and I tried kill -9, but so far the only thing that works is a forced reboot through sysrq:

alias reboot_hardcore="history -a && sudo sh -c 'echo b > /proc/sysrq-trigger'"

A regular reboot doesn't work either, as the system just hangs while waiting on something. Is there a way to reset the driver completely, given that we can't unload the driver while it's in use?
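For narrowing down what "can't be killed" means here, a minimal sketch that lists processes in the `Z` (zombie) or `D` (uninterruptible sleep) state. The helper name `filter_stuck` is hypothetical, and the sketch assumes a procps-style `ps`. Neither state responds to `kill -9`: a zombie has already exited and is waiting to be reaped by its parent, while a `D`-state process is blocked inside the kernel.

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose STAT column (3rd field)
# starts with D (uninterruptible sleep) or Z (zombie).
# Expects "ps -eo pid,ppid,stat,wchan,comm" style input with a header row.
filter_stuck() {
    awk 'NR > 1 && $3 ~ /^[DZ]/ { print }'
}

# WCHAN hints at the kernel function a D-state process is sleeping in.
command -v ps >/dev/null && ps -eo pid,ppid,stat,wchan,comm | filter_stuck
```

If the stuck processes show up in state `D`, the WCHAN column may hint at where in the kernel they are blocked, which is useful detail to attach to a report.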

HoppeMateusz commented 9 months ago

Hello,

Have you tried unbinding i915 like this?

sudo sh -c "echo -n auto > /sys/bus/pci/devices/0000\:00\:02.0/power/control"
sudo sh -c "echo -n "0000:00:02.0" > /sys/bus/pci/drivers/i915/unbind"
sudo modprobe -r i915

The PCI number can be found with:

lspci | grep -i display

or

lspci | grep -i vga
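As an alternative to grepping `lspci`, the addresses of devices currently bound to a driver can be read from sysfs, where each bound device appears as an entry named after its PCI address. A minimal sketch follows. The function name `bound_devices` and its optional second argument (an alternate sysfs root, used only for testing) are hypothetical, and the pattern assumes the common four-digit PCI domain form such as `0000:00:02.0`.

```shell
#!/bin/sh
# Hypothetical helper: print PCI addresses of devices bound to driver $1.
# $2 optionally overrides the sysfs drivers directory (for testing).
bound_devices() {
    driver_dir="${2:-/sys/bus/pci/drivers}/$1"
    for entry in "$driver_dir"/*; do
        name=$(basename "$entry")
        case "$name" in
            # domain:bus:device.function, e.g. 0000:00:02.0
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-7])
                echo "$name" ;;
        esac
    done
}

# Prints nothing if i915 is not loaded or has no bound devices.
bound_devices i915
```

Each printed address can then be written to the driver's `unbind` file as in the commands above.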

pvelesko commented 9 months ago

I'll try this, thank you.

pvelesko commented 9 months ago

After unbinding the devices, modprobe -r i915 still fails with: modprobe: FATAL: Module i915 is in use.
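To see why the module still counts as in use, the reference count and dependent modules can be read from `lsmod`. This is a minimal sketch with a hypothetical helper name, `module_usage`, assuming the usual `lsmod` column layout (name, size, reference count, dependents).

```shell
#!/bin/sh
# Hypothetical helper: report refcount and dependents for module $1
# from lsmod-style input (columns: Module, Size, Used-by count, dependents).
module_usage() {
    awk -v m="$1" '$1 == m {
        print "refcount=" $3, "used_by=" ($4 == "" ? "-" : $4)
    }'
}

command -v lsmod >/dev/null && lsmod | module_usage i915
```

A non-zero reference count with no dependent modules listed usually means something other than another module is holding it, for example a process that still has a device node open. That would be consistent with the stuck test processes, though it is an assumption rather than something confirmed in this thread.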

HoppeMateusz commented 9 months ago

Hello, thanks for trying. Unloading i915 might not work if the KMD is hanging somewhere; a reboot should be the ultimate way of resetting.

Can you provide details for reproducing the issue? What platform is it, and which OS version and i915 version are you running? If possible, please also share a reproducer. Thanks.