Closed notsyncing closed 1 month ago
There's an i915 kernel driver regression on the CCS (Compute Command Streamer) engine of DG2/Arc GPUs.
https://gitlab.freedesktop.org/drm/intel/-/issues/10895
The faulty kernel commit in 6.6.26+ and 6.8.5+ is https://lore.kernel.org/stable/20240327155622.538140-4-andi.shyti@linux.intel.com/T/
Both Media-Driver and OpenCL are affected.
@JablonskiMateusz Could you please make the kernel developers aware of this regression?
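For anyone unsure whether their kernel falls in the range named above, here is a minimal sketch that checks a kernel release string against the two stable series mentioned in this comment (6.6.26+ and 6.8.5+). The function names and the lookup table are hypothetical, made up for illustration. The check knows nothing about other kernel series, distribution backports, or later stable releases that carry a fix, so treat a result as a hint only.

```python
import re

# Assumption: first affected patch level per stable series, taken only from
# the versions named in this thread (6.6.26+ and 6.8.5+). Other series are
# not covered, and no upper bound (fixed release) is encoded here.
FIRST_AFFECTED_PATCH = {(6, 6): 26, (6, 8): 5}


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '6.8.7-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def may_have_ccs_regression(release):
    """Return True if the release is in one of the series/ranges listed above."""
    major, minor, patch = parse_release(release)
    first_bad = FIRST_AFFECTED_PATCH.get((major, minor))
    return first_bad is not None and patch >= first_bad
```

On a Linux system this could be called as `may_have_ccs_regression(platform.release())` after `import platform`.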
This one seems to happen with any kernel that has the Spectre BHI fix in it. (Not sure if it is related or not, and disabling the fix doesn't change anything.)
Happens with LTS kernels as well. Happens with compute runtime 24.13 as well.
Kernels with this issue:
Kernels that work fine:
Haven't tried kernels older than 6.6 LTS.
OS: Arch Linux
Compute Runtime: aur/intel-compute-runtime-bin 24.13.29138.7-1
CPU: AMD Ryzen 7 5800X3D
GPU: Intel ARC A770 16GB
To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly
If this kernel fixes the issue, then it's likely related to the CCS change as well.
That custom kernel works fine, so it is CCS related. Spectre BHI is just a coincidence, I guess.
An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4
Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.
This one doesn't work, but htop now shows "gray / io-wait" CPU usage instead of "red / kernel" CPU usage.
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.
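For testers who want a repeatable pass/fail signal instead of watching the terminal, here is a minimal sketch that runs a command and reports whether it finished within a time limit. The helper name `run_with_timeout` and the 120-second limit in the usage note are assumptions for illustration, not something the Intel developers or clpeak prescribe.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'completed', 'failed', or 'hung'.

    'hung' means the process did not exit within timeout_s seconds.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck inside the kernel may ignore
        # the kill signal, so we do not wait for it to exit afterwards.
        proc.kill()
        return "hung"
    return "completed" if returncode == 0 else "failed"
```

With clpeak installed, `run_with_timeout(["clpeak"], 120)` would return "hung" if the benchmark stalls. Pick a limit comfortably longer than a normal run on your hardware.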
This works but performance is 1/4 of what it should be.
I can confirm that applications no longer hang with a kernel built with the new patches on drm-intel-gt-next, but compute performance seems to be lower on my system as well. Not sure whether something is going on between the compute runtime and these new changes, or whether that is out of scope for this discussion.
Can you try this kernel build to see if the performance improves?
This works fine and performance is pretty much the same as 6.8.4. Note: I am using a PCI-E 4.0 x4 port with a PCI-E riser so can't really say much about memcpy.
6.8.4 / 6.9.1-custom:
Can confirm that this is an improvement. I don't know how to run a benchmark to check memcpy, but comparing against a pretty intense Blender file I have lying around, there's a noticeable improvement in the viewport.
Now, with kernel 6.8.10 in Fedora 40, it works, though performance is not great. I'm closing this. Thanks, everyone!
@notsyncing Full performance needs the patch mentioned above, "drm/i915/gt: Fix CCS id's calculation for CCS mode setting", but it has only been merged in kernel 6.10-rc2, so it will take a while to reach Fedora.
As https://github.com/intel/compute-runtime/issues/710, @Disty0 writes:
This happens to me as well on both llama.cpp and clpeak.
clpeak output:
Then it gets stuck here, and the clpeak process consumes one CPU core (100% usage).
perf record -a while it is stuck reports:
System information: