Closed jvrobert closed 2 years ago
It looks like a GPU hang occurs. May I know which codec your decoded content uses? It would also help if you could share the content with us so we can take a look locally.
I've run into the same issue on my end. On initial boot vainfo seems fine.
After attempting to execute:
ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v hevc_vaapi sdr_out.mp4 -y
it runs for a few moments before the application crashes.
Once it crashes, the VAAPI interface seems inoperable. I'll post the before and after below, as well as the relevant excerpts from dmesg and the FFmpeg logs.
The source file in question is a Sony HDR10 demo file called "Sony Swordsmith HDR UHD 4K Demo.mp4". It can be found in various places online, but the relevant characteristics are: Video: hevc (Main 10) (hvc1 / 0x31637668), yuv420p10le(tv, bt2020nc/bt2020/smpte2084), 3840x2160 [SAR 1:1 DAR 16:9], 71382 kb/s, 59.94 fps, 59.94 tbr, 60k tbn (default)
First run of ffmpeg where it initially hangs:
After the initial crash:
Brief side note: to eliminate variables related to latency, core scheduler weirdness etc. the source file was copied to /tmp and then always referred to via soft link, as was the output file.
Ecores were disabled and kernel is 5.17.1 default from upstream in a debian/ubuntu environment.
Primary display is connected to a dGPU, however the iGPU does have a monitor connected to it. This monitor continued to function normally under wayland
The source file in question is a Sony HDR10 demo file called "Sony Swordsmith HDR UHD 4K Demo.mp4"
@FCLC Please provide a link, to make sure it's the same version you are using. GuC and HuC FW versions would be also good to know. (I'm not media developer, but these have been needed for bugs I've reported myself)
Sure, file can be found here: https://4kmedia.org/sony-swordsmith-hdr-uhd-4k-demo/
MD5 of the file: a4dcfe93ab98d7e582b2554e7a8008c9 Sony Swordsmith HDR UHD 4K Demo.mp4
SHA1 if preferred: 292eb58f69ae17aabd576aff781953ca6bae9051 Sony Swordsmith HDR UHD 4K Demo.mp4
or SHA 512: e0e3a7f2402d154eb7d2c1f0a11c7915f98e1ea2ff8c6eb8293b14864958bcc6d392d264d647d4658e3f1e6bc193849faea1e13c3efa326522e7835e2cffe779 Sony Swordsmith HDR UHD 4K Demo.mp4
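For anyone who wants to confirm they are testing with the same file, here is a minimal sketch of a digest comparison. `verify_sha1` is a hypothetical helper name, and the sketch assumes GNU coreutils `sha1sum` and `awk` are available.

```bash
#!/usr/bin/env bash
# verify_sha1 FILE EXPECTED_HEX
# Hypothetical helper: succeeds only if FILE's SHA-1 digest equals EXPECTED_HEX.
verify_sha1() {
    local file="$1" expected="$2" actual
    # sha1sum prints "<digest>  <name>"; keep only the digest field.
    actual="$(sha1sum -- "$file" | awk '{print $1}')" || return 2
    [ "$actual" = "$expected" ]
}
```

With the SHA1 posted above, `verify_sha1 "Sony Swordsmith HDR UHD 4K Demo.mp4" 292eb58f69ae17aabd576aff781953ca6bae9051 && echo "same file"` would print the message only for a matching download.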
Firmware versions: GuC
GuC firmware: i915/tgl_guc_62.0.0.bin
status: LOADABLE
version: wanted 62.0, found 62.0
uCode: 325632 bytes
RSA: 256 bytes
GuC status 0x00000001:
Bootrom status = 0x0
uKernel status = 0x0
MIA Core status = 0x0
Scratch registers:
0: 0x0
1: 0x0
2: 0x0
3: 0x0
4: 0x0
5: 0x0
6: 0x0
7: 0x0
8: 0x0
9: 0x0
10: 0x0
11: 0x0
12: 0x0
13: 0x0
14: 0x0
15: 0x0
GuC log relay not created
HuC
HuC firmware: i915/tgl_huc_7.9.3.bin
status: LOADABLE
version: wanted 7.9, found 7.9
uCode: 589504 bytes
RSA: 256 bytes
HuC status: 0x00090001
Something I'm noticing now is that the i915 driver seems to be using the Tiger Lake firmware, and that the upstream firmware git at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915 does not have any GuC or HuC firmware for ADL-S, only for ADL-P.
The reason this is off-putting is that, though the VAAPI driver page (https://github.com/intel/media-driver#supported-platforms) lists ADL-S under TGLx, it also lists ADL-P. That leads me to believe that perhaps a bug was found when using the TGLx driver on ADL-P, and that the same or a similar bug may also be present on ADL-S but has yet to be diagnosed or upstreamed?
Also, the firmware git has newer versions than the one loaded on 5.17.1 mainline: GuC here is 62.0, while git has tgl_guc version 69.0.3, yet the driver defaults to 62.x.
Why this is, I'm not certain.
Here are two straces, one of a functional AMD vainfo dump and one of the failing vainfo attempt on Alder Lake-S: log_amd.txt log_intel.txt
More debugging, after rebuilding the kernel and rebooting it continues to load version 62 by default.
I was able to use the tonemapping filter. However, the issue seems to be recovering the device after a failed attempt. While attempting to change the parameters for a second test, I killed the active ffmpeg command (Ctrl+C) and the iGPU segfaulted.
Now back to square one.
This may be related to recovering the chip after an error or buffer overflow?
Editing drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c manually to use version tgl 69.0.3 instead of tgl 62.0.0 may be a possible way forward.
Hi @FCLC, what is your platform, ADL-P or ADL-S? Something is different between the two.
editing drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c manually to use version tgl 69.0.3 instead of tgl 62.0.0 may be a possible way forward.
@FCLC Besides fixes, GuC has also API changes now and then, that's why specific i915 version loads specific GuC version. Therefore changing the GuC version from the i915 sources is not a good idea, unless you somehow know which versions are compatible with each other.
This seems to have a much wider scope than is already being discussed.
We see this with a different device:
'Device 32902:39497' Id:39497 (Driver: Intel iHD driver for Intel(R) Gen Graphics - 21.2.2 (1dd7d7f), Vendor: 32902) The device id resolves to: TigerLake-LP GT2 [Iris Xe Graphics]
It is not specific to a certain video, nor to HDR, HEVC or 4K. It happens with a simple Full HD H.264 video.
Details can be found here: https://emby.media/community/index.php?/topic/107064-quicksync-works-once-per-boot-then-stops-working-and-uses-software/&do=findComment&comment=1130529
Let me know which information you might need. I can also instruct our tester to try certain things.
Thanks, softworkz
Hi @Jexu, platform is ADL-S, specifically a 12700-k
Part of the thinking is that upstream git (Linux/next 2022-04-03 for the 5.18 cycle) already has TGL69.0.3 marked as the expected version for ADL-S, so in theory should be loadable.
I ran into build errors, typical of upstream builds, especially pre-RC1, so I wasn't able to experiment personally.
A different error that may or may not be related is HEVC encoding:
Any and all HEVC encoded content output by ADL-S seems to be coming out either a green or pink mess. Using the same source file above, I can change which colour is dominant by changing the time start parameter, so perhaps something to do with the inter frame data being written to the file? H264 does not experience this issue.
'Device 32902:39497' Id:39497 (Driver: Intel iHD driver for Intel(R) Gen Graphics - 21.2.2 (1dd7d7f), Vendor: 32902)
The device id resolves to: TigerLake-LP GT2 [Iris Xe Graphics] [...] softworkz
Both of these devices (tgl and ADL-S) load the same subversion of the HuC and GuC firmware per the docs and the driver source, so I'd presume that the point of overlap is there.
It seems to me that this may be related to resetting the state of the device? The kernel source has options for the heartbeat, hang detection and so on, but in this case I'm not seeing that detection being asserted/sent to kernel logs.
I'm attempting with 5.18-rc1 instead of next, will report back when I can
So, to summarize the issue you saw:
1. The GPU hang occurs with ffmpeg transcode on TGL/ADL-S. (Please give the log in /sys/class/drm/card0/error, to check if the HEVC clip has real tile)
I can't speak for tgl, @softworkz could you perhaps ping your tester and have them run more exhaustive testing in the case where the previous testing doesnt answer the above?
As for ADL-S:
$ sudo cat /sys/class/drm/renderD128/device/drm/card0/error
cat: /sys/class/drm/renderD128/device/drm/card0/error: No such device
$ vainfo --display drm --device /dev/dri/renderD128
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
Segmentation fault (core dumped)
$ sudo cat /sys/class/drm/renderD128/device/drm/card0/error
cat: /sys/class/drm/renderD128/device/drm/card0/error: No such device
$ sudo cat /sys/class/drm/card0/error
cat: /sys/class/drm/card0/error: No such device
2. The gpu is crashed after first hang occurs and need to reboot to recover. (Please check ll /dev/dri after first hang; I915 driver/ guc fail to reset the gpu which normally should not happen)
Not certain what you mean here; if you mean if the device is still present in /dev, it is
ls /dev/dri/
by-path card0 card1 renderD128 renderD129
3. Do you try the ffmpeg decode only, without encode(transcode)?
Decode is fine prior to the initial crash.
testing with the standard ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i source.mp4 -f null -
as a way to decode and send directly to null the decoder will operate fine prior to the crash.
After the first crash decode and encode are impossible.
after the test with 5.18-rc1 I will try
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i input.mp4 -c:v libx264 -crf 20 output.mp4
Per your test and log:
Just booted up into 5.18-rc1.
ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i ~/Videos/hdr_source.mp4 -f null -
was able to complete multiple times in a row without error.
Full output from the command is here:
Subsequently running ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i hdr_source.mp4 -c:v libx264 -crf 20 sdr_out.mp4 -y
seems to be running fine (currently waiting for the file to complete; encoding UHD H264 420 10bit@60fps isn't a small task even on the latest chips).
The file is fine: (GNOME screen capture of ffplay output image as POC)
I'm testing the same with output to libx265 now, and will follow up afterwards with encoding using VAAPI.
Good to know it works with 5.18-rc1.
Now attempting to run
ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v h264_vaapi sdr_out.mp4 -y
and
ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v hevc_vaapi sdr_out.mp4 -y
Unfortunately HEVC continues to be completely broken
As a sanity check, I've now run
HEVC p010 and nv12
ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi,hwdownload,format=nv12,hwupload' -c:v hevc_vaapi -b:v 5M sdr_out.mp4
H264 nv12
ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi,hwdownload,format=p010,hwupload' -c:v hevc_vaapi -b:v 5M sdr_out.mp4
They both use the Intel iGPU and VAAPI to decode the HEVC 10-bit HDR file and tonemap it to SDR Rec. 709. They then pass the file on to a known-good, fully functional AMD GPU VAAPI instance at /dev/dri/renderD129 for render at either
running ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi=format=p010,hwdownload,hwupload' -c:v hevc_vaapi -b:v 30M sdr_out.mp4 -y
led to
This was caused by omitting the format=p010 between hwdownload and hwupload.
Re-adding the format fixes the issue, which makes me wonder if that may be responsible for the hevc_vaapi issues above?
Testing only the hevc encoder with ffmpeg -hide_banner -loglevel verbose -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'format=p010,hwupload' -c:v hevc_vaapi -b:v 15M -profile:v 2 sdr_out.mp4
Yields:
attempting hevc_qsv using: ffmpeg -hide_banner -loglevel verbose -init_hw_device qsv=hw -filter_hw_device hw -i hdr_source.mp4 -vf hwupload=extra_hw_frames=64,format=qsv -c:v hevc_qsv -b:v 30M sdr_out.mp4
As of now there are a few issues:
1. TGLx has issues with re-initializing the GPU if the device is killed during an active session.
2. hevc_vaapi encode has an issue whereby, diverging from the QSV code path, it produces garbage data.
3. Interactions between QSV and VAAPI seem broken when mapping between internal surfaces.
running ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'scale_vaapi=1920:1080,hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y
results in
Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
Error while processing the decoded data for stream #0:0
[AVIOContext @ 0x22ec040] Statistics: 0 bytes written, 0 seeks, 0 writeouts
^C^C^CReceived > 3 system signals, hard exiting
running
`ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y`
[hevc_qsv @ 0x310c440] Using input frames context (format qsv) with hevc_qsv encoder.
[hevc_qsv @ 0x310c440] Encoder: input is video memory surface
corrupted double-linked list
Aborted (core dumped)
Finally, a piece of good news:
The issue of quitting an active session crashing the VAAPI device into an unrecoverable state does seem to be solved in 5.18-rc1.
I've created a small bash script to begin testing speed and options more exhaustively.
One thing I'm noticing is that the QSV and VAAPI encoders seem to have different performance characteristics.
running
$ cat benchmark.sh
#!/bin/bash
echo "vaapi render 128- intel"
ffmpeg -loglevel quiet -stats -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -f null -
echo "QSV render 128- intel"
ffmpeg -loglevel quiet -stats -hwaccel qsv -c:v hevc_qsv -i hdr_source.mp4 -f null -
results in:
vaapi render 128- intel
frame= 5160 fps=368 q=-0.0 Lsize=N/A time=00:01:26.10 bitrate=N/A speed=6.14x
QSV render 128- intel
frame= 5160 fps=344 q=-0.0 Lsize=N/A time=00:01:26.10 bitrate=N/A speed=5.75x
Which means that the vaapi path is outperforming the QSV code path
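To put a number on that difference, the final stats lines can be parsed instead of read by eye. This is a rough sketch only: `stats_fps` and `fps_gain_percent` are hypothetical helper names, and the parsing assumes the `fps=` field looks as it does in the output above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the fps= value from ffmpeg -stats text on stdin.
# If several progress lines are present, the last matching one wins.
stats_fps() {
    sed -n 's/.*fps= *\([0-9.][0-9.]*\).*/\1/p' | tail -n 1
}

# Hypothetical helper: percentage by which fps value $1 exceeds fps value $2,
# printed with one decimal place.
fps_gain_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}
```

For the run above, `fps_gain_percent 368 344` prints 7.0, so the VAAPI decode path was roughly 7% faster in this single measurement.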
@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.
one thing I'm noticing is that the qsv and vaapi encoders seem to have different performance characteristics.
@FCLC perf is out of scope for this ticket.
FYI: These ffmpeg backends have some differences in how they do threading, syncing, allocation etc, and this can change from FFmpeg version to another. See for example these old tickets of mine:
If you would use Gstreamer, or "sample_multi_transcode" tool from MediaSDK / OneVPL to do the same thing, you'd likely see some differences to FFmpeg perf with these APIs too, for same reasons. But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.
But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.
I can't confirm this. From our experience on many different user systems, QSV has shown better performance in almost every case with H.264 encoding and various decoders and processing filters.
Which encoding codec are you talking about?
@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.
Do I understand correctly that the only way to fix this is to change the kernel?
Which encoding codec are you talking about?
@softworkz See e.g. the above FFmpeg 7706 ticket for QSV & VA-API command lines and commit ID you need (there's no FFmpeg release yet that would include fix to that large 3 years old VA-API perf regression, you need to build Git master yourself).
Do I understand correctly that the only way to fix this is to change the kernel?
While GPU hang could be due either kernel or user-space driver issue, if hang recovery fails, that's always a kernel bug (separate from the hang itself).
@eero-t - you wrote:
But yes, VA-API has been faster than QSV in about everything
while the ticket 7706 says:
VAAPI H264 transcode performance dropped 20-30%
While GPU hang could be due either kernel or user-space driver issue, if hang recovery fails, that's always a kernel bug (separate from the hang itself).
I don't understand. What does this mean? In the past years there didn't exist any case where the only way to use that hardware would have been to change the kernel version. Many of our users can't, don't want, aren't allowed or aren't able to make such change.
@softworkz While I'm answering you once more, none of this is relevant to the media-driver / GPU hang discussion here. Please ask your questions in the appropriate place for them. Kernel bugs & updates belong to kernel driver projects (in this case, i915) and/or upstream kernel. FFmpeg performance belongs to FFmpeg project (e.g. tickets I linked).
I don't understand. What does this mean? In the past years there didn't exist any case where the only way to use that hardware would have been to change the kernel version.
Only kernel can do device hang recovery. If that is buggy, you need a fix. Fix is in kernel. Besides bug fixes, you often need kernel updates also to get support for new HW devices.
Many of our users can't, don't want, aren't allowed or aren't able to make such change.
If you cannot get / build newer kernel now, then you obviously need to wait.
E.g. Ubuntu LTS releases get new HWE (hardware enabling) kernels every few months: https://wiki.ubuntu.com/Kernel/LTSEnablementStack
And enterprise distros occasionally backport fixes from latest upstream to their ancient kernel versions. If you want to expedite that process, let your ISV know the importance (and existence) of given kernel fix.
But yes, VA-API has been faster than QSV in about everything
while the ticket 7706 says:
VAAPI H264 transcode performance dropped 20-30%
That (3 year old) regression was fixed in FFmpeg master over month ago. Before that regression, and after its fix, VA-API backend in FFmpeg is faster than QSV one in my tests. During the 3 years while that perf regression was in effect, doing single transcode with FFmpeg was faster with QSV than VA-API in those tests, but that was FFmpeg bug.
(Note: this was on Ubuntu, i.e. using the powersave / ondemand governor, with the latest drm-tip Git kernel in addition to the latest media stack from Git. Running many transcode operations in parallel was still generally faster with VA-API during that period in my tests, though; QSV was faster only when running a single transcode instance.)
While I'm answering you once more
This is very generous of you.
Before that regression, and after its fix, VA-API backend in FFmpeg is faster than QSV one in my tests. During the 3 years while that perf regression was in effect, doing single transcode with FFmpeg was faster with QSV than VA-API in those tests, but that was FFmpeg bug.
Thanks for the explanation, the timeline wasn't obvious.
FFmpeg performance belongs to FFmpeg project (e.g. tickets I linked).
I'm afraid, but I responded to your comment. I hadn't brought up that subject.
none of this is relevant to the media-driver / GPU hang discussion here
Could you kindly let me know whether the symptoms I had referenced (https://emby.media/community/index.php?/topic/107064-quicksync-works-once-per-boot-then-stops-working-and-uses-software/page/2/#comment-1130529) are relevant to the "media-driver / GPU hang discussion here" or solely a matter of that kernel update?
Thanks, sw
one thing I'm noticing is that the qsv and vaapi encoders seem to have different performance characteristics.
@FCLC perf is out of scope for this ticket.
FYI: These ffmpeg backends have some differences in how they do threading, syncing, allocation etc, and this can change from FFmpeg version to another. See for example these old tickets of mine:
* mem: https://trac.ffmpeg.org/ticket/7943
* perf: https://trac.ffmpeg.org/ticket/7706
* perf: https://trac.ffmpeg.org/ticket/7690
If you would use Gstreamer, or "sample_multi_transcode" tool from MediaSDK / OneVPL to do the same thing, you'd likely see some differences to FFmpeg perf with these APIs too, for same reasons. But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.
Sounds good, was more so a minor observation as a side effect of different testing scenarios.
@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.
Do I understand correctly that the only way to fix this is to change the kernel?
More so that this may be a way forward that I've found.
Once we bisect which change is responsible for the fix, it can be backported to the LTS kernels used by distributions (examples on the Ubuntu side: 18.04, 20.04), such as 5.4.x, 5.10.x etc.
for those not experienced in self building kernels:
The following is a known good kernel config for adl-s on a 12700k running pop-os (basically modified version of ubuntu that is rolling release):
You'll have to rename it to .config:
mv config.txt .config
will do it. Then run
make menuconfig
to double check that everything seems fine.
NB:
If trying to build 5.18 RC series kernels, note that Linus has insisted on re-enabling the -Werror parameter for gcc in kernel builds, meaning that RC-1 failed to build until I disabled certain complaining modules around GVT-g and KVM, as well as -Werror checking for certain areas.
A unique aspect of my environment is that I'm running GCC 12 for developing avx512-fp16 BLAS kernels; normal builds don't need this and should instead use mainline gcc 11.2.
@eero-t @Jexu regarding the HEVC encoder issues seen in https://github.com/intel/media-driver/issues/1342#issuecomment-1091817370 and https://github.com/intel/media-driver/issues/1342#issuecomment-1091913451
in reference to the second issue listed below, it has not been documented in https://github.com/intel/media-driver#known-issues-and-limitations
as of now there's a few issues:
1. TGLx has issues with re-initializing the gpu if the device is killed during an active session.
2. hevc_vaapi encode has an issue whereby, diverging from the QSV code path, it produces garbage data
3. interactions between QSV and vaapi seems broken when mapping between internal surfaces.
running `ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'scale_vaapi=1920:1080,hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y` results in
Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
Error while processing the decoded data for stream #0:0
[AVIOContext @ 0x22ec040] Statistics: 0 bytes written, 0 seeks, 0 writeouts
^C^C^CReceived > 3 system signals, hard exiting
and I also don't see anything regarding issue 3.
Issue 2 seems to me as a media-driver related issue and should be solved here.
Issue 3 may be a combination of ffmpeg hardware surface mappings as well as Intel libva driver issues. However, the above command works fine on previous-generation chips, so I'm leaning towards the issue being on the graphics stack side of things.
Should we be opening new issues for these?
or solely a matter of that kernel update?
@softworkz Anything where reboot is needed to restore GPU to a working state, is a kernel (or FW) bug. Looking at your dmesg, issue could be also on FW side, but that will typically also need kernel update, as specific kernel versions load only specific FW versions (that have compatible API/ABI).
Thanks a lot, that makes the situation more clear, but also even more unpleasant (or almost impossible) to deliver as part of an installation package (essentially a whole range of installation packages for multiple distros and platforms). I choose the blue pill... :-)
For an update, you do not necessarily need to replace whole kernel or do reboot (unless GPU is already in unrecoverable state). Just modprobing updated i915 module (after installing compatible FW) can be enough, but that is not necessarily easier. It still needs to be built for that particular kernel version, and to modprobe new version, you need to rmmod old version first (which can be hard if you do not know what is blocking that).
Thanks a lot. I'll see what our packaging expert will say, but it doesn't really sound feasible.
Trying to look at it from a different angle: what do you think how long it might take until this turns into a rare issue, only affecting a very small percentage of Linux installations (of all flavors)?
That will depend very much on the distributions that emby has in their LTS pipe.
For now, something you may want to consider is a check at installation time:
if (platform == adl-s || platform == adl-n || platform == adl-p || platform == tgl || platform == rkl || platform == rpl-s) {
if (kernel version < 5.18) {
print "platform has known issues with VAAPI using intel iGPU's. Disabling VAAPI and falling back on QSV and software filters"
}
}
Perhaps also perform this check on updates? I'd assume emby has a working mechanism for checking available mechanisms, so this shouldn't be too hard to add in as another condition.
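The kernel half of that check could look something like the following in an install script. This is a sketch under assumptions: `kernel_older_than` is a hypothetical helper, it relies on GNU `sort -V`, it treats pre-release suffixes such as -rc1 as the release itself, and the 5.18 threshold is only what was observed in this thread, not a confirmed fix version. Platform detection is left out.

```bash
#!/usr/bin/env bash
# kernel_older_than RELEASE VERSION
# Hypothetical helper: succeeds if kernel RELEASE (e.g. the output of uname -r)
# sorts strictly before VERSION. Anything after the first "-" is dropped, so
# "5.18.0-rc1" is compared as "5.18.0" (an assumption, see the text above).
kernel_older_than() {
    local have="${1%%-*}" want="$2"
    [ "$have" = "$want" ] && return 1
    # sort -V orders version strings; if RELEASE comes first, it is older.
    [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$have" ]
}
```

An installer could then wrap its warning as `if kernel_older_than "$(uname -r)" 5.18; then echo "known VAAPI issues on this kernel"; fi`.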
The more precise way to check would be via the HuC and GuC firmware versions, both of which can be checked under /sys.
I'd assume emby has a working mechanism checking for available mechanisms, so this shouldn't be too hard to add in as another condition.
Yes, we have a detection calling libva directly.
the more precise way to check would be via the huc and guc firmware versions,
I guess you mean similar to this?
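int value = 0; /* receives the HuC status */
struct drm_i915_getparam gp = {0};
int fd = open("/dev/dri/renderD128", O_RDWR);
gp.param = I915_PARAM_HUC_STATUS;
gp.value = &value; /* getparam writes the result through this pointer */
int ok = drmCommandWriteRead(fd, DRM_I915_GETPARAM, &gp, sizeof(gp)) == 0;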
print "platform has known issues with VAAPI using intel iGPU's. Disabling VAAPI and falling back on QSV and software filters"
Yup, that's similar to the plan I already made for JSL/EHL, which I previously thought would be the one and only painpoint..
That should be workable. Otherwise, if you're using something like a bash script for configure/install, you could cat /sys/kernel/debug/dri/0/gt/uc/guc_info and then also cat /sys/kernel/debug/dri/0/gt/uc/huc_info
edit:
example output on 5.18-rc1:
~/Videos$ sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info
GuC firmware: i915/tgl_guc_69.0.3.bin
status: RUNNING
version: wanted 69.0, found 69.0
uCode: 342912 bytes
RSA: 256 bytes
GuC status 0x8003f0ec:
Bootrom status = 0x76
uKernel status = 0xf0
MIA Core status = 0x3
Scratch registers:
0: 0x0
1: 0x163fdf
2: 0x40000
3: 0x4000
4: 0x40
5: 0x2ec8
6: 0x4680000c
7: 0x0
8: 0x0
9: 0x0
10: 0x0
11: 0x0
12: 0x0
13: 0x0
14: 0x0
15: 0x0
GuC log relay not created
~/Videos$ sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info
HuC firmware: i915/tgl_huc_7.9.3.bin
status: RUNNING
version: wanted 7.9, found 7.9
uCode: 589504 bytes
RSA: 256 bytes
HuC status: 0x00090001
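A script could reduce either dump to the three fields that differ between the 5.17.1 and 5.18-rc1 outputs in this thread. The sketch below assumes the layout shown above; `uc_fw_summary` is a hypothetical helper name, and debugfs output is not a stable interface, so the field positions may change between kernel versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read guc_info or huc_info text on stdin and print
# "<status> <wanted version> <found version>" on one line.
uc_fw_summary() {
    awk '
        $1 == "status:"  { status = $2 }
        # "version: wanted 69.0, found 69.0" -> strip the trailing comma
        $1 == "version:" { wanted = $3; found = $5; sub(/,$/, "", wanted) }
        END { print status, wanted, found }
    '
}
```

Running `sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info | uc_fw_summary` on the 5.18-rc1 system above would give `RUNNING 69.0 69.0`, against `LOADABLE 62.0 62.0` for the 5.17.1 dump posted earlier. Whether the status field alone is a reliable health indicator is an assumption.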
Yup, that's similar to the plan I already made for JSL/EHL, which I previously thought would be the one and only painpoint..
for what it's worth, I'm already having to do similar things to get around the lack of HDR support in the AMDGPU vaapi driver stack.
(Thankfully we're well past the dark days of OpenCL 1.0)
get around the lack of HDR support in the AMDGPU vaapi driver
They are doing so little to get better AMD support into ffmpeg...we just have minimal support for these..
falling back on QSV and software filters
Why "falling back on QSV"?
System information
model name : 12th Gen Intel(R) Core(TM) i7-12700K
00:02.0 VGA compatible controller [0300]: Intel Corporation AlderLake-S GT1 [8086:4680] (rev 0c)
no display, render only in ffmpeg
Issue behavior
Describe the current behavior
When using the latest compiled media driver and ffmpeg 5 (also happens on 4.x) with latest drm-tip kernel/linux-firmware bins (also happens on Ubuntu 20.04 HW kernel), ffmpeg (running under Frigate NVR) will support hw acceleration using either qsv or vaapi decode for somewhere between 10-30 minutes (usually, sometimes longer). After that, it crashes the GPU with this error:
[ 4009.472554] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
[ 4009.474067] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.835642] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.836679] i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
[ 4020.837224] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
[ 4020.939613] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.028683] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[ 4021.028762] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x25b/0x2d0 [i915]
[ 4021.131605] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.133494] i915 0000:00:02.0: [drm] ffmpeg[27844] context reset due to GPU hang
[ 4023.672616] ffmpeg[27894]: segfault at 0 ip 0000000000000000 sp 00007fff30a1add8 error 14 in ffmpeg[556214dda000+b000]
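The distinction drawn earlier in the thread, between a hang the kernel recovers from and a failed reset that needs a reboot, can be read off a log like this one mechanically. The following is a sketch only: `classify_i915_log` is a hypothetical helper, and the message strings it matches are taken from this report and may differ on other kernel versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify i915 kernel log text read from stdin.
#   "unrecoverable" if a failed chip or GuC reset is logged
#   "hang"          if only a GPU hang is logged
#   "clean"         otherwise
classify_i915_log() {
    awk '
        /Failed to reset (chip|GuC)/ { failed = 1 }
        /GPU HANG/                   { hang = 1 }
        END {
            if (failed)    print "unrecoverable"
            else if (hang) print "hang"
            else           print "clean"
        }
    '
}
```

For example, `sudo dmesg | classify_i915_log` would print `unrecoverable` for the log above, since it contains both the hang and the failed reset.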
ffmpeg settings: -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format yuv420p
Describe the expected behavior
Not crash.
Debug information
Note re: vainfo, I also tried a new container with ffmpeg and compiled latest version of vainfo, media driver, gmm, everything - same issue.
root@6d859362545b:/opt/frigate# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 1.12.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_12
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.12 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 21.3.3 (6fdf88c)
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain : VAEntrypointEncSliceLP
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileHEVCMain10 : VAEntrypointEncSliceLP
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointEncSliceLP
VAProfileVP9Profile1 : VAEntrypointVLD
VAProfileVP9Profile1 : VAEntrypointEncSliceLP
VAProfileVP9Profile2 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointEncSliceLP
VAProfileVP9Profile3 : VAEntrypointVLD
VAProfileVP9Profile3 : VAEntrypointEncSliceLP
VAProfileHEVCMain12 : VAEntrypointVLD
VAProfileHEVCMain12 : VAEntrypointEncSlice
VAProfileHEVCMain422_10 : VAEntrypointVLD
VAProfileHEVCMain422_10 : VAEntrypointEncSlice
VAProfileHEVCMain422_12 : VAEntrypointVLD
VAProfileHEVCMain422_12 : VAEntrypointEncSlice
VAProfileHEVCMain444 : VAEntrypointVLD
VAProfileHEVCMain444 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_10 : VAEntrypointVLD
VAProfileHEVCMain444_10 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_12 : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointEncSliceLP
VAProfileHEVCSccMain10 : VAEntrypointVLD
VAProfileHEVCSccMain10 : VAEntrypointEncSliceLP
VAProfileHEVCSccMain444 : VAEntrypointVLD
VAProfileHEVCSccMain444 : VAEntrypointEncSliceLP
VAProfileAV1Profile0 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointEncSliceLP
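Since several comments in this thread compare vainfo output before and after the crash, here is a rough Python sketch for diffing two captures instead of eyeballing them. The function names `parse_vainfo` and `diff_caps` are made up for illustration; the only assumption about the input is that it contains `VAProfile... : VAEntrypoint...` pairs like the listing above.

```python
import re
from collections import defaultdict

# Matches "VAProfileXxx : VAEntrypointYyy" pairs. It does not rely on
# newlines, so it also works on output that was pasted as one long line.
PAIR_RE = re.compile(r"(VAProfile\w+)\s*:\s*(VAEntrypoint\w+)")


def parse_vainfo(text):
    """Return {profile: sorted list of entrypoints} found in vainfo output."""
    caps = defaultdict(set)
    for profile, entrypoint in PAIR_RE.findall(text):
        caps[profile].add(entrypoint)
    return {profile: sorted(eps) for profile, eps in caps.items()}


def diff_caps(before, after):
    """Return the entrypoints present in `before` but missing from `after`."""
    missing = {}
    for profile, eps in before.items():
        lost = sorted(set(eps) - set(after.get(profile, [])))
        if lost:
            missing[profile] = lost
    return missing
```

Typical use would be to save `vainfo` output to one file on a fresh boot and to another after the hang, then call `diff_caps(parse_vainfo(a), parse_vainfo(b))`. An empty result means the advertised profiles did not change, which by itself does not prove the driver is still usable.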
export LIBVA_TRACE=/tmp/libva_trace.log
first, then execute the case.
Only useful logs from libva:
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273421][ctx 0x10000000]==========va_TraceEndPicture
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] render_targets = 0x0000001c
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273504][ctx none]=========vaEndPicture ret = VA_STATUS_ERROR_DECODING_ERROR, internal decoding error
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000]==========va_TraceBeginPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] render_targets = 0x00000019
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] frame_count = #7
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000]==========va_TraceRenderPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] num_buffers = 2
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245559][ctx 0x10000000] --------------
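The libva trace files get large quickly, so a small filter can help pull out just the failing calls. The sketch below is only an assumption about the log layout based on the excerpt above (a `[timestamp][ctx ...]` prefix followed by `=====<function> ret = <status>`); `find_failures` is a hypothetical helper name, and it assumes successful calls are logged with `VA_STATUS_SUCCESS`.

```python
import re

# Matches lines such as:
#   [54444.273504][ctx none]=========vaEndPicture ret = VA_STATUS_ERROR_DECODING_ERROR, ...
# Anything before the timestamp (for example a grep file-name prefix) is ignored.
RET_RE = re.compile(
    r"\[(\d+\.\d+)\]\[ctx [^\]]*\]=+(va\w+) ret = (VA_STATUS_\w+)"
)


def find_failures(trace_text):
    """Return (timestamp, function, status) for every non-success return."""
    return [
        (float(ts), func, status)
        for ts, func, status in RET_RE.findall(trace_text)
        if status != "VA_STATUS_SUCCESS"
    ]
```

Feeding it the concatenated contents of the `/tmp/libva_trace.log.*` files should list each call that returned an error, which makes it easier to see which call failed first.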
Could you attach the dmesg log if it's a GPU hang? You can capture it with
dmesg >dmesg.log 2>&1
[155523.319847] i915 0000:00:02.0: [drm:i915_gem_context_create_ioctl [i915]] HW context 16 created
[155534.199385] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [102504]
[155534.200411] i915 0000:00:02.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
[155534.200945] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
[155534.302952] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.394325] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[155534.394347] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x258/0x2d0 [i915]
[155534.497281] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.499244] i915 0000:00:02.0: [drm] ffmpeg[102504] context reset due to GPU hang
[155534.520720] intel_gt_invalidate_tlbs: 36 callbacks suppressed
[155534.520734] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.525130] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.531383] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.536543] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.540749] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.546000] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.551252] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.556511] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms
Do you want to contribute a patch to fix the issue? (yes/no):
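For anyone collecting the same information for their own report, the following Python sketch pulls the hang record out of a saved `dmesg.log`. It assumes the i915 message format seen in the excerpt above; `find_gpu_hangs` and `chip_reset_failed` are illustrative names, not part of any driver tooling.

```python
import re

# Matches lines such as:
#   [155534.199385] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [102504]
HANG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] i915 (\S+): \[drm\] GPU HANG: "
    r"ecode ([0-9a-fA-F:]+), in (\S+) \[(\d+)\]"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per 'GPU HANG' line found in the dmesg text."""
    hangs = []
    for match in HANG_RE.finditer(dmesg_text):
        ts, device, ecode, process, pid = match.groups()
        hangs.append({
            "time": float(ts),
            "device": device,
            "ecode": ecode,
            "process": process,
            "pid": int(pid),
        })
    return hangs


def chip_reset_failed(dmesg_text):
    """True if the kernel reported that the full chip reset failed."""
    return "ERROR Failed to reset chip" in dmesg_text
```

When `chip_reset_failed` returns True, as it would for the log above, that matches the observation in this thread that VA-API stays inoperable after the first crash.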