ffmpeg HW acceleration crashes GPU on ADL

jvrobert commented 2 years ago

System information

model name : 12th Gen Intel(R) Core(TM) i7-12700K
00:02.0 VGA compatible controller [0300]: Intel Corporation AlderLake-S GT1 [8086:4680] (rev 0c)

No display attached; the GPU is used for rendering only, in ffmpeg.

Issue behavior

Describe the current behavior

When using the latest compiled media driver and ffmpeg 5 (also happens on 4.x) with the latest drm-tip kernel and linux-firmware bins (also happens on the Ubuntu 20.04 HW kernel), ffmpeg (running under Frigate NVR) will support hw acceleration using either qsv or vaapi decode for somewhere between 10-30 minutes (usually, sometimes longer). After that, it crashes the GPU with this error:

[ 4009.472554] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
[ 4009.474067] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.835642] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.836679] i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
[ 4020.837224] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
[ 4020.939613] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.028683] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[ 4021.028762] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x25b/0x2d0 [i915]
[ 4021.131605] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.133494] i915 0000:00:02.0: [drm] ffmpeg[27844] context reset due to GPU hang
[ 4023.672616] ffmpeg[27894]: segfault at 0 ip 0000000000000000 sp 00007fff30a1add8 error 14 in ffmpeg[556214dda000+b000]
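For anyone trying to reproduce this, it may help to pull just the hang events out of the kernel log. The following is a minimal sketch, assuming the "GPU HANG: ecode ..., in <process> [<pid>]" line format shown in the log above; the function name `gpu_hangs` is made up for illustration.

```shell
# Print one line per "GPU HANG" event found in kernel log text on stdin:
#   <timestamp> <process> <pid> <ecode>
# Assumes the i915 message format quoted in this report.
gpu_hangs() {
    sed -n 's/^\[ *\([0-9.]*\)\].*GPU HANG: ecode \([^,]*\), in \([^ ]*\) \[\([0-9]*\)\].*$/\1 \3 \4 \2/p'
}

# Usage on a live system (reading the kernel log may need root):
#   dmesg | gpu_hangs
```

Comparing the timestamps against the time ffmpeg was started gives a rough time-to-hang figure that can be compared between kernel or driver versions.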

ffmpeg settings: -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format yuv420p

Describe the expected behavior

Not crash.

Debug information

What's libva/libva-utils/gmmlib/media-driver version?

root@6d859362545b:/opt/frigate# ls /usr/lib/x86_64-linux-gnu/mfx
/usr/lib/x86_64-linux-gnu/libmfx.so.1
/usr/lib/x86_64-linux-gnu/libmfxhw64.so.1
/usr/lib/x86_64-linux-gnu/libmfx.so.1.35
/usr/lib/x86_64-linux-gnu/libmfxhw64.so.1.35

Note re: vainfo, I also tried a new container with ffmpeg and compiled latest version of vainfo, media driver, gmm, everything - same issue.

root@6d859362545b:/opt/frigate# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 1.12.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_12
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.12 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 21.3.3 (6fdf88c)
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain : VAEntrypointEncSliceLP
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileHEVCMain10 : VAEntrypointEncSliceLP
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointEncSliceLP
VAProfileVP9Profile1 : VAEntrypointVLD
VAProfileVP9Profile1 : VAEntrypointEncSliceLP
VAProfileVP9Profile2 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointEncSliceLP
VAProfileVP9Profile3 : VAEntrypointVLD
VAProfileVP9Profile3 : VAEntrypointEncSliceLP
VAProfileHEVCMain12 : VAEntrypointVLD
VAProfileHEVCMain12 : VAEntrypointEncSlice
VAProfileHEVCMain422_10 : VAEntrypointVLD
VAProfileHEVCMain422_10 : VAEntrypointEncSlice
VAProfileHEVCMain422_12 : VAEntrypointVLD
VAProfileHEVCMain422_12 : VAEntrypointEncSlice
VAProfileHEVCMain444 : VAEntrypointVLD
VAProfileHEVCMain444 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_10 : VAEntrypointVLD
VAProfileHEVCMain444_10 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_12 : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointEncSliceLP
VAProfileHEVCSccMain10 : VAEntrypointVLD
VAProfileHEVCSccMain10 : VAEntrypointEncSliceLP
VAProfileHEVCSccMain444 : VAEntrypointVLD
VAProfileHEVCSccMain444 : VAEntrypointEncSliceLP
VAProfileAV1Profile0 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointEncSliceLP
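The vainfo listing is long, so a small filter can make it easier to see which profiles are advertised for decode (VAEntrypointVLD) or for a particular encode entrypoint. This is only a sketch that assumes the "Profile : Entrypoint" line layout shown above; `va_profiles_with` is a hypothetical helper name.

```shell
# List the VA-API profiles that advertise a given entrypoint.
# Reads vainfo output on stdin; the entrypoint name is the first argument.
va_profiles_with() {
    awk -v ep="$1" -F':' '
        /VAProfile/ {
            # Strip padding around both columns before comparing.
            gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2)
            if ($2 == ep) print $1
        }'
}

# Usage on a live system:
#   vainfo 2>/dev/null | va_profiles_with VAEntrypointVLD
```

The comparison is exact, so asking for VAEntrypointEncSlice does not also return the VAEntrypointEncSliceLP (low-power) rows.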

Could you provide libva trace log if possible? Run cmd export LIBVA_TRACE=/tmp/libva_trace.log first then execute the case.

Only useful logs from libva:

/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273421][ctx 0x10000000]==========va_TraceEndPicture
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] render_targets = 0x0000001c
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273504][ctx none]=========vaEndPicture ret = VA_STATUS_ERROR_DECODING_ERROR, internal decoding error
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000]==========va_TraceBeginPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] render_targets = 0x00000019
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] frame_count = #7
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000]==========va_TraceRenderPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] num_buffers = 2
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245559][ctx 0x10000000] --------------
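A full libva trace gets large quickly, so one way to find the interesting part is to keep only the calls that returned an error status. The sketch below assumes the "====vaSomething ret = VA_STATUS_ERROR_..." line shape seen in the excerpt above; the helper name `va_trace_errors` is invented here.

```shell
# Print "<va call> <status>" for every libva call that returned an error.
# Reads libva trace text on stdin, or from the files given as arguments.
va_trace_errors() {
    sed -n 's/^.*=\{2,\}\(va[A-Za-z0-9]*\) ret = \(VA_STATUS_ERROR[A-Z_]*\).*$/\1 \2/p' "$@"
}

# Usage, after running the failing case with LIBVA_TRACE=/tmp/libva_trace.log:
#   va_trace_errors /tmp/libva_trace.log.*
```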

Could you attach dmesg log if it's GPU hang by dmesg >dmesg.log 2>&1?

[155523.319847] i915 0000:00:02.0: [drm:i915_gem_context_create_ioctl [i915]] HW context 16 created
[155534.199385] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [102504]
[155534.200411] i915 0000:00:02.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
[155534.200945] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
[155534.302952] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.394325] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[155534.394347] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x258/0x2d0 [i915]
[155534.497281] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.499244] i915 0000:00:02.0: [drm] ffmpeg[102504] context reset due to GPU hang
[155534.520720] intel_gt_invalidate_tlbs: 36 callbacks suppressed
[155534.520734] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.525130] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.531383] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.536543] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.540749] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.546000] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.551252] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.556511] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms

Do you want to contribute a patch to fix the issue? (yes/no):

softworkz commented 2 years ago

(Thankfully we're well past the dark days of OpenCL 1.0)

You mean because of surface sharing vendor extensions?

FCLC commented 2 years ago

Why "falling back on QSV"?

QSV seems to perform ~7% worse than the VAAPI codepath for HEVC decoding, for example, but seems to be unaffected by the GPU crash that caused the initial crash I encountered here.
You'd lose some performance, but it would still allow some hardware acceleration via the dedicated Quick Sync hardware blocks while remaining reliable.

(Thankfully we're well past the dark days of OpenCL 1.0)

You mean because of surface sharing vendor extensions?

More so a comment on how long it took for any of the major hardware vendors to get serious about media support on linux, specifically open source support back in the day

get around the lack of HDR support in the AMDGPU vaapi driver

They are doing so little to get better AMD support into ffmpeg...we just have minimal support for these..

In the sense of AMD putting in work or FFMPEG putting in work?

it's a little off topic, but ffmpeg supports AMF as much as possible (see https://ffmpeg.org/general.html#toc-AMD-AMF_002fVCE). The reason for much better support for intel chips has been the amount of work put in by intel engineers on the mailing lists, same as the work they put into the kernel for example

softworkz commented 2 years ago

QSV seems to perform ~7% worse than the VAAPI codepath for HEVC decoding, for example, but seems to be unaffected by the GPU crash that caused the initial crash I encountered here.

Oh, I totally forgot to ask about trying QSV, I thought I had..

In the sense of AMD putting in work or FFMPEG putting in work?

it's a little off topic, but ffmpeg supports AMF as much as possible (see ffmpeg.org/general.html#toc-AMD-AMF_002fVCE). The reason for much better support for intel chips has been the amount of work put in by intel engineers on the mailing lists, same as the work they put into the kernel for example

That's what I meant - there's not much effort taken when comparing, no filtering and no decoders. But anyway, that's really OT..

Back to Intel: we've been supporting it for many years on Linux and Windows, and I thought I knew everything there was to know, but until a few months ago I had never heard of GuC and HuC. So that whole situation feels a bit like a bad dream.. ;-)

softworkz commented 2 years ago

QSV seems to perform ~7% worse than the VAAPI codepath for HEVC decoding, for example, but seems to be unaffected by the GPU crash that caused the initial crash I encountered here.

Oh, I totally forgot to ask about trying QSV, I thought I had..

The user said he had the same symptom with QSV (encoding and decoding). Unfortunately, he also said that he's had enough of this and will return the device to the store.

FCLC commented 2 years ago

The user said he had the same symptom with QSV (encoding and decoding). Unfortunately, he also said that he's had enough of this and will return the device to the store.

It seems I've been able to recreate the QSV behaviour on 5.18-rc1 as well, in very specific scenarios.

While running

ffmpeg -hwaccel qsv -c:v hevc_qsv -i hdr_source.mp4 -vf 'vpp_qsv=framerate=60,scale_qsv=w=1920:h=1080:format=nv12' -c:v h264_qsv output/output/sdr_out.mp4 -y

if the command is interrupted, the GPU ends up in an unrecoverable state.

softworkz commented 2 years ago

command is interrupted

q or sigterm?

FCLC commented 2 years ago

q or sigterm?

I did it by accident, need to reboot the machine now and double check 😅

Edit: it's neither. ~15-20 seconds into transcoding I received

[Parsed_vpp_qsv_0 @ 0x23ecc00] Error running VPP: unknown error (-21)890.5kbits/s speed=2.08x    
Error while filtering: Unknown error occurred
Failed to inject frame into filter network: Unknown error occurred
Error while processing the decoded data for stream #0:0

and the application did not exit.

It then needs to be hard exited via 3 system signals.
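When a test run can end up stuck like this, one option is to bound it with coreutils `timeout` so it is terminated, and then killed, without manual intervention. This is a sketch only: `run_bounded` is a made-up wrapper name, the limits are arbitrary, and a process blocked inside a hung GPU driver may not respond even to SIGKILL.

```shell
# Run a command with a time limit: send SIGTERM after <limit> seconds,
# then SIGKILL if it is still alive <grace> seconds later.
# Exit status is 124 if the limit was hit and SIGTERM was enough.
run_bounded() {
    limit=$1
    grace=$2
    shift 2
    timeout -k "$grace" "$limit" "$@"
}

# Usage (hypothetical limits, ffmpeg arguments elided):
#   run_bounded 120 10 ffmpeg -hwaccel qsv ...
```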

softworkz commented 2 years ago

MFX_ERR_GPU_HANG = -21, /* device operation failure caused by GPU hang */
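To connect this to the ffmpeg output above: the "(-21)" in "Error running VPP: unknown error (-21)" is the raw mfxStatus value. A small lookup like the following can turn such codes into names when reading logs. It is a sketch that lists only a handful of mfxStatus values (the -21 entry is the one quoted here, the others are from the Media SDK headers as far as I know); both function names are hypothetical.

```shell
# Map a few mfxStatus codes to their names. Deliberately incomplete.
mfx_status_name() {
    case "$1" in
        0)   echo MFX_ERR_NONE ;;
        -1)  echo MFX_ERR_UNKNOWN ;;
        -3)  echo MFX_ERR_UNSUPPORTED ;;
        -17) echo MFX_ERR_DEVICE_FAILED ;;
        -21) echo MFX_ERR_GPU_HANG ;;
        *)   echo "unlisted status $1" ;;
    esac
}

# Pull the numeric code out of an ffmpeg "... error (-21)" message on stdin.
ffmpeg_mfx_code() {
    sed -n 's/.*(\(-\{0,1\}[0-9][0-9]*\)).*/\1/p'
}

# Usage:
#   mfx_status_name "$(echo 'Error running VPP: unknown error (-21)' | ffmpeg_mfx_code)"
```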

softworkz commented 2 years ago

Have you ever tried what happens when you disable HuC and GuC?

FCLC commented 2 years ago

Have you ever tried what happens when you disable HuC and GuC?

Without the firmware loaded, the iGPUs would not be available at all. The firmware is loaded by the kernel through the i915 driver.

further work:

Simplifying the command to

ffmpeg -hwaccel qsv -c:v hevc_qsv -hwaccel_output_format qsv -i hdr_source.mp4 -vf 'scale_qsv=w=1920:h=1080:format=nv12' -c:v h264_qsv output/output/sdr_out.mp4 -y

crashes in ~3-4 seconds, compared to about 15-20 before.

softworkz commented 2 years ago

Without the firmware loadout the iGPU's would not be available at all. They are loaded in by the kernel as a driver via i915.

I mean to set i915.enable_guc=0

It's not a requirement - at least not on TGL and below.
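For reference, `enable_guc` is a bitmask. The sketch below spells out the values as given in the i915 module parameter description (-1 = auto, 0 = disable, 1 = GuC submission, 2 = HuC load). The function name is made up, and the sysfs path in the usage comment is an assumption about where the running value can be read; what the driver actually loaded is best confirmed from dmesg.

```shell
# Describe an i915.enable_guc value according to the parameter's bitmask:
# bit 0 = GuC submission, bit 1 = HuC load, -1 = platform default.
describe_enable_guc() {
    case "$1" in
        -1) echo "auto (driver default for the platform)" ;;
        0)  echo "disabled" ;;
        1)  echo "GuC submission on, HuC load off" ;;
        2)  echo "GuC submission off, HuC load on" ;;
        3)  echo "GuC submission on, HuC load on" ;;
        *)  echo "unrecognised value: $1" ;;
    esac
}

# Usage on a live system (root is usually required to read the parameter):
#   describe_enable_guc "$(cat /sys/module/i915/parameters/enable_guc)"
```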

softworkz commented 2 years ago

I had realized that I have a notebook here with the exact same graphics as the other user (TGL). And it just works fine.

The one thing that's special about it: I haven't updated the OS or OS components for more than a year. GuC and HuC are off

FCLC commented 2 years ago

I had realized that I have a notebook here with the exact same graphics as the other user (TGL). And it just works fine.

Can you double check that it is using the iGPU? Also can you check using the commands above and see how your chip reacts?

RE GuC/HuC

Interesting. From the documentation, HuC and GuC support seems to be needed for some features. I'll try it now.

softworkz commented 2 years ago

HuC is mandatory for JSL and EHL, because these two don't support "Es" encoding but only "E" (low power..)

Es (Hardware(PAK) + Shader(media kernel+VME) Encoding) is not supported on JSL & EHL, only E (Hardware Encoding, Low Power Encoding(VDEnc/Huc)) is supported, however E depends on GuC / HuC firmwares.

https://github.com/intel/media-driver#decodingencoding-features

softworkz commented 2 years ago

Can you double check that it is using the iGPU? Also can you check using the commands above and see how your chip reacts?

I don't need to check, because it still shows a tonemap_vaapi bug that I reported a year ago. It's fixed, but I never updated the machine, so it's still showing ;-)

Also:

FCLC commented 2 years ago

Could you perhaps load an ISO onto a USB drive, as a way to do some A/B testing?

softworkz commented 2 years ago

I don't get that - why?

FCLC commented 2 years ago

I don't get that - why?

if it is a question of version, bisecting where the problem began could be a way to accelerate tracking down and closing this issue (where the regression was introduced etc.)

edit:

for example, what is the first operating system/software version that can and cannot run

ffmpeg -hwaccel qsv -c:v hevc_qsv -i hdr_source.mp4 -vf 'hwupload=extra_hw_frames=64,format=qsv' -c:v hevc_qsv -profile:v main10 output/output/sdr_out.mp4 -y

softworkz commented 2 years ago

Ah - you mean ISO of OS? I thought you meant video ISO like Bluray..

FCLC commented 2 years ago

Ah - you mean ISO of OS? I thought you meant video ISO like Bluray..

LOL, my bad, the joys of mixed terminologies. Yes, I mean operating system/software version/system firmware

softworkz commented 2 years ago

I'm not sure I have the enthusiasm to try different OS versions.

But the one I have is: Operating system: Linux version 5.8.0-53-generic (buildd@lcy01-amd64-012) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #60~20.04.1-

Driver Info: 'TigerLake-LP GT2 Iris Xe Graphics' Id:39497 (Driver: Intel iHD driver for Intel(R) Gen Graphics - 21.2.0 (4436d2f), Vendor: Intel Corporation)

softworkz commented 2 years ago

No problem with QSV either, including OpenCL:

FCLC commented 2 years ago

No problem with QSV either, including OpenCL:

Interesting, thanks for this!

Could you send the results of

sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info

sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info

(This is me doing a sanity check for myself)

Would it also be possible to send the corresponding ffmpeg command of the pipeline from the diagram above? Just to make sure that anything I try on my end is identical.

Ideally we both also use the same source files (perhaps use the Sony demo file I linked near the top of the issue?)

softworkz commented 2 years ago

sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info

sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info

This was the first thing I had done, and it's why I said that both are disabled.

softworkz commented 2 years ago

/opt/emby-server/bin/ffmpeg -loglevel +timing -y -print_graphs_file "/var/lib/emby/logs/ffmpeg-transcode-191bc797-9e2d-4a71-acae-91af9efc102f_1graph.txt" -copyts -start_at_zero -qsv_device /dev/dri/renderD128 -f mp4 -c:v:0 hevc_qsv -hwaccel:v:0 qsv -i "/home/jay/Videos/HDR/Sony Swordsmith HDR UHD 4K Demo.mp4" -filter_complex "[0:0]vpp_qsv@f1=width=640:height=360,setparams@f2=color_primaries=bt2020:color_trc=smpte2084:colorspace=bt2020nc,hwmap@f3=mode=+read:derive_device=opencl,tonemap_opencl@f4=tonemap=hable:format=nv12:desat=0,hwmap@f5=mode=+write:derive_device=qsv:reverse=1:extra_hw_frames=16[f5_out0]" -map [f5_out0] -map 0:1 -sn -c:v:0 h264_qsv -b:v:0 1808000 -g:v:0 180 -maxrate:v:0 1808000 -bufsize:v:0 3616000 -sc_threshold:v:0 0 -level:v:0 31 -keyint_min:v:0 180 -profile:v:0 high -c:a:0 copy -metadata:s:a:0 language=eng -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list "/var/lib/emby/transcoding-temp/DA6020.m3u8" -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -write_header_trailer 0 -segment_write_temp 1 "/var/lib/emby/transcoding-temp/DA6020_%d.ts"

I'm not sure whether it will work, though. The opencl tonemap filter is different from the regular one and I'm not sure whether the opencl-qsv mapping is working in normal ffmpeg for 10bit

FCLC commented 2 years ago

sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info

sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info

This was the first I had done and why I said that both are disabled

Hadn't realized that that was how you had confirmed it.

Once we have the same commands and same source files, that should make testing much easier

Edit: just saw the command.

I'll see how much of it is portable/if it's portable.

softworkz commented 2 years ago

Probably it's better with a VAAPI command because we have less modifications there:

/opt/emby-server/bin/ffmpeg -y -copyts -start_at_zero -f mp4 -c:v:0 hevc -hwaccel:v:0 vaapi -hwaccel_device:v:0 /dev/dri/renderD128 -hwaccel_output_format:v:0 vaapi -i "/home/jay/Videos/HDR/Sony Swordsmith HDR UHD 4K Demo.mp4" -filter_complex "[0:0]scale_vaapi@f1=w=640:h=360,tonemap_vaapi@f2=format=nv12:matrix=bt709:primaries=bt709:transfer=bt709[f2_out0]" -map [f2_out0] -map 0:1 -sn -c:v:0 h264_vaapi -b:v:0 1808000 -g:v:0 180 -maxrate:v:0 1808000 -bufsize:v:0 3616000 -sc_threshold:v:0 0 -keyint_min:v:0 180 -profile:v:0 high -level:v:0 3.1 -c:a:0 copy -metadata:s:a:0 language=eng -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list "/var/lib/emby/transcoding-temp/794DC8.m3u8" -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -write_header_trailer 0 "/var/lib/emby/transcoding-temp/794DC8_%d.ts"

Which does this:

FCLC commented 2 years ago

Fair point, vaapi first, then QSV?

Also, I need to head soon, but to summarize what we think we know so far:

Alder Lake:

VAAPI has issues prior to kernel 5.18-rc1.

QSV has issues in 5.16, 5.17 and 5.18-rc1.

Tiger Lake:

VAAPI works fine as of at least 5.8 (Ubuntu), but does not work as of 5.10 (Debian).

The same applies to QSV.

softworkz commented 2 years ago

Have you tried with GuC/HuC off?

FCLC commented 2 years ago

Also, since we have the location of the emby binary, we could summon it directly.

Alternatively would you be able to build ffmpeg from source?

softworkz commented 2 years ago

Alternatively would you be able to build ffmpeg from source?

sure, but why?

softworkz commented 2 years ago

Also, since we have the location of the emby binary, we could summon it directly.

It's a bit tricky. You need to run the 'ffmpeg-emby' stub next to the binary, because this will set everything up according to the context of the package (which contains its own versions of libva, iHD, etc.).

FCLC commented 2 years ago

Alternatively would you be able to build ffmpeg from source?

sure, but why?

If I have time, I'm going to dig through the mailing lists and see if there's anything related that might not be in the 5.0.x release yet.

Speaking of which, is emby on the 5.x ffmpeg branch? Or still on 4.x?

softworkz commented 2 years ago

It's not a matter of which exact command and it's not a matter of which ffmpeg branch. Either it's working or not (when both, decoding and encoding are done in hw).

FCLC commented 2 years ago

Have you tried with GuC/HuC off?

AFK currently, replying from a cab 😅

softworkz commented 2 years ago

OMG - the very one and only question and you didn't try....

softworkz commented 2 years ago

What I can say for sure is that this is something that must have been introduced just very recently. Otherwise we would have had reports about it before.

FCLC commented 2 years ago

It's not a matter of which exact command and it's not a matter of which ffmpeg branch.

Either it's working or not (when both, decoding and encoding are done in hw).

I'm trying to control for as many variables and have as much data as possible to narrow down the issue.

Part of my exhaustive questioning is because I noticed that the hardware, even within different versions of using the same hardware blocks, reacted differently in its crash behaviour. See the vpp_scale command above for example. Still using Qsv encode and decode, failing the same way as other Qsv encode and decode blocks on the same file, and the frame rate command it's executing is actually a no-op because the footage is already 60p. Yet it failed almost instantly

FCLC commented 2 years ago

Will be back home in a few, picking someone up

softworkz commented 2 years ago

Part of my exhaustive questioning is because I noticed that the hardware, even within different versions of using the same hardware blocks, reacted differently in its crash behaviour. See the vpp_scale command above for example. Still using Qsv encode and decode, failing the same way as other Qsv encode and decode blocks on the same file, and the frame rate command it's executing is actually a no-op because the footage is already 60p. Yet it failed almost instantly

Don't open up so many different dimensions at the same time. You will only confuse yourself and lose focus. I actually had to skip reading many of the posts at the beginning because every second message was about something different and I wasn't able to sort it (mentally) in a reasonable way. Better to focus on something small and simple, stick to it and try it under different conditions.

FCLC commented 2 years ago

Running

ffmpeg -hwaccel qsv -c:v hevc_qsv -hwaccel_output_format qsv -i hdr_source.mp4 -vf 'hwupload=extra_hw_frames=64,format=qsv' -c:v hevc_qsv -b:v 10M -profile:v main10 output/output/sdr_out.mp4 -y

does seem to work fine. It can also be interrupted without issue.

Logs:

sudo cat /etc/modprobe.d/i915.conf 
options i915 enable_guc=0

 sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info
GuC firmware: i915/tgl_guc_69.0.3.bin
    status: RUNNING
    version: wanted 69.0, found 69.0
    uCode: 342912 bytes
    RSA: 256 bytes

GuC status 0x8003f0ec:
    Bootrom status = 0x76
    uKernel status = 0xf0
    MIA Core status = 0x3

Scratch registers:
     0:     0x0
     1:     0x163fdf
     2:     0x40000
     3:     0x4000
     4:     0x40
     5:     0x2ec8
     6:     0x4680000c
     7:     0x0
     8:     0x0
     9:     0x0
    10:     0x0
    11:     0x0
    12:     0x0
    13:     0x0
    14:     0x0
    15:     0x0

GuC log relay not created

sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info
HuC firmware: i915/tgl_huc_7.9.3.bin
    status: RUNNING
    version: wanted 7.9, found 7.9
    uCode: 589504 bytes
    RSA: 256 bytes
HuC status: 0x00090001
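Since the question in this part of the thread is whether the firmware is actually running, a short filter can reduce each debugfs dump to the two fields that matter. This is a sketch assuming the guc_info / huc_info layout shown in the output above; `uc_fw_status` is a hypothetical helper name.

```shell
# Print "<firmware path> <status>" from guc_info or huc_info text on stdin.
# Only the indented "status:" line is used, not the "GuC status 0x..." or
# "HuC status: 0x..." register lines.
uc_fw_status() {
    awk '
        /^(GuC|HuC) firmware:/ { fw = $3 }
        /^[ \t]*status:/       { st = $2 }
        END { if (fw != "") print fw, st }'
}

# Usage on a live system:
#   sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info | uc_fw_status
#   sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info | uc_fw_status
```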

softworkz commented 2 years ago

Why does it say "RUNNING"?

FCLC commented 2 years ago

Why does it say "RUNNING"?

Frankly, I'm not certain on this one, but from this: https://01.org/linuxgraphics/downloads/firmware?langredirect=1 it looks like for ADL-P and above, as of 5.14, it's automatically enabled when available.

I'm on ADL-S, which does not have it on automatically.

We'd have to have one of the Intel Engineers comment.

Here's the relevant support matrix from the PDF:

softworkz commented 2 years ago

Why does it say "RUNNING"?

Frankly I'm not certain on this one.

Even when enabled by default, it should get disabled when specifying 0

Did you reboot after changing the setting?

FCLC commented 2 years ago

Why does it say "RUNNING"?

Frankly I'm not certain on this one.

Even when enabled by default, it should get disabled when specifying 0

Did you reboot after changing the setting?

yes I did.

By doing so, the issue also seems to have gone away.

Purely a theory:

Perhaps under normal circumstances the firmware is running, but only loaded when its specialty features are needed?

But when force-loaded, the firmware causes conflicts with the primary driver?

softworkz commented 2 years ago

HuC is required for low-power encoding and bitrate control, but low-power encoding is not what you normally get with all those ffmpeg commands. HuC is only mandatory with JSL and EHL, because these only support low-power encoding

Perhaps under normal circumstances the firmware is running, but only loaded when its specialty features are needed? But when force-loaded, the firmware causes conflicts with the primary driver?

I can't imagine that. Either it gets loaded or not.

You can use dmesg | grep HuC to get more information and try to find out the difference.
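To make the comparison between two boots easier (for example with and without the module option), the firmware-related lines can be stripped of their timestamps so they diff cleanly. A minimal sketch, assuming the usual "[timestamp] message" dmesg layout; `uc_boot_lines` is an invented name.

```shell
# Keep only kernel log lines mentioning GuC or HuC and drop the leading
# "[ timestamp ]" so that output from two boots can be compared with diff.
uc_boot_lines() {
    grep -E 'GuC|HuC' | sed 's/^\[[^]]*\] *//'
}

# Usage: save one file per boot, then compare them.
#   dmesg | uc_boot_lines > uc_with_option.txt
#   (reboot with the other setting)
#   dmesg | uc_boot_lines > uc_without_option.txt
#   diff uc_with_option.txt uc_without_option.txt
```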

FCLC commented 2 years ago

I can't imagine that. Either it gets loaded or not.

[ +0.002855] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_69.0.3.bin version 69.0
[ +0.000002] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[ +0.013152] i915 0000:00:02.0: [drm] HuC authenticated
[ +0.000000] i915 0000:00:02.0: [drm] GuC submission disabled
[ +0.000001] i915 0000:00:02.0: [drm] GuC SLPC disabled
[ +0.000705] i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized

softworkz commented 2 years ago

And when you revert the kernel option change?

FCLC commented 2 years ago

And when you revert the kernel option change?

as in

sudo cat /etc/modprobe.d/i915.conf 
options i915 enable_guc=0

?

that was prior to reboot.

softworkz commented 2 years ago

I mean back to the state before where you were seeing the gpu errors.
