intel / media-driver

Intel Graphics Media Driver to support hardware decode, encode and video processing.
https://github.com/intel/media-driver/wiki

[Bug]: HEVC decoding fails on DG1 when using upstream kernel instead of Intel DKMS #1415

Closed eero-t closed 7 months ago

eero-t commented 2 years ago

### Which component impacted?

Decode

### Is it regression? Good in old configuration?

_No response_

### What happened?

**Use-cases**

**Expected outcome**

Both of the above transcode at hundreds of FPS, as is the case with the TGL iGPU with exactly the same setup, or when I change the input to an H.264 one.

**Actual outcome**

CONFIGURE LOADER: required implementation: hw
CONFIGURE LOADER: required implementation mfxAccelerationMode: MFX_ACCEL_MODE_VIA_VAAPI
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
Session 0:
Loaded Library configuration:
    Version: 2.7
    ImplName: mfx-gen
    Adapter number : 0
    Adapter type: integrated
    DRMRenderNodeNum: 128
Used implementation number: 0
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35
   1: /usr/local/lib/libmfx-gen.so.1.2.7

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_ABORTED(-12), PutBS, Encode: SyncOperation failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2112

[ERROR], sts=MFX_ERR_ABORTED(-12), Transcode, PutBS failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2068

[ERROR], sts=MFX_ERR_ABORTED(-12), Run, CTranscodingPipeline::Run::Transcode() [0x55ba117e6c90] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4677

session 0 [0x55ba117e6c90] failed with status MFX_ERR_ABORTED shutting down the application...

session [0x55ba117e6c90] m_bForceStop is set

Transcoding finished

Common transcoding time is 2.88576 sec

*** session 0 [0x55ba117e6c90] FAILED (MFX_ERR_ABORTED) 2.88555 sec, 4 frames, 1.386 fps -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null


The test FAILED

[ERROR], sts=MFX_ERR_ABORTED(-12), main, transcode.ProcessResult failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:1561


I do not know whether this is a regression. There have been too many issues to say for sure whether it has ever worked on the 0x4905 device.

### What's the usage scenario when you are seeing the problem?

Transcode for media delivery

### What impacted?

_No response_

### Debug Information

**Setup**

* GPU: DG1 (0x4905)
* Ubuntu 20.04.4 distro
* drm-tip 5.18 kernel
* media stack components built from latest release tags (as of today):
   - libva:    2.14.0
   - GMMlib:   intel-gmmlib-22.1.3
   - Media:    intel-media-22.4.2
   - MediaSDK: intel-mediasdk-22.4.2
   - oneVPL:   v2022.1.3
   - VPL-GPU:  intel-onevpl-22.4.2
   - FFmpeg:   n5.0.1

**VA-info**

libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.14 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 22.4.2 (d7a1feb)
vainfo: Supported profile and entrypoints
      VAProfileNone : VAEntrypointVideoProc
      VAProfileNone : VAEntrypointStats
      VAProfileMPEG2Simple : VAEntrypointVLD
      VAProfileMPEG2Simple : VAEntrypointEncSlice
      VAProfileMPEG2Main : VAEntrypointVLD
      VAProfileMPEG2Main : VAEntrypointEncSlice
      VAProfileH264Main : VAEntrypointVLD
      VAProfileH264Main : VAEntrypointEncSlice
      VAProfileH264Main : VAEntrypointFEI
      VAProfileH264Main : VAEntrypointEncSliceLP
      VAProfileH264High : VAEntrypointVLD
      VAProfileH264High : VAEntrypointEncSlice
      VAProfileH264High : VAEntrypointFEI
      VAProfileH264High : VAEntrypointEncSliceLP
      VAProfileVC1Simple : VAEntrypointVLD
      VAProfileVC1Main : VAEntrypointVLD
      VAProfileVC1Advanced : VAEntrypointVLD
      VAProfileJPEGBaseline : VAEntrypointVLD
      VAProfileJPEGBaseline : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileHEVCMain : VAEntrypointVLD
      VAProfileHEVCMain : VAEntrypointEncSlice
      VAProfileHEVCMain : VAEntrypointFEI
      VAProfileHEVCMain : VAEntrypointEncSliceLP
      VAProfileHEVCMain10 : VAEntrypointVLD
      VAProfileHEVCMain10 : VAEntrypointEncSlice
      VAProfileHEVCMain10 : VAEntrypointEncSliceLP
      VAProfileVP9Profile0 : VAEntrypointVLD
      VAProfileVP9Profile0 : VAEntrypointEncSliceLP
      VAProfileVP9Profile1 : VAEntrypointVLD
      VAProfileVP9Profile1 : VAEntrypointEncSliceLP
      VAProfileVP9Profile2 : VAEntrypointVLD
      VAProfileVP9Profile2 : VAEntrypointEncSliceLP
      VAProfileVP9Profile3 : VAEntrypointVLD
      VAProfileVP9Profile3 : VAEntrypointEncSliceLP
      VAProfileHEVCMain12 : VAEntrypointVLD
      VAProfileHEVCMain12 : VAEntrypointEncSlice
      VAProfileHEVCMain422_10 : VAEntrypointVLD
      VAProfileHEVCMain422_10 : VAEntrypointEncSlice
      VAProfileHEVCMain422_12 : VAEntrypointVLD
      VAProfileHEVCMain422_12 : VAEntrypointEncSlice
      VAProfileHEVCMain444 : VAEntrypointVLD
      VAProfileHEVCMain444 : VAEntrypointEncSliceLP
      VAProfileHEVCMain444_10 : VAEntrypointVLD
      VAProfileHEVCMain444_10 : VAEntrypointEncSliceLP
      VAProfileHEVCMain444_12 : VAEntrypointVLD
      VAProfileHEVCSccMain : VAEntrypointVLD
      VAProfileHEVCSccMain : VAEntrypointEncSliceLP
      VAProfileHEVCSccMain10 : VAEntrypointVLD
      VAProfileHEVCSccMain10 : VAEntrypointEncSliceLP
      VAProfileHEVCSccMain444 : VAEntrypointVLD
      VAProfileHEVCSccMain444 : VAEntrypointEncSliceLP
      VAProfileAV1Profile0 : VAEntrypointVLD
      VAProfileHEVCSccMain444_10 : VAEntrypointVLD
      VAProfileHEVCSccMain444_10 : VAEntrypointEncSliceLP
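
For anyone who wants to script this check, here is a minimal Python sketch that turns vainfo text into (profile, entrypoint) pairs and tests whether HEVC Main decode is advertised. The function name `parse_vainfo` and the shortened sample string are illustrative only and not part of any tool. Note that an advertised entrypoint only says what the driver claims to support; as this report shows, decoding can still fail at run time.

```python
import re

# Matches entries such as "VAProfileHEVCMain : VAEntrypointVLD", including
# the case where several entries ended up on the same line.
PAIR_RE = re.compile(r"(VAProfile\w+)\s*:\s*(VAEntrypoint\w+)")


def parse_vainfo(text):
    """Return the set of (profile, entrypoint) pairs found in vainfo output."""
    return set(PAIR_RE.findall(text))


# Shortened sample taken from the vainfo output above.
sample = (
    "VAProfileH264Main : VAEntrypointVLD "
    "VAProfileHEVCMain : VAEntrypointVLD "
    "VAProfileHEVCMain : VAEntrypointEncSlice "
    "VAProfileH264ConstrainedBaseline: VAEntrypointVLD"
)
pairs = parse_vainfo(sample)
hevc_decode = ("VAProfileHEVCMain", "VAEntrypointVLD") in pairs
print("HEVC Main decode advertised:", hevc_decode)
```

On a live system the text would come from running `vainfo` and capturing its standard output.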


**Notes**

There are no GPU hangs.  Kernel driver output / settings:

# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=3 i915.force_probe=4905 ro
[    0.026206] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=3 i915.force_probe=4905 ro
[    2.582581] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.582586] fb0: switching to i915 from EFI VGA
[    2.582684] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.582707] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.582708] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.597310] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.600584] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.667756] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.667759] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.673370] i915 0000:03:00.0: [drm] HuC authenticated
[    2.673789] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.673790] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.674046] i915 0000:03:00.0: [drm] GuC RC: enabled



### Do you want to contribute a patch to fix the issue?

No.

eero-t commented 2 years ago

I've also tested Sysman functionality and simple OpenCL programs. Those work fine, so in general the drm-tip kernel seems to work fine.

Btw. the media-driver README still states the following:

Media-driver requires special i915 kernel mode driver (KMD) version to support the following new platforms since upstream version of i915 KMD does not fully support them (pending patches upstream):

* DG1/SG1
* Alchemist(DG2)/ATSM

By default, media-driver builds against upstream i915 KMD and will miss support for the platforms listed above. To enable new platforms which require special i915 KMD and specify ENABLE_PRODUCTION_KMD=ON (default: OFF) build configuration option.

Although AFAIK that has not been true for DG1 for over half a year, since this media-driver commit: https://github.com/intel/media-driver/commit/db5a8706bf9214ef12542eaab1896dacfdedfceb

And DG2 / ATS-M support is already in the public kernel (for some of their variants, and requiring force-probing for now).

XinfengZhang commented 2 years ago

Since the 22'Q1 release, media-driver no longer supports DG1 with ENABLE_PRODUCTION_KMD=ON. This option now only supports DG2 with suitable kernel support (https://github.com/intel-gpu/intel-gpu-i915-backports/tree/ubuntu/main). We will update the document. If you still want DG1 against https://github.com/intel-gpu/kernel, please use the 21'Q4 release.

eero-t commented 2 years ago

The note about outdated DG1 info in README was just FYI.

I am using ENABLE_PRODUCTION_KMD=OFF with public kernel, and that fails for me on DG1 with HEVC.

(There may have been a 3D + AVC transcode running at the same time in the backend while I was running this test-case, but that should not have broken HEVC, as dmesg does not show any errors.)

Xiaogangli-intel commented 2 years ago

Hi @eero-t, I noticed i915.enable_guc=3 in your kernel parameters, could you try i915.enable_guc=2? Seems GuC submission doesn't work on DG1.

Xiaogangli-intel commented 2 years ago

Hi @eero-t , media still have some issues on drm-tip KMD for DG1. Could you please try this KMD at https://github.com/intel-gpu/intel-gpu-i915-backports, and need to build media driver with ENABLE_PRODUCTION_KMD=ON, also i915.enable_guc=2 in kernel boot parameters.

dvrogozh commented 2 years ago

At the moment there are 2 possible ways to set up DG1:

  1. Use the vanilla kernel (or drm-tip). DG1 support is still not finalized there, so the user should use i915.force_probe=* (or a specific device id) to enable it. You don't need any special media driver build options for that.
  2. Use the custom kernel (or rather kernel module) which @Xiaogangli-intel suggests above, https://github.com/intel-gpu/intel-gpu-i915-backports. Use ENABLE_PRODUCTION_KMD=ON to build media-driver.

It's up to the user to decide which kernel to use. However, in both cases the user is NOT supposed to adjust the i915.enable_guc option in any way. This is a very risky option, and the user should clearly understand why they are changing it.

@eero-t : I strongly suggest dropping i915.enable_guc from the cmdline and trying again. I vaguely recall you had some issue because of setting this option before. Hope this helps. If not, then you've found an LGTM issue for the vanilla kernel which the media team will need to look at.
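
A leftover `i915.enable_guc` override is easy to miss across several machines, so here is a small Python sketch that pulls the `i915.*` options out of a kernel command line and decodes the `enable_guc` value. The helper names are hypothetical. The decoding assumes the bitmask meaning given in the i915 module parameter description (bit 0 selects GuC submission, bit 1 selects HuC load, a negative value leaves the choice to the driver), which matches the dmesg outputs in this thread.

```python
def i915_params(cmdline):
    """Return the i915.* options from a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, value = token[len("i915."):].split("=", 1)
            params[key] = value
    return params


def decode_enable_guc(value):
    """Decode the enable_guc bitmask (bit 0: GuC submission, bit 1: HuC load).

    Returns None for negative values, which leave the choice to the driver.
    """
    value = int(value)
    if value < 0:
        return None
    return {"guc_submission": bool(value & 1), "huc_load": bool(value & 2)}


# Command line copied from the first dmesg output in this report.
cmdline = (
    "BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait "
    "fsck.repair=yes i915.enable_guc=3 i915.force_probe=4905 ro"
)
params = i915_params(cmdline)
override = params.get("enable_guc")
if override is not None:
    print("i915.enable_guc is set explicitly:", decode_enable_guc(override))
```

On a running system the command line can be read from `/proc/cmdline` instead of being pasted in.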

eero-t commented 2 years ago

Good catch. I'll test upstream kernel without the GuC option tomorrow, and report back.

(I've intended to clean that out, but had forgotten to do it for all kernel configs on all machines.)

eero-t commented 2 years ago

@dvrogozh GuC scheduling is enabled by the public "drm-tip" kernel (from yesterday), even when it's not forced:

# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    0.026212] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    2.081413] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.081417] fb0: switching to i915 from EFI VGA
[    2.081679] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.081714] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.081715] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.094978] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.099214] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.165804] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.165807] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.171291] i915 0000:03:00.0: [drm] HuC authenticated
[    2.171568] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.171569] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.171828] i915 0000:03:00.0: [drm] GuC RC: enabled
[    2.207306] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0

And the same issue persists.

Note: I forgot to mention earlier, but in case it matters, all of these nodes have 2 0x4905 DG1 GPUs. Limiting media-driver devfs visibility just to first one (with Docker) did not change anything though.
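
Since these nodes have two GPUs, it helps to read the GuC state per PCI device rather than by eye. Below is a minimal Python sketch for that, assuming the i915 message wording seen in the dmesg output above; the function name `guc_state` is made up for illustration.

```python
import re

# Example line:
# "[    2.171568] i915 0000:03:00.0: [drm] GuC submission enabled"
STATE_RE = re.compile(
    r"i915 (?P<dev>[0-9a-f:.]+): \[drm\] "
    r"(?P<feature>GuC submission|GuC SLPC|GuC RC):? (?P<state>enabled|disabled)"
)


def guc_state(dmesg_text):
    """Map each PCI address to {feature: True/False} based on i915 messages."""
    result = {}
    for m in STATE_RE.finditer(dmesg_text):
        result.setdefault(m["dev"], {})[m["feature"]] = m["state"] == "enabled"
    return result


# Three lines copied from the dmesg output above.
sample = """\
[    2.171568] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.171569] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.171828] i915 0000:03:00.0: [drm] GuC RC: enabled
"""
state = guc_state(sample)
for dev, features in state.items():
    print(dev, features)
```

Feeding it the full `dmesg | grep i915` output gives one entry per GPU.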

eero-t commented 2 years ago

@Xiaogangli-intel even with GuC scheduling explicitly disabled for drm-tip:

# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 i915.force_probe=4905 ro
[    0.026191] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 i915.force_probe=4905 ro
[    2.084816] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.084820] fb0: switching to i915 from EFI VGA
[    2.084891] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.084913] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.084914] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.096936] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.100697] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.170110] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.170113] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.185450] i915 0000:03:00.0: [drm] HuC authenticated
[    2.185451] i915 0000:03:00.0: [drm] GuC submission disabled
[    2.185452] i915 0000:03:00.0: [drm] GuC SLPC disabled
[    2.220340] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0

Media driver fails:

sample_multi_transcode -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
Multi Transcoding Sample Version 8.4.27.0

CONFIGURE LOADER: required implementation: hw 
CONFIGURE LOADER: required implementation mfxAccelerationMode: MFX_ACCEL_MODE_VIA_VAAPI 
libva info: VA-API version 1.14.0
...
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
Session 0:
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_ABORTED(-12), PutBS, Encode: SyncOperation failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2112

[ERROR], sts=MFX_ERR_ABORTED(-12), Transcode, PutBS failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2068

[ERROR], sts=MFX_ERR_ABORTED(-12), Run, CTranscodingPipeline::Run::Transcode() [0x55b1e65abc90] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4677

 session 0 [0x55b1e65abc90] failed with status MFX_ERR_ABORTED shutting down the application...

session [0x55b1e65abc90] m_bForceStop is set

Transcoding finished

eero-t commented 2 years ago

When GuC scheduling is explicitly disabled, there's also a GPU hang:

[   65.151632] i915 0000:03:00.0: [drm] Resetting vcs1 for preemption time out
[   65.151688] i915 0000:03:00.0: [drm] sample_multi_tr[5533] context reset due to GPU hang
[   65.160360] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd, in sample_multi_tr [5533]

See: gpu-hang.txt

media still have some issues on drm-tip KMD for DG1.

Could you give a pointer to more info?

Could you please try this KMD at https://github.com/intel-gpu/intel-gpu-i915-backports, and need to build media driver with ENABLE_PRODUCTION_KMD=ON, also i915.enable_guc=2 in kernel boot parameters.

Sorry, but I'm not interested in the public media-driver on a backport kernel, only in what's going upstream.

eero-t commented 2 years ago

Btw. 1-2 months ago when I was testing internal KMD + UMD versions, I was seeing some instances failing with OneVPL HEVC transcode, when trying to do many parallel transcodes on DG2. I did not debug it further (container instances were changing too fast), but I'm now wondering whether it's related to HEVC issues here with public KMD+UMD versions on DG1. Are there known HEVC issues for DG2 too?

Xiaogangli-intel commented 2 years ago

Hi @eero-t, I noticed the hang issue with HEVC decode. If you'd rather not use the backport kernel, we may have to sync with the KMD team to check the progress of upstreaming the DG1 patches.

eero-t commented 2 years ago

DG1 has been enabled in upstream kernel (not just drm-tip) for a long time: https://github.com/torvalds/linux/blob/master/include/drm/i915_pciids.h#L630

But kernel docs RFC section still mentions several items: https://www.kernel.org/doc/html/latest/gpu/rfc/index.html

I've asked whether they've already landed upstream (in Linus' tree, i.e. whether the docs should have been moved out of the RFC section), not just in the public drm-tip that I was testing (and with which I was seeing the issues).

eero-t commented 2 years ago

According to the kernel side, the status specified in the RFC docs applies to both public upstream and drm-tip. I.e. there are still significant gaps in kernel i915 dGPU support, although GuC scheduling has already been enabled by default.

PS. I just tested latest media driver stack releases, and e.g. FFmpeg still gives 2 FPS with drm-tip (instead of the expected hundreds of FPS). I haven't updated the kernel side though (will probably do that late summer, when 5.19 nears release).

eero-t commented 2 years ago

I tested yesterday's drm-tip 5.19-rc7 (and a few days earlier 5.19-rc6) on DG1, and things have gone downhill. Instead of 2 FPS HEVC decode, there are now lots of failures with FFmpeg / VA-API:

Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
[hevc @ 0x560f76ef5700] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76ef5700] hardware accelerator failed to decode picture
[hevc @ 0x560f76fa7080] Could not find ref with POC 0
[hevc @ 0x560f76fa7080] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fa7080] hardware accelerator failed to decode picture
[hevc @ 0x560f76fb8840] Could not find ref with POC 1
[hevc @ 0x560f76fb8840] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fb8840] hardware accelerator failed to decode picture
[hevc @ 0x560f76fca040] Could not find ref with POC 6
[hevc @ 0x560f76fca040] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fca040] hardware accelerator failed to decode picture
[hevc @ 0x560f76fdb840] Could not find ref with POC 4
[hevc @ 0x560f76fdb840] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fdb840] hardware accelerator failed to decode picture

(No errors in dmesg though.)
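
The FFmpeg log above repeats the same few messages with different decoder context pointers and POC numbers. A short Python sketch like the following can tally them; it is only an illustration, with a made-up function name `tally`, and it assumes the `[codec @ 0x...]` prefix format shown in the log.

```python
import re
from collections import Counter

# "[hevc @ 0x560f76ef5700] message" -> codec name plus message text.
PREFIX_RE = re.compile(r"^\[(\w+) @ 0x[0-9a-f]+\]\s*")
POC_RE = re.compile(r"POC \d+")


def tally(log_text):
    """Count messages per (codec, message), ignoring pointers and POC values."""
    counts = Counter()
    for line in log_text.splitlines():
        m = PREFIX_RE.match(line)
        if not m:
            continue  # not a per-context log line
        message = POC_RE.sub("POC N", line[m.end():])
        counts[(m.group(1), message)] += 1
    return counts


# Lines copied from the FFmpeg / VA-API log above.
sample = """\
[hevc @ 0x560f76ef5700] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76ef5700] hardware accelerator failed to decode picture
[hevc @ 0x560f76fa7080] Could not find ref with POC 0
[hevc @ 0x560f76fa7080] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fa7080] hardware accelerator failed to decode picture
[hevc @ 0x560f76fb8840] Could not find ref with POC 1
"""
counts = tally(sample)
for (codec, message), n in counts.most_common():
    print(n, codec, message)
```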

When using FFmpeg with QSV instead of VA-API, it fails immediately:

Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_qsv))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf59.16.100
  Stream #0:0: Video: h264, nv12(tv, progressive), 1920x1080, q=2-31, 1000 kb/s, 60 fps, 60 tbn
    Metadata:
      encoder         : Lavc59.18.100 h264_qsv
    Side data:
      cpb: bitrate max/min/avg: 0/0/1000000 buffer size: 0 vbv_delay: N/A
[h264_qsv @ 0x561d6d5488c0] Unknown FrameType, set pict_type to AV_PICTURE_TYPE_NONE.
[h264_qsv @ 0x561d6d5488c0] Error during encoding: unknown error (-21)
Video encoding failed
Conversion failed!

However, exactly the same drm-tip kernel, user-space [1] and test-case still work fine on TGL (with perf in hundreds of FPS).

[1] User-space components:

TGL dmesg content:

$ dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 ro
[    0.037729] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 ro
[    2.621257] i915 0000:00:02.0: [drm] VT-d active for gfx access
[    2.621363] i915 0000:00:02.0: vgaarb: deactivate vga console
[    2.621412] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    2.623817] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    2.625315] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[    3.298064] i915 0000:00:02.0: [drm] failed to retrieve link info, disabling eDP
[    3.414350] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 70.1
[    3.414352] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[    3.427536] i915 0000:00:02.0: [drm] HuC authenticated
[    3.427538] i915 0000:00:02.0: [drm] GuC submission disabled
[    3.427538] i915 0000:00:02.0: [drm] GuC SLPC disabled
[    3.504681] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    3.511744] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    3.683072] fbcon: i915drmfb (fb0) is primary device
[    3.780445] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device

DG1 dmesg content:

$ dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    0.025971] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    2.174692] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.174792] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.174819] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.174820] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.189355] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.191620] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.258663] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.258666] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.263872] i915 0000:03:00.0: [drm] HuC authenticated
[    2.264158] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.264160] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.264414] i915 0000:03:00.0: [drm] GuC RC: enabled
[    2.301517] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[    2.302089] i915 0000:0a:00.0: [drm] VT-d active for gfx access
[    2.302121] i915 0000:0a:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.302122] i915 0000:0a:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.321256] i915 0000:0a:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.327096] fbcon: i915drmfb (fb0) is primary device
[    2.373506] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[    2.391770] i915 0000:0a:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.391772] i915 0000:0a:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.398253] i915 0000:0a:00.0: [drm] HuC authenticated
[    2.398673] i915 0000:0a:00.0: [drm] GuC submission enabled
[    2.398674] i915 0000:0a:00.0: [drm] GuC SLPC enabled
[    2.398985] i915 0000:0a:00.0: [drm] GuC RC: enabled
[    2.411403] [drm] Initialized i915 1.6.0 20201103 for 0000:0a:00.0 on minor 1
[    2.414748] i915 0000:0a:00.0: [drm] Cannot find any crtc or sizes
[    2.415142] i915 0000:0a:00.0: [drm] Cannot find any crtc or sizes

I.e. the main differences are there being 2x DG1 devices, with GuC scheduling being enabled (by default), and THP being enabled only on TGL for some reason, although both have VT-d active.
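
The comparison above was done by reading the two logs side by side. A rough Python sketch of the same idea: strip the timestamps and PCI addresses, then take set differences of the remaining messages. The function name and the `<pci>` placeholder are illustrative, and the samples are only a few lines from each log.

```python
import re

TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")
PCI_RE = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]\b")


def normalize(dmesg_text):
    """Return the set of messages with timestamps and PCI addresses removed."""
    messages = set()
    for line in dmesg_text.splitlines():
        line = TIMESTAMP_RE.sub("", line)
        line = PCI_RE.sub("<pci>", line)
        if line:
            messages.add(line)
    return messages


# A few lines copied from the TGL and DG1 dmesg outputs above.
tgl = """\
[    2.621412] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    3.427536] i915 0000:00:02.0: [drm] HuC authenticated
[    3.427538] i915 0000:00:02.0: [drm] GuC submission disabled
"""
dg1 = """\
[    2.263872] i915 0000:03:00.0: [drm] HuC authenticated
[    2.264158] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.264414] i915 0000:03:00.0: [drm] GuC RC: enabled
"""
only_tgl = normalize(tgl) - normalize(dg1)
only_dg1 = normalize(dg1) - normalize(tgl)
print("only on TGL:", sorted(only_tgl))
print("only on DG1:", sorted(only_dg1))
```

Lines that embed platform-specific firmware file names (for example `tgl_guc_70.1.1.bin` versus `dg1_guc_70.1.1.bin`) will always show up as differences with this approach.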

eero-t commented 2 years ago

Tested media stack components built from latest release tags (Ubuntu 22.04 based container):

And both the FFmpeg VA-API and OneVPL decoding failures are still there, both with slightly older drm-tip v6.0-rc3 kernel, and v6.0-rc5 from yesterday.

OneVPL / MFX error message has changed to match what FFmpeg / VA-API was reporting:

Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1933

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Run, CTranscodingPipeline::Run::Transcode() [0x556090093fa0] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4868

 session 0 [0x556090093fa0] failed with status MFX_ERR_DEVICE_FAILED shutting down the application...

eero-t commented 2 years ago

drm-tip kernel dmesg shows this on startup, but I guess this is related just to error reporting, not media:

[    2.477122] i915 0000:0a:00.0: [drm] *ERROR* Zero GuC log crash dump size!
[    2.477124] i915 0000:0a:00.0: [drm] *ERROR* Zero GuC log debug size!
[    2.478087] i915 0000:0a:00.0: [drm] GuC error state capture buffer maybe too small: 2097152 < 2360316 (min = 786772)
[    2.482243] i915 0000:0a:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1.1

eero-t commented 2 years ago

Things still fail with latest "drm-tip" (6.0-rc6) from today, and latest media-driver release:

EDIT: VAAPI init failure was due to kernel FW loading issue: https://gitlab.freedesktop.org/drm/intel/-/issues/6895

The error is now:

[ERROR], sts=MFX_ERR_NULL_PTR(-2), Init, m_fSource pointer is NULL at /home/nobody/source/oneVPL/tools/legacy/sample_common/src/sample_utils.cpp:682

[ERROR], sts=MFX_ERR_NULL_PTR(-2), Init, reader->Init failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:528

[ERROR], sts=MFX_ERR_NULL_PTR(-2), main, transcode.Init failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:1565

eero-t commented 2 years ago

Any update on this? I've got a DG1 80EU and it fails decoding any video with VAAPI/QSV through ffmpeg cli. But everything works just fine on Windows.

eero-t commented 2 years ago

Things still fail with latest "drm-tip" (6.0-rc7) from yesterday, with a matching FW (GuC: 70.5.1, HuC: 7.9.3), and latest media stack releases:

Output from OneVPL tool:

$ sample_multi_transcode -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
...
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions
Transcoding started

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1944

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Run, CTranscodingPipeline::Run::Transcode() [0x5555799dfdf0] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4904

 session 0 [0x5555799dfdf0] failed with status MFX_ERR_DEVICE_FAILED shutting down the application...
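
The sample tool's `[ERROR]` lines have a regular shape, so the status, function and source location can be extracted with a small parser. The following Python sketch is illustrative only (the names `ERR_RE` and `parse_errors` are made up) and assumes the line format seen in the logs in this thread. In these logs the first `[ERROR]` line comes from the innermost failing call, so the first parsed entry is usually the most informative one.

```python
import re

ERR_RE = re.compile(
    r"\[ERROR\], sts=(?P<status>MFX_\w+)\((?P<code>-?\d+)\), "
    r"(?P<func>\w+), (?P<msg>.*?) at (?P<file>\S+):(?P<line>\d+)$"
)


def parse_errors(log_text):
    """Return one dict per [ERROR] line, in the order they were logged."""
    errors = []
    for raw in log_text.splitlines():
        m = ERR_RE.search(raw.strip())
        if m:
            entry = m.groupdict()
            entry["code"] = int(entry["code"])
            entry["line"] = int(entry["line"])
            errors.append(entry)
    return errors


# Lines copied from the sample_multi_transcode output above.
sample = """\
[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1944

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Run, CTranscodingPipeline::Run::Transcode() [0x5555799dfdf0] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4904
"""
errors = parse_errors(sample)
first = errors[0]
print(first["status"], first["code"], first["func"], first["msg"])
```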

Output from FFmpeg / VA-API:

$ ffmpeg -y -an -loglevel verbose -hwaccel vaapi -hwaccel_output_format vaapi -i /media/GTAV_1920x1080_60_yuv420p.h265 -c:v h264_vaapi -f null
...
Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), 1 reference frame, yuv420p(tv, left), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
[hevc @ 0x56401e78b500] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e78b500] hardware accelerator failed to decode picture
[hevc @ 0x56401e82a180] Could not find ref with POC 0
[hevc @ 0x56401e82a180] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e82a180] hardware accelerator failed to decode picture
[hevc @ 0x56401e7f15c0] Could not find ref with POC 1
[hevc @ 0x56401e7f15c0] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e7f15c0] hardware accelerator failed to decode picture
[hevc @ 0x56401e802e40] Could not find ref with POC 6
...

OneVPL does not show anything in dmesg, but FFmpeg does show GPU hangs:

[10735.331980] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
[10735.331984] i915 0000:0a:00.0: [drm] ffmpeg[19118] context reset due to GPU hang
[10741.805987] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
[10741.805992] i915 0000:0a:00.0: [drm] ffmpeg[19118] context reset due to GPU hang

E.g. running simple OpenCL program with latest public compute stack releases does not show any problems.
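
With two DG1 cards per node it is worth checking which device each hang was reported on. Here is a minimal Python sketch that extracts the device, ecode, process and PID from `GPU HANG` lines and counts hangs per device. The names are hypothetical, the ecode is kept as an opaque string, and the sample combines the hang lines quoted in this thread.

```python
import re
from collections import Counter

HANG_RE = re.compile(
    r"i915 (?P<dev>[0-9a-f:.]+): \[drm\] GPU HANG: ecode (?P<ecode>[0-9a-f:]+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def parse_hangs(dmesg_text):
    """Return one dict per 'GPU HANG' line found in the dmesg text."""
    return [m.groupdict() for m in HANG_RE.finditer(dmesg_text)]


# Hang lines copied from the dmesg excerpts in this thread.
sample = """\
[   65.160360] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd, in sample_multi_tr [5533]
[10735.331980] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
[10735.331984] i915 0000:0a:00.0: [drm] ffmpeg[19118] context reset due to GPU hang
[10741.805987] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
"""
hangs = parse_hangs(sample)
per_device = Counter(h["dev"] for h in hangs)
for dev, n in per_device.items():
    print(dev, n)
```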

XinfengZhang commented 2 years ago

could you help to have a try with #1500

eero-t commented 2 years ago

could you help to have a try with #1500

Sure, but I'd like to see it first pass at least one of the CI tests... Currently they all fail for it?

eero-t commented 2 years ago

could you help to have a try with #1500

Sure, but I'd like to see it first pass at least one of the CI tests... Currently they all fail for it?

Tried it anyway. "Disable object capture for recoverable context" commit did not help, things fail like before.

gizahNL commented 2 years ago

Coming here after my issue report on onevpl-intel-gpu:

latest media-driver is completely unusable for me (I'm only interested in encoding). AVC and HEVC encoding both fail when using sample_encode program.

Up till commit 60001c60a1f13e23cb49278ea757a77c8d743674 (bisected) HEVC encoding works.

gizahNL commented 2 years ago

could you help to have a try with #1500

This does fix encoding both AVC and HEVC for me.

eero-t commented 2 years ago

could you help to have a try with #1500

This does fix encoding both AVC and HEVC for me.

@gizahNL You do not have issue with HEVC decode?

(I'm wondering whether my issue is specific to my HEVC file, or more generic.)

gizahNL commented 2 years ago

could you help to have a try with #1500

This does fix encoding both AVC and HEVC for me.

@gizahNL You do not have issue with HEVC decode?

(I'm wondering whether my issue is specific to my HEVC file, or more generic.)

No, decoding HEVC fails for me as well:

(encoded with sample_encode)

ffprobe ./out.h265           
ffprobe version 5.1.1-1ubuntu1 Copyright (c) 2007-2022 the FFmpeg developers
  built with gcc 12 (Ubuntu 12.2.0-1ubuntu1)
  configuration: --prefix=/usr --extra-version=1ubuntu1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-shared
  libavutil      57. 28.100 / 57. 28.100
  libavcodec     59. 37.100 / 59. 37.100
  libavformat    59. 27.100 / 59. 27.100
  libavdevice    59.  7.100 / 59.  7.100
  libavfilter     8. 44.100 /  8. 44.100
  libswscale      6.  7.100 /  6.  7.100
  libswresample   4.  7.100 /  4.  7.100
  libpostproc    56.  6.100 / 56.  6.100
Input #0, hevc, from './out.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv), 1920x1080 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 1200k tbn

sample_decode h265 -in ./out.h265 -o test.yuv
CONFIGURE LOADER: required implementation: hw 
CONFIGURE LOADER: required implementation mfxAccelerationMode: MFX_ACCEL_MODE_VIA_VAAPI 
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_15
libva info: va_openDriver() returns 0
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_15
libva info: va_openDriver() returns 0
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/lib/x86_64-linux-gnu/libmfx-gen.so.1.2.7 
   1: /usr/lib/x86_64-linux-gnu/libmfxhw64.so.1.35 

libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_15
libva info: va_openDriver() returns 0
Decoding Sample Version 8.4.27.0

Input video HEVC
Output format   NV12
Input:
  Resolution    1920x1088
  Crop X,Y,W,H  0,0,1920,1080
Output:
  Resolution    1920x1080
Frame rate  30.00
Memory type     system
MediaSDK impl       hw
MediaSDK version    2.7

Decoding started

[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, SyncOperation fail or timeout at ./tools/legacy/sample_decode/src/pipeline_decode.cpp:1666
Frame number:    0, fps: 0.000, fread_fps: 0.000, fwrite_fps: 0.000
[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, Unexpected error!! at ./tools/legacy/sample_decode/src/pipeline_decode.cpp:1922

[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, Pipeline.RunDecoding failed at ./tools/legacy/sample_decode/src/sample_decode.cpp:861

eero-t commented 1 year ago

Still an issue with the drm-tip v6.0 kernel and the following media stack from the end of November:

FFmpeg transcode runs at <1/100th of the expected speed, and the oneVPL transcode tool fails:

*** session 0 [0x561608b7d8e0] FAILED (MFX_ERR_GPU_HANG) 3.84671 sec, 7 frames, 1.820 fps
-i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null -async 4 

PS. MFX_ERR_GPU_HANG error is odd, because dmesg does not show any i915 GPU hang...

eero-t commented 1 year ago

Tested the drm-tip "v6.2-rc4" kernel (yesterday's Git) and the following media stack (latest releases as of today):

As dGPU support (without force-probing) was enabled in it (already in "v6.2-rc1"), I also tested AVC & HEVC decode + encode on DG2. Everything worked fine there, after enabling a few extra things: https://gitlab.freedesktop.org/drm/intel/-/issues/7732

However, I could not get that same kernel binary to boot on the machine where DG1 is installed.

When I had earlier tested another build of the same kernel version on DG1, both AVC and HEVC decoding failed. However, that kernel did not have those extra options enabled, so I'm not sure whether that counts.

I did try slightly older drm-tip kernel on DG1 though, v6.1.0 one, with above listed media stack (and older one from 2022 May), and AVC decoding fails with that too, same as HEVC decoding:

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1944

I'll try later to get the new kernel config (with I915_PXP & MEI_GSC options) to boot also on DG1 machine, to get a setup comparable to the working DG2 setup.

eero-t commented 1 year ago

I did try slightly older drm-tip kernel on DG1 though, v6.1.0 one, with above listed media stack (and older one from 2022 May), and AVC decoding fails with that too, same as HEVC decoding

The v6.1 drm-tip kernel says in dmesg that it did load HuC. Unlike v6.2-rc4, it does not complain about missing MEI modules, although it does not have the DRM_I915_PXP and INTEL_MEI_PXP/_GSC options enabled.
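For comparing kernel builds, the options named above can be looked up in a kernel config file. The following is only a sketch: it assumes the usual `CONFIG_` prefix for the option names, and the `/boot/config-<release>` path is an assumption that varies between distributions. `kconfig_state` and `OPTIONS` are names made up for this illustration.

```python
import platform
import re

# Option names from the comment above, with the usual CONFIG_ prefix (assumed).
OPTIONS = ("CONFIG_DRM_I915_PXP", "CONFIG_INTEL_MEI_PXP", "CONFIG_INTEL_MEI_GSC")


def kconfig_state(config_text, options=OPTIONS):
    """Map each option to 'y', 'm', or None (not set or absent)."""
    state = {name: None for name in options}
    for line in config_text.splitlines():
        # "# CONFIG_FOO is not set" lines start with '#' and do not match.
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if m and m.group(1) in state and m.group(2) in ("y", "m"):
            state[m.group(1)] = m.group(2)
    return state


if __name__ == "__main__":
    # Assumed location of the running kernel's config; not present everywhere.
    path = f"/boot/config-{platform.release()}"
    try:
        with open(path) as f:
            for name, value in kconfig_state(f.read()).items():
                print(f"{name}: {value or 'not set'}")
    except OSError as err:
        print(f"could not read {path}: {err}")
```

Running this on both the DG1 and the DG2 machine would show whether the two kernels really differ in these options.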

@dvrogozh Is it still possible that HuC does not work properly without kernel MEI modules?

(I don't have an easy way to build older drm-tip commits with new configs, as its commit IDs change constantly due to it being rebased on upstream.)

dvrogozh commented 1 year ago

@dvrogozh Is it still possible that HuC does not work properly without kernel MEI modules?

I have not worked with the upstream kernel for a while, but my assumption is "yes". With the 3 DKMS modules solution for ATS-M (i915, cse, vsec), i.e. following for example https://github.com/intel/media-delivery/blob/master/doc/intel-gpu-dkms.rst, we look for 2 messages to make sure that HuC works properly: one that it's loaded, another that it's authenticated:

$ sudo dmesg | grep drm
[   18.909790] [drm] I915 BACKPORTED INIT
[   18.916368] i915 0000:4d:00.0: [drm] GT count: 1, enabled: 1
[   18.950174] i915 0000:4d:00.0: [drm] Bumping pre-emption timeout from 640 to 7500 on rcs'0.0 to allow slow compute pre-emption
[   18.963577] i915 0000:4d:00.0: [drm] Bumping pre-emption timeout from 640 to 7500 on ccs'0.0 to allow slow compute pre-emption
[   18.976534] i915 0000:4d:00.0: [drm] Bumping pre-emption timeout from 640 to 7500 on ccs'1.0 to allow slow compute pre-emption
[   18.976541] i915 0000:4d:00.0: [drm] Bumping pre-emption timeout from 640 to 7500 on ccs'2.0 to allow slow compute pre-emption
[   18.976545] i915 0000:4d:00.0: [drm] Bumping pre-emption timeout from 640 to 7500 on ccs'3.0 to allow slow compute pre-emption
[   19.007432] i915 0000:4d:00.0: [drm] Using Transparent Hugepages
[   19.027912] i915 0000:4d:00.0: [drm] Local memory available: 0x000000037a800000
[   19.069893] i915 0000:4d:00.0: [drm] GuC error state capture buffer maybe too small: 2097152 < 3737592 (min = 1245864)
[   19.087391] i915 0000:4d:00.0: [drm] GuC firmware i915/dg2_guc_70.4.1.bin version 70.4
[   19.104549] i915 0000:4d:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10
[   19.131153] i915 0000:4d:00.0: [drm] GuC submission enabled
[   19.137489] i915 0000:4d:00.0: [drm] GuC SLPC enabled
[   19.151998] i915 0000:4d:00.0: [drm] GuC RC: enabled
[   19.189177] [drm] Initialized i915 1.6.0 20201103 for 0000:4d:00.0 on minor 1
[   19.997817] i915 0000:4d:00.0: [drm] HuC authenticated

eero-t commented 1 year ago

That authentication message is there with drm-tip v6.1 kernel:

# dmesg | grep HuC
[    2.349771] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc.bin version 7.9.3
[    2.355243] i915 0000:03:00.0: [drm] HuC authenticated
[    2.386533] i915 0000:0a:00.0: [drm] HuC firmware i915/dg1_huc.bin version 7.9.3
[    2.392380] i915 0000:0a:00.0: [drm] HuC authenticated

So unfortunately that does not explain the problem. Maybe MEI modules are really needed only with v6.2...
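The two-message check discussed above (HuC firmware loaded, HuC authenticated) can be automated per PCI device. This is a rough sketch that assumes the message wording shown in the dmesg excerpts in this thread; other kernel versions may word them differently. `huc_status` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns based on the dmesg lines quoted in this thread (wording may vary by kernel).
HUC_LOADED = re.compile(r"i915 (\S+): \[drm\] HuC firmware (\S+)")
HUC_AUTH = re.compile(r"i915 (\S+): \[drm\] HuC authenticated")


def huc_status(dmesg_text):
    """Return {pci_address: {"firmware": str or None, "authenticated": bool}}."""
    status = defaultdict(lambda: {"firmware": None, "authenticated": False})
    for line in dmesg_text.splitlines():
        m = HUC_LOADED.search(line)
        if m:
            status[m.group(1)]["firmware"] = m.group(2)
            continue
        m = HUC_AUTH.search(line)
        if m:
            status[m.group(1)]["authenticated"] = True
    return dict(status)


if __name__ == "__main__":
    # Sample lines copied from the dmesg output above.
    sample = (
        "[    2.349771] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc.bin version 7.9.3\n"
        "[    2.355243] i915 0000:03:00.0: [drm] HuC authenticated\n"
    )
    for device, info in huc_status(sample).items():
        print(device, info)
```

A device that reports a firmware path but never shows the authentication message would be the case worth investigating. Here both DG1 devices pass the check, which matches the conclusion that HuC is not the explanation.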

gizahNL commented 1 year ago

However, I could not get that same kernel binary to boot on the machine where DG1 is installed.

I remember looking at the code that removed the force probing requirement and noticing that this was only done for DG2, not DG1

eero-t commented 1 year ago

Good to know, thanks!

I'm still using "i915.force_probe=4905" on that CML-S machine, but maybe there's a reason why DG1 is not (yet) enabled by default in (drm-tip) v6.2-rc...
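Whether a given boot actually had the force-probe option set can be checked from the kernel command line. The sketch below is a simplified reading of the `i915.force_probe=` parameter: it only handles a comma-separated ID list and `*`, compares IDs case-insensitively (an assumption of this sketch, not a statement about the kernel's own matching), and lets the last occurrence win. Both function names are hypothetical.

```python
def force_probe_ids(cmdline):
    """Return the entries passed via i915.force_probe= on a kernel command line."""
    ids = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == "i915.force_probe":
            # Last occurrence wins in this sketch.
            ids = [v for v in value.split(",") if v]
    return ids


def is_force_probed(cmdline, device_id):
    """True if device_id (hex string such as '4905') is listed, or '*' is given."""
    wanted = device_id.lower()
    return any(e == "*" or e.lower() == wanted for e in force_probe_ids(cmdline))


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError as err:
        print(f"could not read /proc/cmdline: {err}")
    else:
        # 4905 is the DG1 device ID mentioned in this thread.
        print("DG1 force-probed:", is_force_probed(cmdline, "4905"))
```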

(I actually got it to boot first time, but accidentally rebooted it before being able to test it, and it did not come up with that kernel any more on further reboots, only with drm-tip v6.1, and Ubuntu OEM kernel + i915 DKMS. :-/)

nyanmisaka commented 1 year ago

DG1 has lost the availability of all Media features in upstream kernels. Not just HEVC decoding.

DG1 Xe MAX 8086:4905
Linux 5.15, 6.0, 6.1, 6.2-rc6, drm-tip
media-driver 23.1.0 ENABLE_PRODUCTION_KMD=OFF

Tested with FFmpeg:

DEC:
  h264_vaapi: Failed to end picture decode issue: 23 (internal decoding error)
  hevc_vaapi: Failed to end picture decode issue: 23 (internal decoding error)
  av1_vaapi: Segfault, Failed to end picture decode issue: 23 (internal decoding error)

ENC:
  h264_vaapi: < 2 fps
  hevc_vaapi: < 2 fps

VPP:
  scale_vaapi: Failed

COPY:
  hwupload: very slow

Also confirmed with intel-gpu-i915-backports + ENABLE_PRODUCTION_KMD=ON everything works fine.

Now that DG2 is already available upstream, can someone at Intel give an update on the progress of upstreaming the DG1 patches?

Or provide a list of patches instead of just the squashed commits in backports repo? It's really hard to tell which lines fixed the media issues on DG1.

gizahNL commented 1 year ago

We have DG1 Encoding working on 6.0 with intel-onevpl 22.6.1 and mediadriver 22.6.0 with https://github.com/intel/media-driver/pull/1500 applied as patch. Haven't tested with later revisions yet though.

nyanmisaka commented 1 year ago

We have DG1 Encoding working on 6.0 with intel-onevpl 22.6.1 and mediadriver 22.6.0 with #1500 applied as patch. Haven't tested with later revisions yet though.

The commit https://github.com/intel/media-driver/commit/f8812f26a35755714d6386df969b0e99fbab56ef is already in 23.1.0.

The issue for DG1 is not in the UMD but in the upstream KMD.

gizahNL commented 1 year ago

We have DG1 Encoding working on 6.0 with intel-onevpl 22.6.1 and mediadriver 22.6.0 with #1500 applied as patch. Haven't tested with later revisions yet though.

The commit f8812f2 is already in 23.1.0.

The issue for DG1 is not in the UMD but in the upstream KMD.

Encoding was definitely working for me with upstream 6.0 branch, not sure what revision. We're now on 6.0.19 but I haven't used the DG1 in a while in our testing.

nyanmisaka commented 1 year ago

Encoding was definitely working for me with upstream 6.0 branch, not sure what revision. We're now on 6.0.19 but I haven't used the DG1 in a while in our testing.

Even if encoding was working with some media-driver + public kernel combinations, it's not enough for our use case. We need the complete transcoding pipeline from decoding, VPP to encoding.

eero-t commented 1 year ago

DG1 has lost the availability of all Media features in upstream kernels. Not just HEVC decoding.

DG1 Xe MAX 8086:4905 Linux 5.15, 6.0, 6.1, 6.2-rc6, drm-tip media-driver 23.1.0 ENABLE_PRODUCTION_KMD=OFF ... Also confirmed with intel-gpu-i915-backports + ENABLE_PRODUCTION_KMD=ON everything works fine.

I just tried mix of above i.e. latest media stack + ENABLE_PRODUCTION_KMD=OFF, with latest backport kernel DKMS modules for Ubuntu 22.04 (=jammy/arc).

With that setup, both the latest sample_multi_transcode and FFmpeg (v6.0) die with "double free or corruption (!prev)" when trying to transcode AVC video on DG1.

strace shows libc detecting memory corruption after an ioctl fails and the driver destroys the GPU context:

...
ioctl(3, DRM_IOCTL_I915_GEM_MMAP_OFFSET, 0x7fff3ff83970) = -1 EINVAL (Invalid argument)
ioctl(3, DRM_IOCTL_I915_GEM_MADVISE, 0x7fff3ff83aac) = 0
ioctl(3, DRM_IOCTL_I915_GEM_VM_DESTROY, 0x564d1fd65160) = 0
ioctl(3, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff3ff83d00) = 0
ioctl(3, DRM_IOCTL_I915_GEM_VM_DESTROY, 0x564d1fd67950) = 0
ioctl(3, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff3ff83dc0) = 0
writev(2, [{iov_base="double free or corruption (!prev"..., iov_len=33}, {iov_base="\n", iov_len=1}], 2double free or corruption (!prev)
) = 34
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa453564000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 9
getpid()                                = 9
tgkill(9, 9, SIGABRT)                   = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=9, si_uid=65534} ---
+++ killed by SIGABRT +++
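To locate the failing call quickly in a longer trace, the strace output can be filtered for ioctls that returned -1. This is a minimal sketch that assumes the line format shown above (other strace options, such as timestamps, change the layout). `failed_ioctls` is a hypothetical helper name.

```python
import re

# Matches strace lines such as:
#   ioctl(3, DRM_IOCTL_I915_GEM_MMAP_OFFSET, 0x7fff3ff83970) = -1 EINVAL (Invalid argument)
FAILED_IOCTL = re.compile(r"\bioctl\((\d+), (\w+), [^)]*\)\s+= -1 (\w+)")


def failed_ioctls(strace_text):
    """Return (fd, request, errno_name) for each ioctl that returned -1."""
    results = []
    for line in strace_text.splitlines():
        m = FAILED_IOCTL.search(line)
        if m:
            results.append((int(m.group(1)), m.group(2), m.group(3)))
    return results


if __name__ == "__main__":
    # Sample lines copied from the trace above.
    sample = (
        "ioctl(3, DRM_IOCTL_I915_GEM_MMAP_OFFSET, 0x7fff3ff83970) = -1 EINVAL (Invalid argument)\n"
        "ioctl(3, DRM_IOCTL_I915_GEM_MADVISE, 0x7fff3ff83aac) = 0\n"
    )
    for fd, request, err in failed_ioctls(sample):
        print(f"fd {fd}: {request} failed with {err}")
```

In the trace above, the only failing request is DRM_IOCTL_I915_GEM_MMAP_OFFSET, right before the contexts are destroyed.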

Why can't media-driver check for and use the available APIs, like the Intel compute driver does?

PS. That same thing (backport kernel + non-production media driver combo) seems to work fine on DG2.

nyanmisaka commented 1 year ago

Currently DG1 works only when pairing media-driver(PROD_KMD=ON) with Intel DKMS. Any other combinations will break the transcoding pipeline.

The release note says DG1/SG1 is supported but it seems they only verified on the internal DKMS and ignored the upstream kernel.

Hyy2001X commented 1 year ago

Currently DG1 works only when pairing media-driver(PROD_KMD=ON) with Intel DKMS. Any other combinations will break the transcoding pipeline.

The release note says DG1/SG1 is supported but it seems they only verified on the internal DKMS and ignored the upstream kernel.

Is there any progress on transcoding with DG1 on Linux? Thanks.

nyanmisaka commented 1 year ago

Nothing has changed since then. DG1 is not supported in Linux mainline, so the DKMS is still required.

eero-t commented 1 year ago

Did a quick check with the latest media driver release on top of the latest i915 DKMS kernel driver installed from the Intel package repositories (using the Ubuntu 22.04 Arc repo): https://dgpu-docs.intel.com/driver/client/overview.html

When the driver is compiled with ENABLE_PRODUCTION_KMD=OFF, it does not recognize DG1, and crashes with a double free before exit.

When driver is compiled with ENABLE_PRODUCTION_KMD=ON, transcoding succeeds fine with the public Intel DKMS kernel driver:

=> updated title.

PS. If one is already using Intel driver repo, it normally makes sense to install packages from there. I just want to make sure that my own driver builds work also, so that I'm not relying on Intel package repo updates & versions (regardless of whether I'm building kernel, or user space).

eero-t commented 1 year ago

DG1 has lost the availability of all Media features in upstream kernels. Not just HEVC decoding.

@nyanmisaka You might want to file a separate bug about that (and refer to this one), as it's a regression. Whereas HEVC has AFAIK never worked properly on DG1 with the upstream kernel.

nyanmisaka commented 1 year ago

We ship the self-built media-driver and VPL with our software so we can’t use the DKMS and PRODUCTION_KMD=ON.

as it's a regression.

Correct. From what I read in this thread and in compute-runtime, it seems that DG1 did work with the drm-tip kernel when mainline was at 5.15. But it's too hard to bisect drm-tip since it gets updated almost every day.

eero-t commented 1 year ago

Tested latest Git build of Mesa 3D driver on DG1 & DG2 (Arc). It aborted Weston & Xwayland with backport DKMS and worked only with upstream kernel (e.g. v6.3 drm-tip). I did not see any Mesa config option for enabling support for backport DKMS (like the media driver option for "production KMD").

=> Media driver working just with backport DKMS ties one also to whatever Mesa version is in the same repository?

PS. On quick testing public compute-runtime built from latest git tag worked with both KMD versions on DG1.

Hyy2001X commented 1 year ago

Nothing has changed since then. DG1 is not supported in Linux mainline, so the DKMS is still required.

Have you successfully passed through DG1 to Windows or other systems on platforms such as PVE or ESXi? I think Windows has better support for transcoding, but I have tried several times and failed. Thanks😊

nyanmisaka commented 1 year ago

I tried DG1 on Windows 10 host last year, where it worked fine.

Sherry-Lin commented 1 year ago

Tested latest Git build of Mesa 3D driver on DG1 & DG2 (Arc). It aborted Weston & Xwayland with backport DKMS and worked only with upstream kernel (e.g. v6.3 drm-tip). I did not see any Mesa config option for enabling support for backport DKMS (like the media driver option for "production KMD").

=> Media driver working just with backport DKMS ties one also to whatever Mesa version is in the same repository?

PS. On quick testing public compute-runtime built from latest git tag worked with both KMD versions on DG1.

@eero-t Are you using the Mesa from https://github.com/intel-gpu/Mesa/tags, or is it upstream Mesa?