intel / vpl-gpu-rt

[Bug]: GPU_HUNG when both encoder and decoder #288

Open DaveHu-TVU opened 1 year ago

DaveHu-TVU commented 1 year ago

Which component impacted?

Decode, Encode

Is it regression? Good in old configuration?

Yes, it's good in old version

What happened?

CPU: 12th Gen Intel(R) Core(TM) i7-12700

Kernel: Linux tvu-desktop 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

VPL: 2023Q1 (https://github.com/oneapi-src/oneVPL-intel-gpu/releases/tag/intel-onevpl-23.1.5)

Reproduction steps:

Console 1: /opt/intel/media/share/vpl/samples/_bin/sample_decode h265 -i v3_1080i5994.h265 -o /dev/null -timeout 10000

Console 2: /opt/intel/media/share/vpl/samples/_bin/sample_encode h264 -i cnn.yuv -o /dev/null -w 1920 -h 1080 -timeout 10000 -nv12

[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:178

[ERROR], sts=MFX_ERR_GPU_HANG(-21), GetFreeTask, m_TaskPool.SynchronizeFirstTask failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:2239

[ERROR], sts=MFX_ERR_GPU_HANG(-21), Run, m_pmfxENC->EncodeFrameAsync failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:2487

[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, pPipeline->Run failed at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/Sample_encode.cpp:1970
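
The two-console reproduction can also be driven from a single script. The following is a minimal sketch, assuming the sample binaries and the input files named in the commands above exist on the test machine. `run_concurrently` and `hang_reported` are hypothetical helper names and are not part of the VPL samples.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Commands copied from the reproduction steps; the paths and input files
# are assumed to exist on the machine under test.
SAMPLES = "/opt/intel/media/share/vpl/samples/_bin"
DECODE_CMD = [f"{SAMPLES}/sample_decode", "h265", "-i", "v3_1080i5994.h265",
              "-o", "/dev/null", "-timeout", "10000"]
ENCODE_CMD = [f"{SAMPLES}/sample_encode", "h264", "-i", "cnn.yuv",
              "-o", "/dev/null", "-w", "1920", "-h", "1080",
              "-timeout", "10000", "-nv12"]


def run_one(cmd):
    """Run one command and return (returncode, combined stdout/stderr)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def run_concurrently(commands):
    """Run all commands at the same time, one worker thread per command."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_one, commands))


def hang_reported(output):
    """True if the output contains the GPU hang status seen in this issue."""
    return "MFX_ERR_GPU_HANG" in output
```

Calling `run_concurrently([DECODE_CMD, ENCODE_CMD])` and applying `hang_reported` to each returned output would show whether either process hit the hang in that run.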

What's the usage scenario when you are seeing the problem?

Immersive Media

What impacted?

After testing, we found that decoding H.264/H.265 streams encoded by Intel MSDK or VPL while encoding at the same time works. Decoding H.265 streams encoded by our other platform (Amba H2) while encoding at the same time easily triggers a GPU hang.

v3_1080i5994.zip

Debug Information

image

Do you want to contribute a patch to fix the issue?

Yes, I'm glad to submit a patch to fix it

nyanmisaka commented 1 year ago

Kernel version 5.15 is too old for 12th Gen. Install the latest linux-firmware and update kernel to 6.1 and try again.

https://github.com/intel/media-driver#known-issues-and-limitations

DaveHu-TVU commented 1 year ago

I updated the kernel to 6.1.0-1015-oem and linux-firmware to 20220329 and got the same result.

image

DaveHu-TVU commented 1 year ago

I tried another CPU (12th Gen Intel(R) Core(TM) i9-12900H) on Ubuntu 22.04 with kernel 6.1.0-1015-oem and the latest linux-firmware.

cmd: /opt/intel/media/share/vpl/samples/_bin/sample_decode h265 -i v3_1080i5994.h265 -o /dev/null -timeout 10000

Decoding only the H.265 file encoded on the Amba H2 platform (v3_1080i5994.h265), with no encode running at the same time, fails with MFX_ERR_DEVICE_FAILED(-17). Please see the log mfxlib_Pid2426_Tid140450659272512.log

Decoding started
Frame number: 2560, fps: 126.187, fread_fps: 0.000, fwrite_fps: 0.0000
[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), RunDecoding, DecodeFrameAsync returned error status at /opt/src/vpl-dispatcher_src/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
Frame number: 2561, fps: 126.234, fread_fps: 0.000, fwrite_fps: 0.000
[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), RunDecoding, Unexpected error!! at /opt/src/vpl-dispatcher_src/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
...
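
Because the sample tools interleave progress output with error lines, it can help to pull the status entries out of a captured log automatically. The sketch below assumes the `[ERROR], sts=NAME(code), function, message at file:line` layout seen in the logs in this thread. `MfxError` and `parse_errors` are hypothetical names, not part of any VPL tool.

```python
import re
from typing import List, NamedTuple


class MfxError(NamedTuple):
    status: str    # e.g. "MFX_ERR_DEVICE_FAILED"
    code: int      # e.g. -17
    function: str  # function that reported the error
    message: str   # free-text description
    location: str  # source file and line, "path:line"


# Pattern for one error entry; it also matches when several entries
# share a single physical line of the log.
ERROR_RE = re.compile(
    r"\[ERROR\], sts=(?P<status>\w+)\((?P<code>-?\d+)\), "
    r"(?P<function>\w+), (?P<message>.*?) at (?P<location>\S+:\d+)"
)


def parse_errors(log_text: str) -> List[MfxError]:
    """Return every error entry found in the log text, in order."""
    return [
        MfxError(m["status"], int(m["code"]), m["function"],
                 m["message"], m["location"])
        for m in ERROR_RE.finditer(log_text)
    ]
```

Grouping the parsed entries by `status` makes it easier to tell runs that end in MFX_ERR_DEVICE_FAILED from runs that end in MFX_ERR_GPU_HANG.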

By contrast, an H.265 file encoded with Intel MSDK (v2_500k_1080i5994.h265) decodes normally. Please compare how these two files differ for decoding.

v3_1080i5994.zip

v2_500k_1080i5994.zip
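
One way to start comparing the two streams is to list the NAL unit types each file contains. The sketch below is a starting point only, assuming both files are raw Annex B byte streams (start-code delimited). It does not parse parameter sets, and `iter_nal_types` and `nal_histogram` are hypothetical helper names.

```python
from collections import Counter

# A few HEVC nal_unit_type values; other values are reported by number.
NAL_NAMES = {
    19: "IDR_W_RADL", 20: "IDR_N_LP", 21: "CRA_NUT",
    32: "VPS", 33: "SPS", 34: "PPS", 35: "AUD",
    39: "PREFIX_SEI", 40: "SUFFIX_SEI",
}

START_CODE = b"\x00\x00\x01"  # 4-byte start codes end with these 3 bytes


def iter_nal_types(data: bytes):
    """Yield nal_unit_type for each NAL unit in an Annex B byte stream."""
    pos = data.find(START_CODE)
    while pos != -1 and pos + 3 < len(data):
        header = data[pos + 3]
        # First header byte: 1 forbidden bit, 6 type bits, 1 layer-id bit.
        yield (header >> 1) & 0x3F
        pos = data.find(START_CODE, pos + 3)


def nal_histogram(data: bytes) -> Counter:
    """Count NAL units by name, falling back to 'type_N' for unnamed types."""
    return Counter(NAL_NAMES.get(t, f"type_{t}") for t in iter_nal_types(data))
```

Running `nal_histogram(open(path, "rb").read())` on each file and comparing the counts of VPS, SPS, and PPS units shows whether the two encoders lay out their parameter sets differently.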

nyanmisaka commented 1 year ago

No issue with ffmpeg qsv decoder (built with onevpl). I think it should be a sample_decode issue.

ffmpeg -hwaccel qsv -hwaccel_output_format qsv -c:v hevc_qsv -i v3_1080i5994.h265 -f null -
ffmpeg version 6.0-Jellyfin Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13.1.1 (GCC) 20230429
  configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-shared --disable-libxcb --disable-sdl2 --disable-xlib --enable-gpl --enable-version3 --enable-static --enable-gmp --enable-gnutls --enable-chromaprint --enable-libfontconfig --enable-libass --enable-libbluray --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libmp3lame --enable-libopus --enable-libopenmpt --enable-libtheora --enable-libvorbis --enable-libdav1d --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-opencl --enable-vaapi --enable-amf --enable-libvpl --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
[hevc @ 0x557611b8ec80] PPS id out of range: 0
    Last message repeated 1 times
[hevc @ 0x557611b8ec80] Error parsing NAL unit #3.
[hevc @ 0x557611b8da00] Stream #0: not enough frames to estimate rate; consider increasing probesize
Input #0, hevc, from 'v3_1080i5994.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv, progressive), 1920x540 [SAR 1:1 DAR 32:9], 59.94 fps, 59.94 tbr, 1200k tbn
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_19
libva info: va_openDriver() returns 0
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_19
libva info: va_openDriver() returns 0
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> wrapped_avframe (native))
Press [q] to stop, [?] for help
[hevc_qsv @ 0x55761217a900] More data is required to decode header
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf60.3.100
  Stream #0:0: Video: wrapped_avframe, qsv(tv, top coded first (swapped)), 1920x540 [SAR 1:1 DAR 32:9], q=2-31, 200 kb/s, 59.94 fps, 59.94 tbn
    Metadata:
      encoder         : Lavc60.3.100 wrapped_avframe
frame= 1199 fps=0.0 q=-0.0 Lsize=N/A time=00:00:19.98 bitrate=N/A speed=40.8x    0x
video:562kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

nyanmisaka commented 1 year ago

Using the bitstream filter hevc_metadata of ffmpeg can fix your v3 clip.

ffmpeg -i v3_1080i5994.h265 -bsf:v hevc_metadata -c:v copy -y v3_fixed.h265
/usr/bin/vpl-sample_decode h265 -i v3_fixed.h265 -o /dev/null -timeout 10000

DaveHu-TVU commented 1 year ago

Hi @nyanmisaka Thanks for your help. I did some tests and have more information about this issue. With VPL 2022Q2 the issue does not occur; with VPL 2022Q3 it does. I then kept VPL 2023Q1 but replaced the Intel media driver with 22.4.4, and that works too. So I think this is an issue introduced in the Intel media driver after version 22.4.4. Can you provide a patch to fix it? Thanks.

DaveHu-TVU commented 1 year ago

Hi @nyanmisaka I saw in your log that libva reports: libva info: Found init function __vaDriverInit_1_19. What version of VPL are you using? I think a different libva or media-driver version may give a different result. Thanks

nyanmisaka commented 1 year ago

Libva version is not related to this issue. I'm testing the latest tag intel-onevpl-23.3.0. Also I can't test media-driver 22.4.4 since it's too old to support my Arc discrete GPU.

I'm not from intel and probably can't help you fix this. Since the regression seems to be caused by media-driver, you can open a ticket over there.

chenhao5-Intel commented 1 year ago

Hi Dave. What was the test scenario for the version tests you described? Decode v3_1080i5994.h265 + Encode cnn.yuv? Or a decode-only test on your i9-12900H platform? And was the error returned MFX_ERR_GPU_HANG(-21) or MFX_ERR_DEVICE_FAILED(-17)?

DaveHu-TVU commented 1 year ago

Q1: Decode v3_1080i5994.h265 + Encode cnn.yuv

Q2: Both 12900H and 12700

Q3: MFX_ERR_GPU_HANG(-21)

DaveHu-TVU commented 1 year ago

Hi @chenhao5-Intel I have also reproduced the decoding failure MFX_ERR_DEVICE_FAILED(-17) using 2022Q2, but it is harder to reproduce there. I haven't found a stable way to reproduce it yet and I'm working on it. Let's focus on the 2023Q1 issue first. Thanks

chenhao5-Intel commented 1 year ago

You mean you can reproduce the encode hang issue: "[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/vpl-dispatcher_src/tools/legacy/Sample_encode/src/pipeline_encode.cpp:178" on both 12900H and 12700?

chenhao5-Intel commented 1 year ago

Hi @DaveHu-TVU @nyanmisaka I have successfully reproduced this issue on both i7-12700 and i9-12900H + Ubuntu 22.04 env.

There are two issue scenarios: (On both i7-12700 and i9-12900H)

  1. When only decoding v3_1080i5994.h265, which was encoded by Amba H2, sample_decode reports GPU_HANG:

Decoding started
Frame number: 260541, fps: 2394.031, fread_fps: 0.000, fwrite_fps: 0.000
[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, DecodeFrameAsync returned error status at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
Frame number: 260542, fps: 2394.034, fread_fps: 0.000, fwrite_fps: 0.000
[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, Unexpected error!! at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, Pipeline.RunDecoding failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/sample_decode.cpp:904

The driver log shows no related errors, while the VPL log shows cm_mem_copy.cpp[Line: 3115]CopyVideoToSys: returns MFX_ERR_GPU_HANG. Analysis is in progress.

  2. When decoding v3_1080i5994.h265 while encoding cnn.yuv at the same time, both decode and encode report GPU_HANG.

For decode:

Decoding started
Frame number: 1586, fps: 57.032, fread_fps: 0.000, fwrite_fps: 6847.8366
[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, DecodeFrameAsync returned error status at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:1980
Frame number: 1587, fps: 57.068, fread_fps: 0.000, fwrite_fps: 6849.847
[ERROR], sts=MFX_ERR_GPU_HANG(-21), RunDecoding, Unexpected error!! at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/pipeline_decode.cpp:2100
[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, Pipeline.RunDecoding failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_decode/src/sample_decode.cpp:904

For encode:

Processing started
Frame number: 1600
[ERROR], sts=MFX_ERR_GPU_HANG(-21), SynchronizeFirstTask, SyncOperation fail or timeout at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:178
[ERROR], sts=MFX_ERR_GPU_HANG(-21), GetFreeTask, m_TaskPool.SynchronizeFirstTask failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:2239
[ERROR], sts=MFX_ERR_GPU_HANG(-21), Run, m_pmfxENC->EncodeFrameAsync failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/pipeline_encode.cpp:2487
[ERROR], sts=MFX_ERR_GPU_HANG(-21), main, pPipeline->Run failed at /opt/src/sources/oneVPL-disp/tools/legacy/sample_encode/src/sample_encode.cpp:1970
Frame number: 1680 Encoding fps: 324

Analyzing the log shows that LibVA reports: [LIBVA]:CRITICAL - StatusReport:261: Something unexpected happened in HW, return error to application

As for MFX_ERR_DEVICE_FAILED(-17), it may be a duplicate of the GPU_HANG issue. So as a next step, let us focus on the decode-only v3_1080i5994.h265 scenario first, as it may be behind the other two issues.

If you have any questions, please let me know. Thanks.

BRs, Hao

DaveHu-TVU commented 1 year ago

Hi @chenhao5-Intel We are using VPL 2023Q1, so the version I compiled is oneVPL GPU Runtime 2023Q1 Release - 23.1.5 (libmfx-gen.1.2.8). I have reproduced the issue on the 12900H with different video formats (1080p5994, 1080i5994, 720p5994) and put the console logs in the attachment. I have also cut the same clip from the video with different encodings and attached the results to this comment. The files starting with msdk were generated with Media SDK encoding and the files starting with amba were generated with Amba H2 encoding.

[Uploading msdk_1080p5994.zip…]()

amba_720p5994.zip amba_1080i5994.zip

chenhao5-Intel commented 1 year ago

Hi @DaveHu-TVU and all,

We have root-caused this issue. We have updated the code and will open-source it soon.

To check this on your side, please test on the i9-12900H: run "export INTEL_MEDIA_RESET_WATCHDOG=0" first and then run the sample app commands. There should be no issues.

For Linux i7-12700, please refer to this known issue: https://community.intel.com/t5/Media-Intel-oneAPI-Video/GPU-hangs-when-decoding-2-HEVC-UHD-streams-444-10-bits-Y410/td-p/1431771
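
When the samples are launched from a test script instead of an interactive shell, the same variable can be set for the child process only. This is a minimal sketch that assumes nothing beyond the variable name and value quoted above; what the variable does inside the driver is not described in this thread, and `run_with_workaround` is a hypothetical helper name.

```python
import os
import subprocess

# Variable name and value taken from the suggestion above.
WORKAROUND_ENV = {"INTEL_MEDIA_RESET_WATCHDOG": "0"}


def run_with_workaround(cmd):
    """Run cmd with the workaround variable added to a copy of the environment."""
    env = dict(os.environ, **WORKAROUND_ENV)
    return subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
```

Passing the sample_decode or sample_encode command line as a list to `run_with_workaround` leaves the parent shell's environment untouched, so runs with and without the variable can be compared in the same session.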

DaveHu-TVU commented 1 year ago

OK, Thanks for your help, @chenhao5-Intel