Suffer GPU hang by specific HEVC transcoding in CML

zcwang commented 4 years ago

I need help with a GPU hang during HEVC transcoding on CML.

The following command causes a GPU hang with a specific HEVC video (a sample video of about 5xxMB is here).

Command

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i target-HEVC-video.mkv -vf 'deinterlace_vaapi=rate=field:auto=1,scale_vaapi=w=1920:h=1080' -c:v hevc_vaapi output.mp4

Test Environment

OS: Ubuntu 18.04 with kernel v5.7 or the latest i915 drm-tip kernel (v5.8-rc2 on 06-29).
Open Source Media Stack: 2020’Q1 release or the latest upstream on 7/1/2020
FFmpeg version: the latest upstream code on 7/1 (commit id e409262837, "avutil/common: Fix integer overflow in av_ceil_log2_c()")
vainfo: VA-API version: 1.8 (libva 2.8.0.pre1)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.3.pre (adc23261)

GPU Hang in vcs, ...

Jul 1 11:50:46 intel-NUC kernel: [ 9831.062462] i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out
Jul 1 11:50:46 intel-NUC kernel: [ 9831.062468] i915 0000:00:02.0: [drm] ffmpeg[3208] context reset due to GPU hang
Jul 1 11:50:46 intel-NUC kernel: [ 9831.062510] i915 0000:00:02.0: [drm:__i915_request_reset [i915]] client ffmpeg[3208]: gained 1 ban score, now 1
Jul 1 11:50:46 intel-NUC kernel: [ 9831.063554] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:4:a8fffffd, in ffmpeg [3208]
…

ERROR: 0x00000000
DONE_REG: 0xffffffff
FAULT_TLB_DATA: 0x00000011 0xb442c1b0
Address 0x00001b442c1b0000 GGTT
GTT_CACHE_EN: 0xf0007fff
vcs0 command stream:
  CCID: 0x00000000
  START: 0x00011000
  HEAD: 0x00000268 [0x00000230]
    head = 0x00000268, wraps = 0
  TAIL: 0x00000ee0 [0x00000270, 0x00000298]
  CTL: 0x00003001
    len=16384, enabled
  MODE: 0x00000000
  HWS: 0xfffe3000
  ACTHD: 0x00000000 000b3924
    at ring: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x13000002
  ESR: 0x00000000
  INSTDONE: 0xbbffffff
  batch: [0x00000000_000b3000, 0x00000000_000bb000]
  BBADDR: 0x00000000_000b3925
  BB_STATE: 0x00000020
  INSTPS: 0x00009080
  INSTPM: 0x00000000
  FADDR: 0x00000000 000b3b00
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x00000006237ef000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  engine reset count: 0
  ELSP[0]: pid 2486, seqno 18:00000044, prio 0, head 00000e70, tail 00000ee0
  ELSP[1]: pid 2485, seqno 1c:00000002, prio 0, head 00000000, tail 00000068
  Active context: ffmpeg[2486] prio 0, guilty 1 active 0, runtime total 4540598ns, avg 3970720ns

Please refer to the log files in ffmpeg-gpu-hang-gary-0701.zip.

zcwang commented 4 years ago

The issue cannot be reproduced with MSDK’s transcoding sample using the command “sample_multi_transcode -i::h265 ~/input.h265 -deinterlace -o::h265 test-output.h265 -w 1920 -h 1080”.

I put the successfully transcoded 1080p video (downscaled from 2160p) here.

fulinjie commented 4 years ago

Ping. This GPU hang occurs intermittently during decoding of some clips with missing refs.

dmitryermilov commented 4 years ago

> The issue cannot be reproduced with MSDK’s transcoding sample using the command “sample_multi_transcode -i::h265 ~/input.h265 -deinterlace -o::h265 test-output.h265 -w 1920 -h 1080”.
>
> I put the successfully transcoded 1080p video (downscaled from 2160p) here.

@fulinjie, if the MSDK decoder can handle the stream, perhaps a workaround (WA) is possible on the ffmpeg side?

fulinjie commented 4 years ago

Hi @dmitryermilov ,

The main reasons for this issue:

• The clip doesn’t start from an IRAP (intra random access point) frame.
  o Hence the first 50 frames lack a valid reference list and cannot be decoded correctly.
  o The missing reference at the application level also leads to a null pointer in the driver; however, that should not lead to a GPU hang. It seems to be related to the error tolerance/handling for null pointers in the driver.
• Note that it is only reproduced in multi-thread mode; “-threads 1” does not trigger this GPU hang.
• The reason MSDK works:
  o Sample decode seems to check the reference list dependency and simply skips the first 50 invalid frames, so it decodes only the last 50 decodable frames:

    $ ./sample_decode h265 -i input-100frames.h265 -o /dev/null
    Decoding started
    Frame number: 50, fps: 12.097, fread_fps: 0.000, fwrite_fps: 12.712
    Decoding finished

> @fulinjie, if the MSDK decoder can handle the stream, perhaps a workaround (WA) is possible on the ffmpeg side?

Yes, I'm working on a WA in FFmpeg to skip the invalid frames (which contradicts the native decoding pipeline), but IMHO it would be better to prevent the GPU hang regardless of whether we have the "valid check" or not. (Note that only some of the bitstreams with missing references lead to this GPU hang.)

P.S. FYI, the internal discussion is accessible at: https://jira.devtools.intel.com/browse/VIZ-16147
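
The "start decoding from an IRAP frame" idea discussed above can be illustrated with a small sketch. This is not the FFmpeg workaround itself; it is a minimal Python model, assuming an HEVC Annex B byte stream, that reads the NAL unit type from the first header byte and drops picture data until the first IRAP picture (NAL unit types 16 to 23 in the HEVC spec). The function names are hypothetical, and RASL pictures that follow a CRA picture are not handled.

```python
# Sketch only: split an HEVC Annex B stream into NAL units and drop all
# picture (VCL) data that precedes the first IRAP picture.
# All function names here are illustrative, not FFmpeg or MSDK APIs.

IRAP_NAL_TYPES = range(16, 24)  # BLA_W_LP (16) through RSV_IRAP_VCL23 (23)
FIRST_NON_VCL_TYPE = 32         # VPS=32, SPS=33, PPS=34, ... are non-VCL

START_CODE = b"\x00\x00\x01"


def split_annexb(stream):
    """Return the NAL units of an Annex B stream with start codes removed."""
    nals = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = stream.find(START_CODE, start)
        end = len(stream) if nxt == -1 else nxt
        # Trailing zero bytes belong to the next (4-byte) start code or to
        # stream padding; a NAL unit itself never ends with 0x00.
        nal = stream[start:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
        pos = nxt
    return nals


def nal_unit_type(nal):
    """HEVC NAL header: 1 forbidden bit, then 6 bits of nal_unit_type."""
    return (nal[0] >> 1) & 0x3F


def skip_until_irap(nals):
    """Keep parameter sets, but drop VCL NAL units before the first IRAP."""
    kept = []
    seen_irap = False
    for nal in nals:
        nal_type = nal_unit_type(nal)
        if nal_type in IRAP_NAL_TYPES:
            seen_irap = True
        is_vcl = nal_type < FIRST_NON_VCL_TYPE
        if is_vcl and not seen_irap:
            # This picture may reference frames that were never received.
            continue
        kept.append(nal)
    return kept
```

For the clip described in this issue, such a filter would drop the leading frames that lack a valid reference list and pass only the decodable part to the hardware decoder.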

dmitryermilov commented 4 years ago

Yes, I fully understand you, @fulinjie. It goes without saying that the UMD should attempt to prevent GPU hangs. My point is that, ideally, each component in the media stack should be error tolerant, so that problems one component can't handle are handled by another component.

> simply skipped the first 50 invalid frames

The motivation here is not just to "simply" skip as many frames as possible :) There should be a balance between:

• following the decoding process as described in the spec
• user experience: even if we can output these 50 frames (which will be fully corrupted) without a GPU hang, does a user really want to watch them on the screen?
• error tolerance and error recovery

fulinjie commented 4 years ago

> > simply skipped the first 50 invalid frames
>
> The motivation here is not just to "simply" skip as many frames as possible :) There should be a balance between:
>
> • following the decoding process as described in the spec
> • user experience: even if we can output these 50 frames (which will be fully corrupted) without a GPU hang, does a user really want to watch them on the screen?
> • error tolerance and error recovery

Yep, agreed. In these clips the skipped frames are useless and contain only garbage, so it is better to skip them. That's why I'm working on some WAs in FFmpeg to start decoding from IRAP frames: https://github.com/fulinjie/ffmpeg/commit/8926ae48ba7316cdebe59c27d4b6a01bb766ce00

The GPU hang is hidden after applying the above patch. However, since we've caught this hang issue, IMHO it would be good to add the corresponding error tolerance in media-driver.

XinfengZhang commented 4 years ago

@wangyan-intel, could we add a check in EndPicture? If there is no reference frame, media-driver should return a failure instead of sending a real command buffer to the GPU, to avoid the GPU hang.
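
As a rough illustration of the check proposed here, the sketch below models the idea in Python: validate the reference list before any work is submitted, and return an error instead of submitting when a reference is absent or unknown. Everything in it (`Status`, `PictureParams`, `end_picture`, the `submit` callback) is a hypothetical stand-in and not media-driver code.

```python
# Hypothetical model of a pre-submission reference check. None of these
# names come from media-driver; they only illustrate the control flow.
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCESS = 0
    ERROR_INVALID_REFERENCE = 1  # hypothetical error code


@dataclass
class PictureParams:
    needs_references: bool                 # False for intra-only pictures
    ref_surface_ids: list = field(default_factory=list)


def end_picture(params, allocated_surfaces, submit):
    """Submit the picture only if every reference points at a known surface.

    `submit` stands in for sending the real command buffer to the GPU.
    """
    if params.needs_references:
        if not params.ref_surface_ids:
            # No reference frame at all: fail early, nothing reaches the GPU.
            return Status.ERROR_INVALID_REFERENCE
        for surface_id in params.ref_surface_ids:
            if surface_id not in allocated_surfaces:
                # Reference to a surface the driver does not know about.
                return Status.ERROR_INVALID_REFERENCE
    submit(params)
    return Status.SUCCESS
```

The point of the design is that the failure path returns before `submit` is called, so a bad reference list costs one error code rather than an engine reset.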

zcwang commented 4 years ago

@XinfengZhang, sorry to bother you. Is there any possible direction for this issue?

wangyan-intel commented 4 years ago

I will take a look. Sorry for slow response.

wangyan-intel commented 4 years ago

@weizhu-intel Could you please help take a look? Thanks.

weizhu-intel commented 4 years ago

Hi Linjie & zcwang, I tried this on my side and found that ffmpeg still passes a ref_pic_id even when the reference is missing. This causes unexpected behavior. Sometimes the surface has no gmm resource info, so we can detect it in EndPicture and return an error. Sometimes it has erroneous gmm resource info, and this leads to the hang.

So could you pass an invalid surface ID instead of a real ref_pic_id when the reference is missing, so that our driver can detect this?

Thanks,
Wayne Zhu
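
A minimal sketch of this request from the application side, in Python: when a wanted reference picture was never decoded, put a sentinel ID in the reference list instead of a stale surface ID, so the consumer can spot the gap. The sentinel value mirrors libva's `VA_INVALID_SURFACE` (0xFFFFFFFF), which is assumed to be what "invalid surface ID" means here; the function names and the POC-to-surface mapping are hypothetical.

```python
# Sketch only: build a reference list that marks missing references with a
# sentinel. The helper names are illustrative, not FFmpeg or libva APIs.

# Assumption: same value as libva's VA_INVALID_SURFACE.
INVALID_SURFACE_ID = 0xFFFFFFFF


def build_reference_ids(wanted_pocs, decoded_surfaces):
    """Map each wanted picture order count (POC) to its surface ID.

    `decoded_surfaces` maps POC -> surface ID for pictures that were
    actually decoded. Missing pictures get INVALID_SURFACE_ID.
    """
    return [decoded_surfaces.get(poc, INVALID_SURFACE_ID) for poc in wanted_pocs]


def has_missing_reference(ref_ids):
    """What a consumer could check before submitting work to the GPU."""
    return any(surface_id == INVALID_SURFACE_ID for surface_id in ref_ids)
```

With the sentinel in place, detection no longer depends on whether a stale ID happens to map to a surface with valid-looking resource info.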

zcwang commented 4 years ago

The issue was fixed by the following patch (part of the intel-ffmpeg-patch set included in the Media-Driver 2020Q3 release, but not upstream): https://github.com/intel-media-ci/intel-ffmpeg-patch/blob/master/0057-lavc-vaapi_hevc-add-skip_frame-invalid-to-skip-inval.patch

zcwang commented 4 years ago

@weizhu-intel and @dmitryermilov, do you think this issue should be fixed by the ffmpeg patch or in media-driver? Thanks!
https://patchwork.ffmpeg.org/project/ffmpeg/list/?series=2021

Gary

Jexu commented 1 year ago

This issue should be fixed in the latest media driver. Could you try it again? The hang is gone on my side with the latest driver. By the way, the driver fix is to skip decoding if a reference frame is missing.

Jexu commented 1 year ago

Let me close this issue now, since it is fixed in the media driver. You can also add a strict check for invalid reference frames in ffmpeg or VPL as an option. Please re-open it if you have any other questions.

intel / media-driver

Suffer GPU hang by specific HEVC transcoding in CML #992