intel / intel-vaapi-driver

VA-API user mode driver for Intel GEN Graphics family
https://01.org/linuxmedia

Enabling mb_rate_control kills whole machine (Skylake GT2) #172

Open fhvwy opened 7 years ago

fhvwy commented 7 years ago

Build ffmpeg git master with @mypopydev's patch to add the mb_rate_control option: https://lists.ffmpeg.org/pipermail/ffmpeg-devel/2017-May/211334.html.

Input file doesn't seem to matter much. To be consistent I am using the Big Buck Bunny 1080p file here.

Take steps to avoid data loss (remount all data mounts readonly, sync).

Run:

./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_1080_264.mp4 -an -c:v h264_vaapi -b 1M -mb_rate_control 1 /tmp/out.h264

After some frames (not repeatable between runs, but at most a few hundred) the machine becomes completely unresponsive.

On some runs I get a GPU hang log on the console (transcribed) before it locks up, but not consistently:

[drm] GPU HANG ecode 9:0:0x8fd0ffff, in ffmpeg_g [2669], reason: Hang on render ring, action: reset
[drm] {the usual GPU hang bug warning}
[drm] drm/i915: Resetting chip after gpu hang
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110

Power-cycle to recover the machine.

Setup:

There are probably at least two issues here: in the VAAPI driver (because enabling mb_rate_control has broken the GPU) and in the kernel (because it didn't recover). I've only sent this here because the reproducer is here, but please do forward this if appropriate.

Possibly relevant: The same ffmpeg command with the mb_rate_control option works fine on a Skylake 6260U (GT3, 48 EUs). Could there be something about the proprietary shader binaries which only works on the larger GPU and breaks horribly on the smaller one?

fhvwy commented 7 years ago

Behaviour is identical with the 1.8.1 release.

Whether console output appears seems to depend on whether the full DRM framebuffer is being used. If it is, then taking out the GPU kills the output entirely and I don't get anything. If not, the output doesn't die and gives the log above before locking up. Maybe a serial console would be able to get more output if there is any (a panic log, perhaps)?

fhvwy commented 7 years ago

Has anyone been able to reproduce this? The failure is completely consistent for me, always killing the whole machine when running as above.

Is there anything else I can do to help debug it?

xhaihao commented 7 years ago

@fhvwy we will give it a try with your patch.

Brainiarc7 commented 7 years ago

I'll test this on a similar workstation and report back.

wangzj0601 commented 6 years ago

I cannot reproduce this issue after applying the patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch (applied by copying the code line by line, because the patch is too old) on ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915.

Command line:

./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i /media/h264_container/720p.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 ./out.h264

Env:

Processor: Skylake ULX (Intel(R) Core(TM) m5-6Y57 CPU)
GT info: GT2 (0x191E)
Kernel version: 4.12.0-rc2
ffmpeg: repo https://git.ffmpeg.org/ffmpeg.git commit c+patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch (applied by copying the code line by line, because the patch is too old)
Libva: 2.0.1.pre1 master branch commit 51e98b1224794a44ba097baa7a1b4e35c3596d0c
intel_driver: 2.0.1.pre1 master branch commit 35fc70f09e343ccb91c7957757ec27a5c0f9fcd1 repo: https://github.com/01org/intel-vaapi-driver.git

wangzj0601 commented 6 years ago

I have uploaded my patched vaapi_encode_h264.c. Change the attachment's extension to .c and use it in place of the original vaapi_encode_h264.c in ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915: vaapi_encode_h264.txt

fhvwy commented 6 years ago

I tried this again on the same machine (Skylake 6300), with slightly newer software. The problem persists, but the machine is no longer hard-reset by the operation so I am able to extract some debug information. The graphics core is still completely dead, and doesn't work at all until the machine is rebooted.

Using:

Kernel output:

[ 2249.401011] [drm] GPU HANG: ecode 9:0:0x8fd0fffe, in ffmpeg_g [9317], reason: Hang on rcs0, action: reset
[ 2249.401012] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2249.401012] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2249.401012] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2249.401013] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2249.401013] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2249.401028] drm/i915: Resetting chip after gpu hang
[ 2250.107308] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ 2250.107433] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

DRM error dump: http://ixia.jkqxz.net/~mrt/i965/bug172_drm_error.
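For anyone triaging similar reports, the hang line in the kernel output above can be picked apart mechanically. The following is a minimal C sketch and not part of the driver or the kernel: `parse_gpu_hang` and `struct gpu_hang_info` are hypothetical names, and reading the three `ecode` fields as GPU generation, engine index and a hash is an assumption about the i915 log format rather than something stated in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one "GPU HANG" log line. */
struct gpu_hang_info {
    int gen;            /* assumed: GPU generation (9 for Skylake) */
    int engine;         /* assumed: index of the hung engine */
    unsigned int ecode; /* assumed: hash identifying the hang */
    char process[64];   /* name of the process blamed for the hang */
    int pid;            /* pid printed in square brackets */
};

/*
 * Parse a line such as
 *   "[drm] GPU HANG: ecode 9:0:0x8fd0fffe, in ffmpeg_g [9317], reason: ..."
 * Returns 0 on success, -1 if the line does not contain the expected fields.
 */
int parse_gpu_hang(const char *line, struct gpu_hang_info *out)
{
    assert(line != NULL && out != NULL);

    /* Skip the timestamp and "[drm]" prefix by searching for "ecode ". */
    const char *p = strstr(line, "ecode ");
    if (p == NULL)
        return -1;

    int n = sscanf(p, "ecode %d:%d:0x%x, in %63s [%d]",
                   &out->gen, &out->engine, &out->ecode,
                   out->process, &out->pid);
    return n == 5 ? 0 : -1;
}
```

Applied to the first line of the kernel output above, this yields gen 9, engine 0, ecode 0x8fd0fffe, process ffmpeg_g and pid 9317. Because it searches for `ecode ` rather than the full prefix, it also accepts the hand-transcribed variant from the original report.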

wangzj0601 commented 6 years ago

I tried another SKL unit; this issue still cannot be reproduced with ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915 + patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch

CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
VGA: VGA compatible controller [0300]: Intel Corporation Sky Lake Integrated Graphics [8086:1912] (rev 06)
ffmpeg compilation cmd: --enable-vaapi --prefix=/opt/yami/ffmpeg

Full output when running the ffmpeg command with the mb_rate_control option:

root@yami-skl:~/build/ffmpeg# ./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i /media/h264_container/720p.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 ./out.h264
ffmpeg version N-88605-g991eca0 Copyright (c) 2000-2017 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 20160609
  configuration: --enable-vaapi --prefix=/opt/yami/ffmpeg
  libavutil 56. 0.100 / 56. 0.100
  libavcodec 58. 1.100 / 58. 1.100
  libavformat 58. 2.100 / 58. 2.100
  libavdevice 58. 0.100 / 58. 0.100
  libavfilter 7. 0.101 / 7. 0.101
  libswscale 5. 0.101 / 5. 0.101
  libswresample 3. 0.101 / 3. 0.101
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/media/h264_container/720p.mp4':
  Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2avc1mp41
    encoder : Lavf57.26.100
  Duration: 00:00:03.34, start: 0.000000, bitrate: 4096 kb/s
    Stream #0:0(eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 4092 kb/s, 29.98 fps, 29.97 tbr, 16016 tbn, 60.67 tbc (default)
    Metadata:
      handler_name : VideoHandler
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
Output #0, h264, to './out.h264':
  Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2avc1mp41
    encoder : Lavf58.2.100
    Stream #0:0(eng): Video: h264 (h264_vaapi) (High), vaapi_vld, 1280x720 [SAR 1:1 DAR 16:9], q=0-31, 1000 kb/s, 29.97 fps, 29.97 tbn, 29.97 tbc (default)
    Metadata:
      handler_name : VideoHandler
      encoder : Lavc58.1.100 h264_vaapi
frame= 100 fps=0.0 q=-0.0 Lsize= 396kB time=00:00:03.30 bitrate= 981.0kbits/s speed=13.6x
video:396kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000000%

fhvwy commented 6 years ago

@wangzj0601 What input file are you using? Have you tried encoding more than 100 frames? The failure is very consistent for me, but how long it takes varies by file and other settings (though usually around 200 frames).

E.g. with the 1080p "Big Buck Bunny" file running:

./ffmpeg_g -v 55 -y -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_1080_264.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 out.h264

the GPU always dies when encoding frame 234.

Wrt the SKU you are using, have you tried one with 23 EUs rather than 24? That is one possible difference which I suggested above and haven't been able to check. (I think both the 6Y57 and 6600K will have 24, though do correct me if I'm wrong.)

xhaihao commented 6 years ago

@wangzj0601 could you try ffmpeg 2fdc9f7c4939f83a6c9d1f9d85b6d37ce0bab714 + http://ixia.jkqxz.net/~mrt/i965/mb_rc.patch? Mark has rebased the ffmpeg patch against a newer version of FFmpeg.

@fhvwy I think your SKL should have 24 EUs; the PCI ID is 0x1912 in your DRM error dump. Why do you think your machine has 23 EUs?

fhvwy commented 6 years ago

@xhaihao See table and notes in https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol04-configurations.pdf - "[a] Particular SKUs produced by Intel may have one EU disabled.". It's visible at runtime in Beignet, which indicates that it has 23 compute units while other similar machines have 24. (I assume there is an ioctl() somewhere which will return how many there are.)

fhvwy commented 6 years ago

$ uname -s -v
Linux #1 SMP Debian 4.13.13-1 (2017-11-16)
$ cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-6300 CPU @ 3.80GHz
$ cat eu_count.c 
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <intel_bufmgr.h>

int main(int argc, const char **argv)
{
    const char *device;
    int err;

    if (argc == 1)
        device = "/dev/dri/renderD128";
    else if (argc == 2)
        device = argv[1];
    else {
        fprintf(stderr, "Usage: %s <drm-device>\n", argv[0]);
        return 1;
    }

    err = open(device, O_RDWR);
    if (err < 0) {
        fprintf(stderr, "Failed to open device %s: %m.\n", device);
        return 1;
    }
    int fd = err;

    unsigned int eu_total = 0;
    err = drm_intel_get_eu_total(fd, &eu_total);
    if (err < 0) {
        fprintf(stderr, "Failed to get EU total: %m.\n");
        close(fd);
        return 1;
    }

    printf("EU total: %u\n", eu_total);

    close(fd);

    return 0;
}
$ gcc eu_count.c $(pkg-config --libs --cflags libdrm libdrm_intel)
$ ./a.out 
EU total: 23

yakuizhao commented 6 years ago

Thanks for sharing the detailed info. The current code already tries to query the EU_count by using drm ioctl.

intel->eu_total = 0;
if (intel_driver_get_param(intel, LOCAL_I915_PARAM_EU_TOTAL, &ret_value)) {
    intel->eu_total = ret_value;
}

lizhong1008 commented 6 years ago

I also tried to reproduce this issue on my KBL (i7-7567U) but could not. It looks like it only happens on specific CPUs. @wangzj0601 could you try to find a Skylake 6300 (or some other version with 23 EUs) to reproduce it?