intel / libyami

Yet Another Media Infrastructure. It is the core part of a media codec with hardware acceleration; it is yummy for your video experience on Linux-like platforms.
Apache License 2.0

Intermittent failed case for VPP of CSC+Sharpness by using yamitranscode on Fedora and ubuntu::yakkety #773

Open FocusLuo opened 7 years ago

FocusLuo commented 7 years ago

Using the latest commit on master of yami and libva/intel-driver.

Test CMD:
yamivpp .//1920x1080.nv12 -s 59 ./1920x1080.yv12
yamivpp .//1920x1080.yv12 -s 59 ./1280x720.i420

FocusLuo commented 7 years ago

vpp_clips.zip

xuguangxin commented 7 years ago

We have set up the Fedora 25 environment and are trying to reproduce the issue.

Zhziyao commented 7 years ago

I have set up a Fedora 25 environment on several APL machines. I built yami with the configure options found in the build log at http://media-ci.ostc.intel.com:8810/dashboard and ran the test command above thousands of times. However, the issue did not show up. Next I will try to reproduce it with Docker.

uartie commented 7 years ago

What result are you expecting? I don't think this reported issue description tells the whole story.

The actual issue is that the output of the above test command is not always the same. The test compares results via md5sum, and the md5sum of the output intermittently changes from run to run (i.e. md5sum ./1920x1080.yv12 is not always the same).

I also don't see what yamitranscode (mentioned in the issue title) has to do with this.

uartie commented 7 years ago

Also, when the md5sum result is not the expected one, I've seen an associated GPU hang on 4.10 and 4.11 kernels:

[23010.721025] drm/i915: Resetting chip after gpu hang
[23010.723370] [drm] RC6 on
[23010.724143] [drm] GuC firmware load skipped

uartie commented 7 years ago

I'm able to reproduce it at least once every ~200-300 sequential runs.

uartie commented 7 years ago

md5sum of ./1920x1080.yv12 output should be f15e2b55a786fcf691f8e9d79e91653d
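For anyone trying to reproduce this, the run-to-run md5sum comparison described in this thread could be scripted roughly as follows. This is only a sketch: the helper names `md5_of_file` and `count_mismatches` are made up for illustration and are not part of libyami or its test suite. The expected hash is the one quoted above.

```python
import hashlib
import subprocess

# Expected hash of ./1920x1080.yv12 as quoted in this thread.
EXPECTED_MD5 = "f15e2b55a786fcf691f8e9d79e91653d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(cmd, output_path, expected_md5, runs):
    """Run cmd sequentially `runs` times (hypothetical helper).

    After each run, hash output_path and compare it with expected_md5.
    Returns a list of (run_number, actual_md5) for every mismatching run.
    """
    mismatches = []
    for run_number in range(1, runs + 1):
        # check=True raises CalledProcessError if the command itself fails.
        subprocess.run(cmd, check=True)
        actual = md5_of_file(output_path)
        if actual != expected_md5:
            mismatches.append((run_number, actual))
    return mismatches
```

With the first test command from the report, a call might look like `count_mismatches(["yamivpp", ".//1920x1080.nv12", "-s", "59", "./1920x1080.yv12"], "./1920x1080.yv12", EXPECTED_MD5, runs=300)`. A non-empty result would mean the output changed on at least one run.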

Zhziyao commented 7 years ago

@uartie Thank you for your detailed explanation. I understand the issue much more clearly now.

xuguangxin commented 7 years ago

@uartie, ziyao used md5sum to check the command result. He can't reproduce the issue on his APL machine. Is it possible that it is related to the CPU stepping? Could you share your CPU stepping with ziyao by mail, so he can compare the CPU info?

xuguangxin commented 7 years ago

use "lspci -nn |grep VGA"
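To compare steppings across machines without reading the output by eye, the vendor:device pair and revision could be pulled out of the `lspci -nn` VGA line with something like the sketch below. The function name `parse_lspci_vga` is hypothetical, and the regular expression only assumes the usual `[vvvv:dddd] (rev xx)` layout that `lspci -nn` prints.

```python
import re

# Matches "[8086:5a85]" optionally followed by " (rev 0b)".
# The class code such as "[0300]" has no colon, so it is not matched.
_ID_RE = re.compile(
    r"\[(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})\]"
    r"(?:\s+\(rev (?P<rev>[0-9a-fA-F]{2})\))?"
)


def parse_lspci_vga(line):
    """Extract vendor, device and revision from one `lspci -nn` line.

    Returns a dict with keys "vendor", "device", "rev" (rev is None when
    the line carries no revision), or None if no ID pair is found.
    """
    match = _ID_RE.search(line)
    if match is None:
        return None
    return match.groupdict()
```

Two machines then have the same GPU stepping only if both the `device` and `rev` fields agree.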

uartie commented 7 years ago

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5a85] (rev 0b)

xuguangxin commented 7 years ago

OK, just checked: we do not have rev 0b. @uartie, do you have another stepping? We also checked the kernel version. We use Fedora 25 with 4.8.6-300.fc25.x86_64, which is not the same as your kernel version. What OS version are you using?

uartie commented 7 years ago

@xuguangxin, no I don't have another stepping locally. We use Fedora 25 host with updated kernel (via dnf package manager) and Ubuntu Xenial (16.04) host with updated kernel (via apt package manager).

Please try to update your Fedora 25 packages (including kernel) via dnf update and see if that can reproduce afterwards.

Zhziyao commented 7 years ago

Sorry for not explaining my earlier work clearly.

  1. I updated the kernel to the latest version and ran the test on the APL machine.
  2. I also installed Docker and pulled the Fedora 25 image from the Intel repo, set up the environment with RETOOL, and ran the test in the Fedora 25 container.

However, the issue did not show up under either condition. I also saved the output of dmesg | grep -i gpu after each loop, but found no "GPU HANG" message.

xuguangxin commented 7 years ago

It seems to be a stepping-specific issue. Sadly, U.Artie's stepping is higher than Ziyao's. Let us find a machine with stepping rev 0b.

Zhziyao commented 7 years ago

It does look like a stepping issue: I can reproduce the issue on the machine supplied by uartie.

uartie commented 7 years ago

Ok, please continue root-causing on the APL I've supplied you.

uartie commented 7 years ago

@Zhziyao, @xuguangxin this issue shows up on BSW, too. It would be strange for both APL and BSW to be hit by a stepping issue.

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:22b1] (rev 21)

xuguangxin commented 7 years ago

@Zhziyao, could you find a BSW machine to reproduce this issue?

xuguangxin commented 7 years ago

@Zhziyao , any update on this?

Zhziyao commented 7 years ago

I can't reproduce this issue on BSW either, and I have just finished setting up the test environment on another machine. I wonder whether some difference between uartie's test environment and mine is what keeps me from reproducing the issue. I will give uartie my host machine address on Slack so he can check.

uartie commented 6 years ago

Any progress with identifying/reproducing this issue on your end? I am attaching the i915_error_state generated when the GPU hang occurs.

i915_error_state.gz
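For reference, the per-loop hang check mentioned earlier in this thread (`dmesg | grep -i gpu`) could be narrowed to hang messages and automated along these lines. This is a sketch under assumptions: the function names are hypothetical, and it assumes the hang is reported with the text "gpu hang" (any case), as in the log excerpt quoted above.

```python
import re
import subprocess

# Case-insensitive match for lines such as
# "drm/i915: Resetting chip after gpu hang".
_HANG_RE = re.compile(r"gpu hang", re.IGNORECASE)


def read_dmesg():
    """Return the current kernel log as text (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def find_gpu_hang_lines(dmesg_text):
    """Return every line of dmesg_text that mentions a GPU hang."""
    return [line for line in dmesg_text.splitlines() if _HANG_RE.search(line)]


def new_hang_lines(before_text, after_text):
    """Return hang lines present in after_text but not in before_text.

    Intended use: capture dmesg before and after one test run, so that
    only hangs triggered by that run are reported.
    """
    seen = set(find_gpu_hang_lines(before_text))
    return [line for line in find_gpu_hang_lines(after_text) if line not in seen]
```

Calling `read_dmesg()` before and after each run and passing both snapshots to `new_hang_lines` would tie a hang to the specific run whose md5sum changed.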