intel / vpl-gpu-rt

MIT License

Sometimes the driver seems to die with device error -17 and only recovers via a reboot #290

Open oviano opened 1 year ago

oviano commented 1 year ago

This is Windows 11 Pro, 3802 driver, 2022.2.5 oneVPL release.

For example, I'm transcoding with FFmpeg, decoding with hevc_qsv and encoding to av1_qsv, and it's working fine. I then leave the PC idle for an hour or two, come back and run the exact same command, and this time I get the dreaded device error -17 from the tool when trying to decode. It never recovers; repeated attempts with the previously working command just produce the same error.

It happens perhaps once or twice a day, apparently at random, i.e. it is not preceded by any other errors or unexpected behaviour.

I have tried disabling/re-enabling the device in Control Panel, but it seems the only cure is to reboot the PC.

Firstly, is this a known issue?

Secondly, is there another approach I can take from an API level that would "reset" the state of the device?

The GPU is being used only for decoding/encoding. I use the integrated Intel GPU for the display.

rupakroyintel commented 1 year ago

@oviano This type of issue can be caused by the additional allocation of hardware buffers by FFmpeg that weren't freed in time. Could you please provide the command line that you are using and the screenshot of the error that you got? Thanks.

oviano commented 1 year ago

Next time it happens, I'll post back with more details.

From memory, it isn't one particular command, and it can work for hours and then suddenly produce this error.

Would a memory dump of my system help when it gets into this state?

rupakroyintel commented 1 year ago

@oviano Yes, a memory dump might be useful. Even if you do not know exactly which command line is the problem, some example command lines could give us some clues. Meanwhile, you can try to update the driver. 31.0.101.3959 is the latest beta driver for Intel® Arc™ A-Series Graphics.

oviano commented 1 year ago

Here's a command that produced the error yesterday. Unfortunately, it produced the error but then also crashed the machine completely so I couldn't screen grab it. But in the couple of seconds before it died completely I saw error -17. I'll post the full trace next time it happens and doesn't crash.

ffmpeg -init_hw_device qsv=intel,child_device=1 -i football.mp4 -pix_fmt p010 -vcodec av1_qsv -profile:v main -level 51 -preset veryslow -scenario archive -vb 3000k -maxrate 3000k -look_ahead_depth 100 -g 300 football_qsv_av1_3000k_cbr.mp4 -y

By the way, it has also happened with variations of the above: without look_ahead_depth, without the 8-bit to 10-bit conversion, with different bitrates, etc. I also tried decoding in hardware (the command above decodes in software). The source is HEVC. I tried async_depth = 1 too. I've basically tweaked the command in numerous ways, but eventually it always either produces device error -17 or crashes the PC.
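Since the machine can die before the error can be screen-grabbed, one way to keep the trace is to write FFmpeg's output to disk as it arrives. The following is only a sketch of that idea in Python; `run_and_log` and the log file name are hypothetical helpers, not part of FFmpeg or oneVPL.

```python
import os
import subprocess


def run_and_log(cmd, log_path):
    """Run cmd and copy its combined stdout/stderr to log_path line by line.

    Every line is flushed and fsync'ed, so whatever was printed before a
    freeze or reset should already be on disk.
    """
    with open(log_path, "a", encoding="utf-8", errors="replace") as log:
        # Record the command itself so the log is self-describing.
        log.write("$ " + " ".join(cmd) + "\n")
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # FFmpeg logs to stderr; merge both
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()


# Hypothetical usage with the command from this thread:
# run_and_log(["ffmpeg", "-init_hw_device", "qsv=intel,child_device=1",
#              "-i", "football.mp4", "-vcodec", "av1_qsv", "out.mp4"],
#             "qsv_run.log")
```

Syncing after every line is slow, but here the goal is to lose as little of the log as possible rather than to run fast.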

Maybe this issue relates to my ongoing stability issues, maybe it's even the same issue:

https://community.intel.com/t5/Graphics/Both-Intel-ARC-A380-and-A770-crash-my-Windows-11-system/td-p/1432423

(there are a couple of memory dumps in my recent post in that thread, one of which was produced after a freeze following a command like the above, although it did not display a device error -17).

oviano commented 1 year ago

The new driver is no better.

I have a batch file which executes 12 transcodes similar to the above, with different bitrates and max rates (to switch between cbr and vbr).
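For anyone trying to reproduce this, a batch like the one described could be sketched roughly as below. The rate pairs are made up (the thread does not list the real twelve), and the sketch assumes that setting the bitrate equal to the maxrate is how the CBR runs are selected. `build_command` and `run_batch` are illustrative names only.

```python
import subprocess

# Hypothetical (bitrate, maxrate) pairs in kbit/s. Equal values stand for
# the CBR runs, a higher maxrate for the VBR runs (an assumption).
RATE_PAIRS = [(3000, 3000), (3000, 4500), (6000, 6000), (6000, 9000)]


def build_command(bitrate_k, maxrate_k, src="football.mp4"):
    """Return the ffmpeg argument list for one transcode."""
    mode = "cbr" if bitrate_k == maxrate_k else "vbr"
    out = f"football_qsv_av1_{bitrate_k}k_{mode}.mp4"
    return [
        "ffmpeg", "-y",  # overwrite outputs without prompting
        "-init_hw_device", "qsv=intel,child_device=1",
        "-i", src,
        "-pix_fmt", "p010",
        "-vcodec", "av1_qsv",
        "-profile:v", "main", "-level", "51",
        "-preset", "veryslow", "-scenario", "archive",
        "-vb", f"{bitrate_k}k", "-maxrate", f"{maxrate_k}k",
        "-look_ahead_depth", "100", "-g", "300",
        out,
    ]


def run_batch(pairs, runner=subprocess.run):
    """Run each transcode in turn and stop at the first non-zero exit code.

    Returns a list of (output_file, returncode) for the runs attempted.
    """
    results = []
    for bitrate_k, maxrate_k in pairs:
        cmd = build_command(bitrate_k, maxrate_k)
        rc = runner(cmd).returncode
        results.append((cmd[-1], rc))
        if rc != 0:
            break
    return results
```

Calling `run_batch(RATE_PAIRS)` runs the encodes one after another, and the last entry in the returned list shows which output was in progress when a run failed.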

With this A380 in this machine, the batch has yet to complete without crashing the machine, whether on this beta driver or the previous two (3802 and 3490).

On two occasions today it has blue-screened and claimed to be writing a dump file, but it stuck on 0% and never progressed, and I was forced to power cycle.

The device -17 error is quite rare and I wonder if it’s the same underlying issue as these freezes/crashes but manifesting in a different way (if it was memory corruption, say). I saw it earlier but it reset the machine a few seconds later.

Full machine spec is:

Dan A4 v4 case, MSI Z690I Unify, i7-12700, Corsair CMK32GX5M2B5200C40, Seagate Firecuda 530 2TB x 2, Noctua NH-L9i-17xx Chromax Black, Corsair SF600

The A380 is only being used for these encodes, i.e. I use the IGD for display. This is because it is so unstable that if I set it as the main display GPU, it can crash the machine even when it is sitting idle.

I’ve watched the GPU temperatures etc. in the Arc panel up to the point it fails, and there is nothing untoward. The GPU rises from 55 degrees to 56 or 57 during the encode, and the fan kicks in and out occasionally.

I have posted all this info in the community forum as per the link above, and I followed all the recommendations for reinstalls, clean driver installations etc.

I raised the suggestion that maybe it was something about my motherboard being a rare one in that it has two DisplayPort IN ports. I noticed another user had described similar issues to mine, and according to his system report he also had a motherboard with DisplayPort IN ports, which is quite rare. Maybe it's just a coincidence and nothing to do with the issue, but it seemed worthy of consideration to me. Perhaps I am clutching at straws.

I don’t have an infinite amount of time to spend on this, so maybe you guys can try to reproduce it on a similar motherboard/setup. Bear in mind that this is the third ARC GPU that has produced these issues for me: an A380 that I RMA’d a few weeks ago, an A770 I have on the shelf and this new A380 I got last week. I don’t think it is a fault with my PC because it hosts an NVIDIA 3060 without any stability issues at all; FFmpeg nvenc encodes work fine and it doesn’t crash or freeze at all when the NVIDIA GPU is present.

I don’t think it’s a bug in FFmpeg as when I make the Arc my main GPU for display it freezes randomly when I’m not using FFmpeg at all.

Happy to provide further info, such as my source for the transcodes etc., if it helps.

rupakroyintel commented 1 year ago

@oviano Thanks for sharing the details with us. We have made the relevant team aware of this issue. We will get back to you soon.

ma3uk commented 1 year ago

I have a similar problem. It can arise simply by starting a new encoding process immediately after the previous one. The error can be either -17 or -16.
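For reference, these numbers are values of the oneVPL/Media SDK `mfxStatus` enum: -17 is `MFX_ERR_DEVICE_FAILED` and -16 is `MFX_ERR_UNDEFINED_BEHAVIOR`. Below is a small Python sketch for pulling such codes out of a saved log. It assumes the log prints the status as a negative integer in parentheses; `find_mfx_statuses` is a hypothetical helper, and the table only covers a few device-related codes.

```python
import re

# A few device-related values from the mfxStatus enum.
MFX_STATUS_NAMES = {
    -13: "MFX_ERR_DEVICE_LOST",
    -16: "MFX_ERR_UNDEFINED_BEHAVIOR",
    -17: "MFX_ERR_DEVICE_FAILED",
    -21: "MFX_ERR_GPU_HANG",
}

# Assumption: the status appears as a negative integer in parentheses,
# e.g. "... (-17)". Other parenthesised negative numbers would also match.
_STATUS_RE = re.compile(r"\((-\d+)\)")


def find_mfx_statuses(log_text):
    """Return (code, name) pairs for every parenthesised negative code."""
    found = []
    for match in _STATUS_RE.finditer(log_text):
        code = int(match.group(1))
        found.append((code, MFX_STATUS_NAMES.get(code, "unknown")))
    return found
```

Run over a log saved from a failing encode, this gives a quick list of which status codes appeared and in what order, which may help tell whether the -16 and -17 cases start from the same failure.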

mav-intel commented 1 year ago

Transferring to the appropriate project