projectacrn / acrn-hypervisor

Project ACRN hypervisor
BSD 3-Clause "New" or "Revised" License

GPU passthrough introduces huge latencies and crashes the system #8551

Open SIA-77 opened 8 months ago

SIA-77 commented 8 months ago

Describe the bug
GPU passthrough leads to significant sporadic interruptions (latency spikes) due to VM exits

Platform
i8250U, 8 GB RAM (board.txt attached)

Codebase
ACRN Hypervisor v3.2, ACRN kernel v3.2

Scenario
Industrial: 1 RTVM and 1 HMI VM with GPU passthrough

To Reproduce
Steps to reproduce the behavior:

  1. Launch a real-time VM (Linux + RT patch) and any non-RT VM, without GPU passthrough
  2. Run any cyclic job and measure latencies. It is better to use a 100 us period so the latency spikes show up quickly (90 s was always enough in our case). The spikes also happen with a period of 1 ms or longer, but more time is needed to detect them
  3. Do the same with GPU passthrough
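The cyclic measurement in step 2 can be sketched as a loop that waits for the next absolute deadline and records how late each wakeup actually was. This is a minimal Python illustration of the idea only; a real RT test would be a C program pinned with SCHED_FIFO using clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME), and all names here are ours, not part of any ACRN tooling:

```python
import time

PERIOD_NS = 100_000  # 100 us cycle period, as in step 2


def run_cycles(n):
    """Run n cycles; return (max wakeup latency, max observed period) in ns."""
    max_latency = 0
    max_period = 0
    next_wake = time.monotonic_ns() + PERIOD_NS
    prev_wake = None
    for _ in range(n):
        # Busy-wait until the absolute deadline; a real RT task would
        # instead block in clock_nanosleep with TIMER_ABSTIME.
        while time.monotonic_ns() < next_wake:
            pass
        now = time.monotonic_ns()
        # Wakeup latency (jitter): how far past the deadline we woke up.
        max_latency = max(max_latency, now - next_wake)
        # Observed period: distance between consecutive wakeups.
        if prev_wake is not None:
            max_period = max(max_period, now - prev_wake)
        prev_wake = now
        next_wake += PERIOD_NS
    return max_latency, max_period


lat, per = run_cycles(1000)
print(f"max wakeup latency: {lat / 1000:.3f} us, max period: {per / 1000:.3f} us")
```

A spike such as the 14 ms figures reported below would show up directly in both the max wakeup latency and the max period.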

Expected behavior
Almost the same performance, with only some degradation.

Additional context
Without GPU passthrough (100 us period):

Max wakeup latency (jitter) = 17.2 us
Max period = 114.382 us
VMEXIT_IO_INSTRUCTION = 0% (according to acrntrace; CSV attached)

With GPU passthrough (100 us period):

Max wakeup latency = 14493.32 us
Max period = 14589.257 us
VMEXIT_IO_INSTRUCTION = 7.36% (according to acrntrace; CSV attached)
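The exit-reason percentage above can be recomputed from the attached trace with a short script. Note that the column layout below (an `Event` column holding the exit-reason name) is an assumption for illustration; the exact CSV format depends on the acrntrace version and post-processing used:

```python
import csv
from collections import Counter


def exit_reason_share(path, reason="VMEXIT_IO_INSTRUCTION"):
    """Return the fraction of trace entries with the given VM-exit reason.

    Assumes a CSV with a header row and an 'Event' column naming the
    exit reason (hypothetical layout; adjust to the real acrntrace output).
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["Event"]] += 1
    total = sum(counts.values())
    return counts[reason] / total if total else 0.0
```

For example, a share of 0.0736 over the whole trace would correspond to the 7.36% figure reported above.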

We tried both i915.modeset=0 and i915.modeset=1.

The test device works with monitors at a resolution of 1024x768 or lower. If we connect a high-resolution monitor (1980x1280) over HDMI, the host OS and the entire system crash.

These latencies make it impossible to use GPU passthrough with RT systems. Are there any means to reduce the latencies, or at least to keep the system from crashing?

We thought of the following means (but haven't tested them yet):

log_acrn.csv log_wo_overrun.csv board.txt launch_user_vm_id1.txt launch_user_vm_id2.txt

SIA-77 commented 7 months ago

UPD: After we added 8 GB of RAM, no more crashes were observed. Short tests (up to 1M cycles) show excellent results (no overruns, stable cycle). We then launched the test with 1.1bn cycles. Settings:

Cycle Period: 200us
Job time: 140us
Jitter Threshold: 5us

Results:

Threshold overruns - 8707 (task started later than expected start time + 5 us)
Overruns - 2
Max wakeup latency - 15107.568 us
Max period - 15302.57 us

The difference between the expected and the actual execution time was 30048.31 us.

That means we had 2 overruns of about 15 ms each.
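The accounting behind that conclusion is a simple division of the total lost time over the reported overruns (using only the figures quoted above):

```python
total_lost_us = 30048.31  # expected vs. actual execution time, from the test results
overruns = 2              # reported number of overruns

per_overrun_us = total_lost_us / overruns
print(f"{per_overrun_us:.2f} us per overrun")  # roughly 15 ms each
```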

We suspect this is somehow related to 1 GB hugepages (remapping, moving, or something else) and is followed by a VM exit lasting a significant amount of time. The problem happens only occasionally; we think it is somehow connected to the memory management engine. More memory means a lower chance of hitting the issue, but it is still possible. Any ideas how to fix this? This behavior could be very dangerous for real-time systems.

SIA-77 commented 6 months ago

Well, the problem was solved by changing the TOLUD size. Still, it is not clear how to completely avoid such issues in the future; some clarification on that would be appreciated.