kata-containers / agent

Kata Containers version 1.x agent (for version 2.x see https://github.com/kata-containers/kata-containers). Virtual Machine agent for hardware virtualized containers
https://katacontainers.io/
Apache License 2.0
242 stars 113 forks source link

Wait explicitly for VFIO devices to complete hotplug, instead of relying on PCI rescan #850

Closed dgibson closed 3 years ago

dgibson commented 4 years ago

Currently the only thing that "ensures" that VFIO devices are actually probed in the VM before the container executes is a forced PCI rescan. That has a bunch of problems as described in issue #781 .

This PR, in conjuction with a runtime PR to come shortly avoids the rescan by instead having the agent explicitly wait for the expected VFIO devices to become present.

dgibson commented 4 years ago

I'm making the claim that no backport is needed on the basis that while this is a bug, it's apparently more or less worked up to this point.

dgibson commented 4 years ago

/test

codecov[bot] commented 4 years ago

Codecov Report

Merging #850 (bdc00e8) into master (25d7471) will increase coverage by 0.02%. The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master     #850      +/-   ##
==========================================
+ Coverage   57.72%   57.74%   +0.02%     
==========================================
  Files          19       17       -2     
  Lines        2375     2381       +6     
==========================================
+ Hits         1371     1375       +4     
  Misses        841      841              
- Partials      163      165       +2     
dgibson commented 4 years ago

It looks like my depends-on tag broke the CI for reasona I don't understand yet. Removing it.

bpradipt commented 4 years ago

@dgibson shall we wait for https://github.com/kata-containers/agent/pull/849 to merge and then you can update this PR ?

dgibson commented 4 years ago

@dgibson shall we wait for #849 to merge and then you can update this PR ?

Done :)

dgibson commented 4 years ago

/retest-ubuntu

devimc commented 4 years ago

/test

dgibson commented 4 years ago

@jodh-intel new push has some unit tests, as requested.

jodh-intel commented 4 years ago

/test

dgibson commented 4 years ago

Thanks @dgibson. We do also like negative testing (assert.Error(err) for as many scenarios as possible), so feel free to consider adding additional tests for that.

I'm a bit disinclined to do heaps here at this time, since it will all have to be ported to Rust pretty soon anyway - and I have notions for longer term cleanups that may obsolete a bunch of it. I also think quite a lot of the Kata unit tests are so tightly coupled to the code as not to actually be all that useful (they end up testing how it's done, rather than what it's doing).

jodh-intel commented 4 years ago

/retest-ubuntu

devimc commented 4 years ago

@dgibson VFIO CI is not happy - can you add: Depends-on: github.com/kata-containers/runtime#2981 in your commit message?

dgibson commented 4 years ago

/test

dgibson commented 4 years ago

Temporarily setting DNM, while I try to track down the CI failure.

dgibson commented 4 years ago

/test-ubuntu

dgibson commented 4 years ago

/retest-ubuntu

dgibson commented 4 years ago

/test-ubuntu

dgibson commented 4 years ago

/test

dgibson commented 4 years ago

/test

dgibson commented 4 years ago

Urgh. After all that looks like the failure I was chasing is already present on the main branch.

dgibson commented 4 years ago

/retest

devimc commented 4 years ago

thanks @dgibson but VFIO CI is still failing :(

01:05:25 + sudo kata-runtime --kata-config /tmp/tmp.dwr2HumN0S/configuration.toml run --detach -b /tmp/tmp.dwr2HumN0S/bundle --pid-file=/tmp/tmp.dwr2HumN0S/pid vfiotest
01:05:34 rpc error: code = Unknown desc = PCI Identifier for device should be of format [bridgeAddr/deviceAddr], got 
01:05:34 ++ handle_error 86
dgibson commented 4 years ago

thanks @dgibson but VFIO CI is still failing :(

Yeah, I basically wasted a day chasing down the wrong failure :( (the one in jenkins-ci-ubunut-18-04-initrd). Now to figure out how to reproduce and debug the one in jenkins-ubuntu-18-04-vfio.

dgibson commented 4 years ago

/retest-vfio

dgibson commented 4 years ago

/test-vfio

dgibson commented 4 years ago

/test-vfio

dgibson commented 4 years ago

Rebased on #855, which we'll need to fix the CI failure on clh.

dgibson commented 4 years ago

/test-vfio

jodh-intel commented 4 years ago

/test-ubuntu /test-vfio

dgibson commented 4 years ago

/test-vfio

dgibson commented 4 years ago

/test-vfio

dgibson commented 3 years ago

Since the CI issues are proving so hard to debug, I've decided not to pursue this in Kata1, and leave it to Kata2.