openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZTS: Fix zpool_status_008_pos false positive #16769

Closed. behlendorf closed this 1 week ago.

behlendorf commented 1 week ago

Motivation and Context

Resolve false positives reported by the CI for the zpool_status_008_pos test case.

https://github.com/openzfs/zfs/actions/runs/11848141397/job/33027753860?pr=16755
https://github.com/openzfs/zfs/actions/runs/11848141397/job/33027755132?pr=16755

Description

~When checking that the healthy vdevs[1-3] aren't shown, omit the -s flag so slow vdevs are not considered, as described by the comment. This avoids the possibility of a false positive in the CI when ZIO_SLOW_IO_MS is reduced to 160ms. Additionally, clear the fault injection as soon as it is no longer required for the test case.~

Increase the injected delay to 1000ms and the ZIO_SLOW_IO_MS threshold to 750ms to avoid false positives caused by unrelated slow IOs that may occur in the CI environment. Additionally, clear the fault injection as soon as it is no longer required for the test case.
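
For illustration, here is a minimal sketch of the adjusted flow, assuming the usual ZTS helpers (log_must, set_tunable64) and placeholder names ($TESTPOOL, $DISK); it is not the verbatim contents of zpool_status_008_pos:

```sh
# Hedged sketch; variable names and exact commands are illustrative
# assumptions, not the test's actual code.

# Lower the slow-IO threshold to 750ms (ZIO_SLOW_IO_MS defaults to 30000ms);
# on Linux this maps to /sys/module/zfs/parameters/zio_slow_io_ms.
log_must set_tunable64 ZIO_SLOW_IO_MS 750

# Inject a 1000ms delay (1 lane) on one vdev, comfortably above the threshold.
log_must zinject -d $DISK -D 1000:1 $TESTPOOL

# Generate IO so the delayed vdev accumulates slow-IO events.
log_must dd if=/dev/urandom of=/$TESTPOOL/testfile bs=1M count=8
log_must zpool sync $TESTPOOL

# The delayed vdev should now be reported by 'zpool status -s'.
log_must zpool status -s $TESTPOOL

# Clear the injection as soon as it is no longer needed, so later checks
# of the healthy vdevs cannot be tripped by stray slow IOs.
log_must zinject -c all
```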

How Has This Been Tested?

Only the test case is updated; it will be exercised by the CI.

amotin commented 1 week ago

I wonder if the actual problem with the test is that the supposedly healthy devices really don't complete within the timeout? Sure, your patch should "work", but it might just hide the valid failure case where vdevs are falsely reported as slow, which is what is being tested now. I wonder if we should increase the timeouts instead.

behlendorf commented 1 week ago

> I wonder if the actual problem with the test is that the supposedly healthy devices really don't complete within the timeout?

That's exactly my suspicion. I'm game to try increasing the timeouts first and see if that resolves it. There's no single right value, so we'll have to pick something and see how it goes. We may end up only making the failure less likely, not impossible.

Edit: updated the PR to set the slow IO threshold to 750ms and inject 1000ms delays.
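
(For scale: the injected 1000ms delay now clears the 750ms threshold with 250ms of margin, and an unrelated CI IO must itself take longer than 750ms, rather than 160ms, before it can be misreported as slow.)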

behlendorf commented 1 week ago

I've run this through the CI 3 times now without reproducing the failure, which is a good sign. Only after we merge this will we know for certain whether the larger values are enough to resolve it entirely.