openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZTS: Fix zpool_status_008_pos false positive #16769

Closed. behlendorf closed this 1 week ago.

behlendorf commented 1 week ago

Motivation and Context

Resolve false positives reported by the CI for the zpool_status_008_pos test case.

https://github.com/openzfs/zfs/actions/runs/11848141397/job/33027753860?pr=16755
https://github.com/openzfs/zfs/actions/runs/11848141397/job/33027755132?pr=16755

Description

~When checking that the healthy vdevs[1-3] aren't shown, omit the -s flag so slow vdevs are not considered, as described by the comment. This avoids the possibility of a false positive in the CI when ZIO_SLOW_IO_MS is reduced to 160ms. Additionally, clear the fault injection as soon as it is no longer required for the test case.~

Increase the injected delay to 1000ms and the ZIO_SLOW_IO_MS threshold to 750ms to avoid false positives caused by unrelated slow IOs that may occur in the CI environment. Additionally, clear the fault injection as soon as it is no longer required for the test case.
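
For illustration, here is a minimal sketch of the adjusted flow, assuming the usual ZTS helpers (log_must, set_tunable64) and placeholder names ($TESTPOOL, $DISK); it is not the verbatim contents of zpool_status_008_pos:

```sh
# Hedged sketch; variable names and exact commands are illustrative
# assumptions, not the test's actual code.

# Lower the slow-IO threshold to 750ms (ZIO_SLOW_IO_MS defaults to 30000ms);
# on Linux this maps to /sys/module/zfs/parameters/zio_slow_io_ms.
log_must set_tunable64 ZIO_SLOW_IO_MS 750

# Inject a 1000ms delay (1 lane) on one vdev, comfortably above the threshold.
log_must zinject -d $DISK -D 1000:1 $TESTPOOL

# Generate IO so the delayed vdev accumulates slow-IO events.
log_must dd if=/dev/urandom of=/$TESTPOOL/testfile bs=1M count=8
log_must zpool sync $TESTPOOL

# The delayed vdev should now be reported by 'zpool status -s'.
log_must zpool status -s $TESTPOOL

# Clear the injection as soon as it is no longer needed, so later checks
# of the healthy vdevs cannot be tripped by stray slow IOs.
log_must zinject -c all
```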

How Has This Been Tested?

Only the test case is updated; it will be exercised by the CI.

amotin commented 1 week ago

I wonder if the actual problem with the test is that the supposedly healthy devices really don't complete within the timeout? Sure, your patch should "work", but it might just hide the valid failure case where vdevs are falsely reported as slow, which is what is being tested now. I wonder if we should increase the timeouts instead.

behlendorf commented 1 week ago

> I wonder if the actual problem with the test is that the supposedly healthy devices really don't complete within the timeout?

That's exactly my suspicion. I'm game to try increasing the timeouts first and see if that resolves it. There's no single right value, so we'll have to pick something and see how it goes. We may end up only making the failure less likely, not impossible.

Edit: updated the PR to set the slow IO threshold to 750ms and inject 1000ms delays.
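
(For scale: the injected 1000ms delay now clears the 750ms threshold with 250ms of margin, and an unrelated CI IO must itself take longer than 750ms, rather than 160ms, before it can be misreported as slow.)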

behlendorf commented 1 week ago

I've run this through the CI 3 times now without reproducing the failure, which is a good sign. Only after we merge this will we know for certain whether the larger values are enough to resolve it entirely.