intel / pmem-csi

Persistent Memory Container Storage Interface Driver
Apache License 2.0

use error injection to verify operation in case of media issues #799

Open okartau opened 3 years ago

okartau commented 3 years ago

We have not had a systematic approach to verifying and improving pmem-csi behavior in the presence of media errors. Because we deal with file system formatting and mounting, media-level errors are typically serious. Without a human operator interpreting the error messages, the pmem-csi plugin can't do much to fix the problem and continue, so the likely outcome is "fail and stop". But we should at least try to make sure that the result is not too ugly, such as crashing without a helpful message or looping forever.

So far, error handling has mostly been based on scenarios detected through testing and use. To improve coverage, we could use artificially generated errors. There is ndctl-inject-error, but it seems to be HW-specific. We could also investigate whether we can simply corrupt the media in the emulated case and see what happens on the next run. It is also worth investigating whether coverage in emulated cases is as good (i.e. bad) as HW-based corruption can be. I am not convinced that we should add such testing (probably quite slow) to the CI cycle, which already takes a long time. But having some tools to run such tests outside of CI (also on HW) would be helpful.

pohly commented 3 years ago

The relevant fault injection for PMEM-CSI is making "mount" or "mkfs" fail. If file access fails at application runtime, then there isn't much that PMEM-CSI can do about it. We don't even get to know about it.