utsaslab / crashmonkey

CrashMonkey: tools for testing file-system reliability (OSDI 18)
Apache License 2.0

Create subset permuter and ability to work with bio sectors #96

Closed ashmrtn closed 6 years ago

ashmrtn commented 6 years ago

The subset permuter works with subsets of bios or sectors (irrespective of order) instead of unique permutations of bios or sectors. This should reduce the number of crash states that need to be explored, because two states that contain the same bios or sectors from the epoch in which the crash occurred should be equivalent, as long as those bios or sectors don't overlap.
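
To make the reduction concrete, here is a minimal Python sketch (not CrashMonkey's C++ code) that counts the crash states for a single epoch both ways. The names `ordered_states` and `subset_states` and the four-bio epoch are hypothetical, and the sketch assumes that none of the writes overlap.

```python
from itertools import combinations, permutations

def ordered_states(writes):
    """Every ordered selection of writes: (a, b) and (b, a) count separately."""
    return [p for k in range(len(writes) + 1) for p in permutations(writes, k)]

def subset_states(writes):
    """Every unordered selection of writes: {a, b} is counted once."""
    return [frozenset(c) for k in range(len(writes) + 1) for c in combinations(writes, k)]

# Hypothetical epoch holding four non-overlapping writes.
epoch = ["bio0", "bio1", "bio2", "bio3"]
print(len(ordered_states(epoch)))  # 65
print(len(subset_states(epoch)))   # 16
```

The gap widens quickly as the epoch grows, since the number of subsets is 2^n while the number of ordered selections grows factorially.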

Additionally, implement a new mode that allows CrashMonkey to work with sectors of the bios instead of complete bios. When working with sectors, the -S <size> flag can be used to control the size of the sectors the bios are split into. To go back to working with full bios only, give the -F flag.
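
As a rough illustration of the sector mode, the Python sketch below splits one write into fixed-size pieces. It is not the code in this patch: `split_into_sectors`, the byte offsets, and the 4096-byte bio are all hypothetical, and 512 is used only as an example sector size.

```python
def split_into_sectors(start_byte, data, sector_size=512):
    """Split one write (bio) into sector-sized pieces.

    Returns a list of (byte_offset, chunk) pairs. The last chunk may be
    shorter if len(data) is not a multiple of sector_size.
    """
    if sector_size <= 0:
        raise ValueError("sector_size must be positive")
    return [
        (start_byte + off, data[off:off + sector_size])
        for off in range(0, len(data), sector_size)
    ]

# Hypothetical 4096-byte bio that starts at byte offset 8192 on the device.
bio = bytes(4096)
pieces = split_into_sectors(8192, bio, sector_size=512)
print(len(pieces))   # 8
print(pieces[1][0])  # 8704
```

Each piece can then be treated as an independent unit by the permuter, which is what makes many more crash states reachable than with whole bios.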

vijay03 commented 6 years ago

Hi Ashlie, have you used this mode to run any of the standalone tests we currently have? Do those work?

ashmrtn commented 6 years ago

I quickly tested with the rename_root_to_sub test that we have and it reported no inconsistency for that with several file systems. I will work on running it on other tests tomorrow.

ashmrtn commented 6 years ago

Alright, so it looks like this patch fixed the time-travel problem we saw with the old permuter. However, it needs a little tweaking, as it seems to be rather slow right now. I'll push more commits to try to speed it up some.

ashmrtn commented 6 years ago

@vijay03 are the following timing statistics acceptable?

For the revised version of the code, it takes ~5 minutes to run 10k tests on generic_042_fzero_keep_size with xfs on a 100 MB RAM disk:

permute time: 41036 ms
snapshot restore time: 0 ms
bio write time: 24487 ms
fsck time: 0 ms
test case time: 96 ms
mount/umount time: 157559 ms
total time: 270458 ms

I am not sure where the "missing" time is as I thought I had accounted for all the major operations.
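
For reference, the size of that gap can be worked out from the numbers above. This is a quick Python check that uses only the reported figures; the dictionary keys are just labels copied from the output.

```python
# Reported timings for the 10k-test run, in milliseconds.
timings_ms = {
    "permute": 41036,
    "snapshot restore": 0,
    "bio write": 24487,
    "fsck": 0,
    "test case": 96,
    "mount/umount": 157559,
}
total_ms = 270458

accounted_ms = sum(timings_ms.values())
missing_ms = total_ms - accounted_ms

print(accounted_ms)                            # 223178
print(missing_ms)                              # 47280
print(round(100 * missing_ms / total_ms, 1))   # 17.5
```

So roughly 47 seconds, or about 17.5% of the total, is not covered by any of the individual timers.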

Running with full bios, as it used to, runs only 27 tests, and the timing is as follows:

permute time: 4 ms
snapshot restore time: 0 ms
bio write time: 24 ms
fsck time: 0 ms
test case time: 0 ms
mount/umount time: 423 ms
total time: 566 ms

For easier comparison, running 27 tests with the sector permuter gives the following timing:

permute time: 97 ms
snapshot restore time: 0 ms
bio write time: 70 ms
fsck time: 0 ms
test case time: 1 ms
mount/umount time: 450 ms
total time: 756 ms

vijay03 commented 6 years ago

Trying to understand this a bit better, what is permute time and bio write time? Are we doing more mount and umounts because we are testing more crash states?

ashmrtn commented 6 years ago

ah, yes, sorry. Permute time is the time spent generating unique crash states (so basically time spent running the subset permuter and the bit of code in Permuter.cpp that makes sure the subset permuter hasn't returned that exact crash state before).
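
The "hasn't returned that exact crash state before" check can be pictured with the Python sketch below. It is only an illustration of the idea, not what Permuter.cpp actually does: `unique_crash_states` is a hypothetical name, and picking each write with probability 1/2 is an assumption made to keep the example short.

```python
import random

def unique_crash_states(writes, wanted, rng):
    """Draw random subsets of `writes`, skipping any subset already returned."""
    max_states = 2 ** len(writes)    # number of distinct subsets that exist
    wanted = min(wanted, max_states) # cannot return more states than exist
    seen = set()
    states = []
    while len(states) < wanted:
        # Assumption: keep each write independently with probability 1/2.
        state = frozenset(w for w in writes if rng.random() < 0.5)
        if state in seen:
            continue                 # duplicate crash state: draw again
        seen.add(state)
        states.append(state)
    return states

# Three hypothetical sectors only allow 2**3 = 8 distinct states.
states = unique_crash_states(["s0", "s1", "s2"], 100, random.Random(0))
print(len(states))  # 8
```

In a sketch like this, redraws become more frequent as the set of seen states fills up, so the time spent per new state grows the closer the run gets to exhausting the space.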

Yes, we are doing more mounts/unmounts because we are testing more crash states. A rough metric of how much overhead was added can be seen by comparing the last 2 sets of timing metrics.
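
Spelled out, that comparison of the two 27-test runs looks like this. The Python snippet uses only the numbers reported above; `deltas` is a hypothetical helper.

```python
# Reported timings for the two 27-test runs, in milliseconds.
full_bio_ms = {"permute": 4, "bio write": 24, "test case": 0,
               "mount/umount": 423, "total": 566}
sector_ms = {"permute": 97, "bio write": 70, "test case": 1,
             "mount/umount": 450, "total": 756}
TESTS = 27

def deltas(base, new):
    """Per-category difference (new - base) in milliseconds."""
    return {name: new[name] - base[name] for name in base}

diff = deltas(full_bio_ms, sector_ms)
print(diff["permute"], diff["bio write"], diff["mount/umount"])  # 93 46 27
print(diff["total"])                                             # 190
print(round(100 * diff["total"] / full_bio_ms["total"], 1))      # 33.6
print(round(diff["total"] / TESTS, 1))                           # 7.0
```

That is about 190 ms extra over 27 tests, or roughly 7 ms per test, with most of the measured increase in permute time and bio write time rather than in mount/umount.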

I forgot to mention that this was done using 512 B sectors. Changing this value may change the run time to some extent, because it changes the number of sectors the subset permuter works with and therefore the number of possible crash states.
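
A small Python sketch of that relationship, assuming a single hypothetical 4096-byte bio and counting every subset of its sectors as a possible crash state (an upper bound that ignores any other constraints CrashMonkey applies):

```python
def sector_count(bio_bytes, sector_size):
    """Number of sectors a bio is split into, rounding up."""
    return -(-bio_bytes // sector_size)

def subset_count(units):
    """Number of subsets of `units` independent pieces."""
    return 2 ** units

# Hypothetical 4096-byte bio split at different sector sizes.
for size in (512, 1024, 2048, 4096):
    n = sector_count(4096, size)
    print(size, n, subset_count(n))
# 512 8 256
# 1024 4 16
# 2048 2 4
# 4096 1 2
```

Halving the sector size doubles the number of sectors and squares the number of subsets, so small sector sizes can grow the state space very quickly.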