microsoft / StorScore

A test framework to evaluate SSDs and HDDs
http://aka.ms/storscore
MIT License

This changeset includes a few major changes to support testing multiple files on a device under test with a different workload running on each file. Such a workload is important for evaluating streaming and open channel SSDs. #46


lauracaulfield commented 7 years ago

The test operator passes the following command line to test disk 2 using a multi-target workload:

$> StorScore.cmd --target=2 --recipe=multitarget.rcp

As with previous versions of StorScore, the command line defines the device under test, and the recipe file defines the workload. New with these changes, the recipe also defines the number of files to create and test. For example:

multitarget.rcp:

test(
    description  => "Multitarget Sample",
    target_count => 2,
    xml_profile  => "sample.xml",
    initialize   => 0,
    warmup_time  => 5,
    run_time     => 60,
);

Unlike single-target tests, multi-target tests define the workload in a separate XML profile template. StorScore checks that the recipe points to a profile file whenever the target count is greater than 1.
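For illustration, a minimal two-target template might look like the sketch below. This is a hypothetical sample.xml, not the one shipped with this change; the element names follow diskspd's -X profile schema, and the two workloads match the example results further down. Note that the targets carry no path and the time span no duration, since StorScore injects those, as described next.

<Profile>
  <TimeSpans>
    <TimeSpan>
      <Targets>
        <Target>
          <BlockSize>4096</BlockSize>
          <Random>4096</Random>
          <WriteRatio>0</WriteRatio>
        </Target>
        <Target>
          <BlockSize>8192</BlockSize>
          <Random>8192</Random>
          <WriteRatio>100</WriteRatio>
        </Target>
      </Targets>
    </TimeSpan>
  </TimeSpans>
</Profile>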

During each test, StorScore wipes the disk, creates a partition and file system, and then divides the disk space evenly among the number of targets given by the recipe. It injects a few variables into the XML template, such as file names and run time, exports the finished XML document to the results directory, and then points diskspd at this profile (Diskspd.exe -Xmodified-sample.xml).
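A rough sketch of that injection step, assuming StorScore inserts <Path> and <Duration> elements into the unnamed targets (the approach and names here are my guess, not StorScore's actual code):

use File::Slurp qw( read_file write_file );

# Hypothetical sketch: give each unnamed <Target> a file path, set the
# run time, save the finished profile with the results, run diskspd.
my @target_files = ( 'E:\file1.dat', 'E:\file2.dat' );
my $run_time     = 60;          # seconds, from the recipe
my $results_dir  = 'results';

my $xml = read_file( 'sample.xml' );

my $i = 0;
$xml =~ s{<Target>}{ '<Target><Path>' . $target_files[$i++] . '</Path>' }ge;
$xml =~ s{<TimeSpan>}{<TimeSpan><Duration>$run_time</Duration>}g;

write_file( "$results_dir\\modified-sample.xml", $xml );
system( "Diskspd.exe -X$results_dir\\modified-sample.xml" );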

NB: In the future, I would like to generate the XML document from scratch to reduce user error and frustration. The XML profiles can be tedious to generate, and they can easily get out of sync with the recipe.

These changes also update the parsed data. Previously, the Excel file printed one row for each test:

Disk  | Description   | W Mix | IO Size | MB/s   | Avg Latency
PM953 | 4k Rand Read  | 0     | 4k      | 502.45 | 0.092
PM953 | 4k Rand Write | 100   | 4k      | 235.45 | 1.201
PM953 | 8k Rand Read  | 0     | 8k      | 252.45 | 0.181

Now each row shows either a per-target workload and its results, or the aggregate workload and results:

Disk  | Description   | Target       | W Mix | IO Size | MB/s   | Avg Latency
PM953 | 4k Rand Read  | Total        | 0     | 4k      | 502.45 | 0.092
PM953 | Example Multi | Total        | 50    |         | 235.45 | 1.201
PM953 | Example Multi | E:\file1.dat | 0     | 4k      | 285.12 | 0.181
PM953 | Example Multi | E:\file2.dat | 100   | 8k      | 102.48 | 1.589

The outlier and score calculations are based only on the measurements listed in the aggregate ("Total") rows.

To support this output format, I added a level to the parser's main data structure:

Before:

$stats = {
    "FW version" => "10010L00",
    ...
    "Read Mix"   => 100,
    "QD"         => 1,
    ...
    "MB/s Read"  => 1175,
    "IOs Read"   => 20398,
    ...
};

After:

$stats = {
    "FW version" => "10010L00",
    ...
    "Workloads" => {
        "Total"        => { "Read Mix" => 50,  "QD" => 129, ... },
        "E:\file1.dat" => { "Read Mix" => 100, "QD" => 1,   ... },
        "E:\file2.dat" => { "Read Mix" => 0,   "QD" => 128, ... },
    },
    "Measurements" => {
        "Total"        => { "MB/s Read" => 285.12, "IOs Read" => 10128, ... },
        "E:\file1.dat" => { "MB/s Read" => 285.12, "IOs Read" => 10128, ... },
        "E:\file2.dat" => { "MB/s Read" => 0,      "IOs Read" => 0,     ... },
    },
};
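Downstream code that used to read, say, the "MB/s Read" key directly now selects a workload key first. A minimal sketch of a consumer, assuming the structure above (the variable names are mine, not the parser's):

# Hypothetical: score on the aggregate entry only, but keep the
# per-target numbers available for the per-row Excel output.
my $score_input = $stats->{'Measurements'}{'Total'}{'MB/s Read'};

foreach my $target ( sort keys %{ $stats->{'Measurements'} } )
{
    next if $target eq 'Total';    # aggregate handled above

    printf "%-16s %10.2f MB/s read\n",
        $target,
        $stats->{'Measurements'}{$target}{'MB/s Read'};
}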

msftclas commented 7 years ago

Hi @lauracaulfield, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution!

It looks like you're a Microsoft contributor. If you're full-time or an intern, we DON'T require a Contribution License Agreement. If you are a vendor, please DO sign the electronic Contribution License Agreement. It will take 2 minutes and there's no faxing! https://cla.microsoft.com.

TTYL, MSBOT;

marksantaniello commented 7 years ago

Does this enable us to test multiple files on a single volume, or multiple volumes?

The terminology is starting to get confusing. If the command line specifies --target=2 (a single target, disk number 2) how can the recipe then specify multiple targets?

I looked at one of the XML files and it doesn't seem to specify a file name, just a list of unnamed targets. I guess this works when you want StorScore.cmd to test multiple files on a single drive?

Can you test what happens when you use a recipe like this and target an existing volume or file (--target=D:\ or --target=D:\myfile)? Both of those use cases are supported today.

Is it possible for the targets specified in XML to conflict with the command line --target? What if the recipe "target_count" property doesn't match the XML file? Why not parse the XML and just count the targets, instead of requiring the user to do this themselves? Seems like an easy way to avoid a whole class of bugs.

Today we have some guard rails to prevent a user from accidentally destroying the wrong disk. Does that all still work with this?

lauracaulfield commented 7 years ago

That's really odd... GitHub lets me edit your comment, but doesn't let me reply directly to it. Hmmm... what embarrassing thing can I make Mark say on the internet... j/k :-)

This is a big feature to add to StorScore, and I'd like to work out better terminology. But it's a useful feature for evaluating new functionality ("streaming") that's coming to an SSD near you. I think I need to do a better job of explaining why StorScore should be able to test that feature.

The top-level goals are the same as always:

  1. apply a set of tests ("recipe") to a single drive with one click and a large amount of computer time
  2. process that big pile of data and compare how different drives did on the same set of tests (with varying weights)

The technology change that's motivating the change to StorScore is called streaming (there's some information about it here: https://www.usenix.org/system/files/conference/hotstorage14/hotstorage14-paper-kang.pdf). Basically, the streaming interface allows the host and drive to communicate about which data should be grouped together on the SSD. This is really helpful in preventing GC when you're, say, writing each of 4 files sequentially and simultaneously. In a normal SSD, the 4 streams mix together and create a high WAF. In a streaming drive, the controller places the data from each stream in its own set of blocks. Each stream can write at its own rate, the host's writes will invalidate all the data in a block at roughly the same time, and WAF will be 1.

Now, enter StorScore. The goal is to be able to define a streaming test (like the 4-file example above) in the recipe, and be able to parse the results and compare them across drives. In the parsed data, I would like to be able to see the performance of each stream, the aggregate performance, and the WAF (of the whole drive) to validate that the streams are implemented properly.

Now, for some specific answers. The multi-stream/target/file test is still meant to evaluate a single volume or disk -- the volume or disk specified on the command line. Volume and disk targets are supported for this type of test, but file is not. When the user targets a file with this type of test, StorScore should give a useful error message.

The XML "template" file doesn't list file names (or a few other parameters, like the test's run time) because StorScore generates those and adds them to the final profile before running the test. StorScore gets the size of the target given on the command line, divides that space evenly among the number of files given in the test, and creates the files. This way StorScore can still purge between tests when the target is a disk. Also, the user doesn't have to know about the valid data length flag or do preconditioning manually.
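A minimal sketch of that sizing step (get_target_size and create_test_file are hypothetical helpers, not StorScore functions):

# Hypothetical: split the command-line target's capacity evenly
# across the files requested by the recipe.
my $disk_size    = get_target_size( $target );        # bytes; helper assumed
my $target_count = $recipe->{'target_count'};

my $file_size = int( $disk_size / $target_count );

foreach my $n ( 1 .. $target_count )
{
    create_test_file( "E:\\file$n.dat", $file_size ); # helper assumed
}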

If the number of files (currently "target_count" in the recipe, but I'm seriously thinking about renaming it) doesn't match the number of targets in the XML file, StorScore raises an error. I prefer to generate XML rather than parse it, because I want StorScore defining the workload rather than the manually generated (read: error-prone) XML files. Ultimately, I would like StorScore to generate the entire XML file from scratch. I'm using the template idea temporarily until we get a better feel for the knobs we want in this type of test.
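That consistency check amounts to counting the <Target> elements in the template and comparing against the recipe; a rough sketch (the actual check may differ):

use File::Slurp qw( read_file );

# Hypothetical recipe-vs-template consistency check.
my $xml = read_file( $recipe->{'xml_profile'} );

my $targets_in_xml = () = $xml =~ /<Target>/g;   # count <Target> elements

die "target_count ($recipe->{'target_count'}) does not match the " .
    "$targets_in_xml targets in $recipe->{'xml_profile'}\n"
    unless $targets_in_xml == $recipe->{'target_count'};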

Regarding this: "Today we have some guard rails to prevent a user from accidentally destroying the wrong disk. Does that all still work with this?"

Are you talking about the initial message "are you sure you want to destroy..."? If so, yes. This is still functional for all types of tests.

This is a big design change, but from my perspective it fits well within our original goals. I'm surprised it didn't require more changes, but after working with it, I think that's just because StorScore has a structure that matches our broader goals -- even the ones we hadn't defined yet. I tested with a conventional test (for the StorScore changes) and a conventional set of data (for the parser changes), so I'm as confident in backward compatibility as I can be without a regression test.

marksantaniello commented 7 years ago

I guess DiskSpd's XML file uses the word "target", which is adding to the confusion over terminology. Would things make more sense if we kind of ignored that and used the word "stream" in some places? So we still test one "target" with StorScore, but that "target" can potentially have many "streams"?

lauracaulfield commented 7 years ago

Stream is a fairly overloaded term I try to avoid. It might be ok here, though, since StorScore isn't using the term for anything else. Whatever term we use, I think a clean separation from diskspd's terminology will help.

Some options:

  - sub-target
  - stream
  - file
  - channel  <-- also overloaded

Technically speaking, stream is the SSD-level name. You may not always have a one-to-one mapping between stream and file. Even so, I think stream is still my favorite.

marksantaniello commented 7 years ago

So... one more idea I'd like to suggest to you:

If there's some tension between a "nice design" and "getting something working now", you could just fork StorScore and create StreamingStorScore for the latter, to break the dependency.

The fancier design could specify multiple streams directly in our recipe format, and delegate to the IO generator how to accomplish it. For DiskSpd or (someday, I hope) FIO, maybe you would have some code to construct a configuration input file (XML, in the case of DiskSpd). This could live in DiskSpdRunner/SqlioRunner/FioRunner because it's pretty analogous to the code that translates our recipe into the right command line flags. If the IO generator doesn't support multiple streams natively (like SQLIO) maybe you could simulate it by launching multiple child processes (similar to the idea of launching multiple precondition.exes).

By forking, you get something out today that vendors can use, etc., but avoid taking on a lot of tech debt that will get in the way later.