microsoft / diskspd

DISKSPD is a storage load generator / performance test tool from the Windows/Windows Server and Cloud Server Infrastructure Engineering teams
MIT License

[Question / Improvement] Is it "bad" to store the gold image VHDX on the test workload CSVs? #218

Open eponerine opened 6 days ago

eponerine commented 6 days ago

TL;DR - Are there performance ramifications for leveraging ReFS Block Cloning for the VM/VHDX deployments? Will those reads all referencing the same blocks skew results in a negative way?

=====

For the sake of keeping my question simple, let's assume Single Node S2D with a single workload CSV.

If you place your GOLD.VHDX on the C:\ClusterStorage\collect volume, it takes an eternity to spawn and deploy all your VMFleet VMs, especially if your VHDX is 100+ GB. It has to be copied n times to C:\ClusterStorage\nodeName.

However, if you place your GOLD.VHDX on C:\ClusterStorage\nodeName, the CSV you're ultimately testing, those VMs will deploy dramatically faster due to the magic of ReFS block cloning.

I feel like this is a bad idea though, because each block essentially exists only once on the physical disks, with every clone pointing back to it. The n VMs are then potentially not "spreading the read load" out to more physical disks?
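To make the concern concrete, here is a toy model of where reads land when every VM reads its whole OS image once. It is only a sketch: the `(owner, index)` block identity and the function name are made up for illustration, and it says nothing about how S2D actually places or caches those blocks.

```python
from collections import Counter

def physical_read_counts(num_vms, blocks_per_image, cloned):
    """Count reads per physical block when each VM reads its OS image once.

    Toy assumption: a physical block is identified by (owner, index).
    With cloning, every VM's logical block i points at the gold image's
    block i; with full copies, each VM owns its own block i.
    """
    counts = Counter()
    for vm in range(num_vms):
        for i in range(blocks_per_image):
            owner = "gold" if cloned else f"vm{vm}"
            counts[(owner, i)] += 1
    return counts

cloned = physical_read_counts(num_vms=4, blocks_per_image=3, cloned=True)
copied = physical_read_counts(num_vms=4, blocks_per_image=3, cloned=False)

# (distinct physical blocks touched, max reads on any one block)
print(len(cloned), max(cloned.values()))  # 3 4
print(len(copied), max(copied.values()))  # 12 1
```

With cloning, the same few physical blocks absorb all n VMs' reads. Whether that hurts or helps (shared blocks are also easier to cache) is exactly the question.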

dl2n commented 6 days ago

Yes, spreading the gold image to the CSV would speed up the time to VM first light. It would reopen problems the copy solves, but those problems may be re-emerging with Windows Server 2025 in a way we need to control for - thanks, this is a useful thread.

Maybe over-summarizing it, there are three buckets of blocks in those VHDX.

  1. blocks backing the OS image which are ~readonly/static
  2. blocks written by the running OS image (logs, early lifetime self-initialization, etc.)
  3. blocks r/w by the VM Fleet/DISKSPD load

The first set I'd say - that could make all sorts of sense. It would accelerate actual deduplication (new in Windows Server 2025) via the copyfile cloning. VM Fleet does not currently control for this, which is a gap that may need addressing.

The second set is harder. There's a major chunk of this already, depending on how "old" your gold image is. If it was only run a short time while setting the admin password in the actual "gold" from-the-build image, there's actually quite a lot of early lifetime work still to happen, including initial JIT of .NET assemblies. My own personal gold images suffer a bit from this, and I need to be careful to let that JITting start/complete before I do anything with my fleets. It's also costly CPU-wise (esp. on 1 VCPU VMs). Beyond that is the normal activity of the OS services, which we try to pass off as "small" but is very real. If those blocks start out in cloned extents, the overhead of reallocating them on write would add on top and could become noticeable early on.

The third set is quite tricky. Measure-FleetCoreWorkload internally takes care of seasoning the load files used in the VM, but "core" VM Fleet leaves this to the analyst's control. If we copied the VHDX on the same volume, all the blocks are cloned at time zero and the overhead of splitting - on write - would confound early analysis of system behavior. It could be exactly what we want in some cases! But it would be there. On read, yes, the sharing could focus load in strange ways until the writes cause things to split (hopefully your loads have writes, and run in a consistent order ... all this kind of complexity would factor in).
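The split-on-write behavior can be pictured with a small toy model. This is a sketch only: `ClonedFile` is a hypothetical class, not a ReFS or VM Fleet API, and it just counts which writes would have to allocate a private block.

```python
class ClonedFile:
    """Toy copy-on-write file: blocks start shared and split on first write."""

    def __init__(self, num_blocks):
        # True = block still points at the gold image's physical block
        self.shared = [True] * num_blocks
        # Number of writes that had to allocate a private block
        self.splits = 0

    def write(self, block):
        # Only the first write to a shared block pays the split cost.
        if self.shared[block]:
            self.shared[block] = False
            self.splits += 1

    def shared_fraction(self):
        return sum(self.shared) / len(self.shared)

f = ClonedFile(num_blocks=8)
for b in [0, 1, 1, 2, 0, 3]:
    f.write(b)

print(f.splits, f.shared_fraction())  # 4 0.5
```

Six writes produce four splits here; the repeat writes to blocks 0 and 1 are "free". Early in a run almost every write is a first write, which is where the extra overhead would concentrate.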

Copying (volume to volume, i.e. collect to CSV) guarantees nothing is cloned/dedup'd at time zero and everything has been written through the cache at least once. This drastically reduces the chance of early lifetime errors.

Now, admittedly - those errors could be mitigated by a specific time zero process of seasoning the initial load files in the VMs; i.e., light all the VMs up and rewrite the load files 2x with random data. This might even be a good idea. We'd still have the second set of blocks (OS live working set) but that might just be acceptable. And a specific seasoning capability would be good methodology to lay in.
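As a rough sketch of what such a seasoning pass could look like inside one VM (this is not something VM Fleet ships; `season_file` and the example path are hypothetical, and it goes through ordinary buffered file I/O rather than DISKSPD):

```python
import os

def season_file(path, passes=2, chunk_size=1024 * 1024):
    """Overwrite an existing file in place with random data, `passes` times.

    Returns the total number of bytes written. The file size is unchanged.
    """
    size = os.path.getsize(path)
    written = 0
    for _ in range(passes):
        # "r+b" opens for in-place update without truncating the file.
        with open(path, "r+b") as f:
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(os.urandom(n))
                remaining -= n
                written += n
            # Push this pass out of Python's and the OS's buffers.
            f.flush()
            os.fsync(f.fileno())
    return written

# Hypothetical usage inside a fleet VM:
# season_file(r"C:\run\testfile.dat", passes=2)
```

Random data is used so the rewritten blocks cannot be trivially collapsed back together, and overwriting in place means every block of the load file has been written at least once before measurement starts.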

Advocate to your favorite MVP PM :-) But for now, I'd roughly assume that the copy time might only be somewhat greater than the time it would take to reseason the fleet, and we don't have to think a lot about sets 1 and 2 above.
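That trade-off is easy to estimate on the back of an envelope. The sketch below assumes serial work at a fixed throughput, and every input is a placeholder to be replaced with your own numbers, not a measurement; it also ignores VM boot time and the (small) cost of the clone itself.

```python
def copy_time_s(num_vms, vhdx_gb, copy_gb_per_s):
    """Seconds to copy the full gold VHDX once per VM at a fixed rate."""
    return num_vms * vhdx_gb / copy_gb_per_s

def reseason_time_s(num_vms, load_file_gb, passes, write_gb_per_s):
    """Seconds to rewrite each VM's load file `passes` times at a fixed rate."""
    return num_vms * load_file_gb * passes / write_gb_per_s

# Placeholder inputs only: 20 VMs, 100 GB VHDX, 10 GB load file, 2 passes.
print(copy_time_s(20, 100, 2.0))        # 1000.0
print(reseason_time_s(20, 10, 2, 1.0))  # 400.0
```

How the two compare depends entirely on VHDX size, load file size, and the throughput you actually get, so plug in values from your own cluster before drawing conclusions.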