Improve "apply-load" to cover more network settings and scenarios

MonsieurNicolas commented 1 month ago

Tagging as a discussion for now.

Right now it's fairly difficult to validate proposed ledger wide limits, yet having a good understanding of their impact can really help with

validating that proposed settings are safe to use on some "minimum spec" hardware
understanding ledger wide settings in the context of future changes (scheduling)
validating that transaction level limits are "sane" under different scheduling assumptions (ie: reason over "phases" instead of "ledger")
validate settings under various ledger conditions (such as predicted ledger size when adjusting the "target size")

Here is a high level way we could iterate on bench-marking capability in this context:

we should document how to build a "reference min-spec node" that can be used to run those benchmarks. Ideally we would be able to control certain things like the CPU and IOPS capabilities.
- I think this can be achieved with docker with options like --device-read-iops=1000 --device-write-iops=1000 for iops; and things like --cpuset-cpus=0-3 --cpu-period=100000 --cpu-quota=40000 for CPU (as to throttle CPU performance to the desired number -- in my example 10% of the host performance)
contract invoked by apply-load (do_work at the moment) should get a new method that performs more interesting work.
- Arguments are: seed: int (seed for prng), other args related to the amount of work (amount of cpu work, number of reads, number of writes, bytes per read or write, bytes of events to generate, key padding size).
- ensure meta generation is enabled with apply-load (METADATA_OUTPUT_STREAM or METADATA_DEBUG_LEDGERS enabled) as that method emits events
- data keys have an ID that looks like contractID | zeros[padding_size] | counter (so that they can be sorted by counter)
- the contract internally will read entries, then write entries, generating a pseudo random key for each (ie: counter is generated by a prng seeded with seed)
ApplyLoad
- uses the same prng code which allows it to specify read and write footprints
- ledger state is generated by adding contract data that will correspond to all IDs reached during a run.
  - generation can just be a loop counter:0..N that picks a random bucket and simply appends to that bucket (as keys are sorted). Goal is to create a fairly realistic distribution (in terms of bucket size) -- an exponential distribution is probably good enough.
- we may want to change the way ApplyLoad works so that ledger state always starts from the generated snapshot
  - basically, ApplyLoad::benchmark() should always apply transactions over the same state (so same ledger number), by resetting its state after applying a ledger
  - this allows to get rid of the churn on low level buckets that introduce noise across benchmark runs (both because of the spilling of levels, and amount of work required sync/async for those spills to occur)
  - at a minimum we should do something similar to what CATCHUP_WAIT_MERGES_TX_APPLY_FOR_TESTING does when running "catchup" so that benchmark step boundaries are clean

dmkozh commented 4 weeks ago

we should document how to build a "reference min-spec node" that can be used to run those benchmarks. I

That's a good idea, but I'm not sure which spec this should be. I'm not sure we should hinder ourselves too much, given that validators should have much more reasonable specs.

should get a new method that performs more interesting work

There are some issues with this contract, but I'm not sure if they're about not 'interesting' enough work. I don't think randomization on the contract side is necessary. I'm also not even sure if we should modify any entries on the contract side at all, it might be better to just pass them through (at least for now; eventually we might start filtering out no-op writes). Basically all the IO happens based on the footprint and it's not very useful to do anything about the ledger entries on the host side. A big issue about this benchmark so far has been variance, I would work on eliminating it as much as possible before trying anything fancy.

ledger state is generated by adding contract data that will correspond to all IDs reached during a run

Yeah, I've been thinking about BL emulation as a next step. Appending directly to the buckets is an interesting idea, I didn't know that's possible. I've no clue as to how much that will change the results though.

basically, ApplyLoad::benchmark() should always apply transactions over the same state (so same ledger number), by resetting its state after applying a ledger

That's a great idea, thanks.

MonsieurNicolas commented 4 weeks ago

That's a good idea, but I'm not sure which spec this should be. I'm not sure we should hinder ourselves too much, given that validators should have much more reasonable specs.

I think the issue is that people's laptops are likely much faster at a lot of things than the typical validator in AWS and in general we need a way for everyone to normalize numbers so that we can have sane conversations.

The actual docker args cannot be hardcoded (at least CPU seems to be relative to the host CPU), but we can and should document the absolute numbers in terms of IOPS and core performance (probably look at some raw number easy to benchmark like X sha256/second so that people can easily tweak their setup to get close enough to that performance) . Note that we may already have some instructions/second "basic validator" number as we need something like that in the context of calibration. IOPS -> at some point we'll shoot for 1M IOPS, but for now conservative number should be more like 100k IOPS.

Basically all the IO happens based on the footprint and it's not very useful to do anything about the ledger entries on the host side. A big issue about this benchmark so far has been variance, I would work on eliminating it as much as possible before trying anything fancy.

yeah I don't think you actually need to modify entries within the contract, but I am pretty sure you need to call the put host function (how entries end up in the "modified" list) -- otherwise we'd allow contracts to rewrite entries they don't own.

dmkozh commented 3 weeks ago

but I am pretty sure you need to call the put host function (how entries end up in the "modified" list)

Nope, currently we actually overwrite all the RW entries unconditionally, the security guarantee comes from the host not allowing contracts to modify entries they don't own. We might start filtering out no-op changes eventually, but I'm not sure if that's a useful optimization for the real traffic.

sisuresh commented 4 days ago

@MonsieurNicolas you mentioned using --cpu-period=100000 --cpu-quota=40000 to limit resources as an example. If the docker image is being run on one of our dev machines (same hardware as our validators) is there any reason to limit CPU? I don't think so. I couldn't get the iops limiting args to work on my laptop, which is why I'm using a dev machine.

MonsieurNicolas commented 3 days ago

Reason for doing something like throttling is that we want to be able to both normalize tests (so that people can try things out on their laptop or any hosted environment) and to get a reasonable estimate of performance for both "current proposed" and "future safe" on baseline hardware.

CPU line performance and IOPS are both problematic: if you test on hardware that has better capabilities than baseline, it will give a false sense of performance, so we need to mitigate for this.

I think that throttling is not the only acceptable solution.

For example if you can estimate the ratio, it may be possible to add artificial work. If you know that your machine is 3x faster than the baseline, you can scale the amount of work that transactions perform by 3x (or add special "padding" synthetic work at a few place for both CPU and IO). I think we need to come up with simple synthetic benchmarks for CPU and IO that allow to compute the ratio to baseline. A IO benchmark test could be as simple as generating (deterministically) a large bucket list (in the order of GB) and count the number of random ledger entries that can be read over a fixed period of time like 5 minutes without any cache enabled (in application and disk access).

stellar / stellar-core

Improve "apply-load" to cover more network settings and scenarios #4520