google / fleetbench

Benchmarking suite for Google workloads
Apache License 2.0
116 stars 10 forks

Fidelity about fleetbench #11

Closed Jerry-Tianchen closed 11 months ago

Jerry-Tianchen commented 1 year ago

Hi:

We are using fleetbench for a research topic. May I get some insight into how the fleetbench protocol-buffer benchmark is designed? For example:

  1. Why is kIteration set to 10?
  2. In ProtoLifecycle::Run(), is there a reason the logic is structured this way? For example: after a message m493messages[0] is deserialized (as on line 196), its deserialized data is not used immediately. Instead, the logic keeps deserializing the other 9 messages, m493messages[1-9]. Even after all 10 messages are deserialized, the logic moves on to a copy of m1messages. By the time the CPU tries to use the data from m493strings[0], that data might no longer be in the cache. In a real protocol-buffer application, I would imagine a message is deserialized and its data is then used immediately for further work (e.g. modified and then serialized into another message), so most of the data is still in cache. Because of that, I am concerned that the cache behavior of the fleetbench protocol-buffer benchmark might differ substantially from a real protocol-buffer-based application, and thus I am worried about its fidelity. Please correct me if you think I am wrong.

We are interested in the fidelity of the fleetbench protocol-buffer benchmark compared to a real protocol-buffer-based application such as an RPC server.

Thank you~ Jerry

rjogrady commented 1 year ago

Hi Jerry, Great questions. Sorry for the late reply.

In short, the higher kIteration is, the more dcache misses and the fewer icache misses we see. The number 10 is a little arbitrary, but it was picked to induce what we found to be a realistic number of misses for both. Sorry I can't be more specific just now.

As you say, typically an application will do some processing on each protobuf message in turn, but it is not solely concerned with proto API operations. The working set size of 10 tries to simulate that, so that the cache metrics look more like a normal application's while the benchmark still spends most of its CPU cycles in proto operations. It's a bit of a balancing act.

We are actively working on improving our fidelity in the proto benchmark, though, as well as the documentation. We'll try to indicate the rationale here more clearly.