fhoering opened this issue 8 months ago
Hi Fabian Höring,
Thanks for the question, and sorry for the delay. We're currently working on getting back to you.
Thank you, Alexander
Hi, sorry for the long delay.
scripts/run-benchmarks --target //src/roma/benchmark:kv_server_udf_benchmark_test --benchmark_time_unit ms
Sample output -
Benchmark | Time | CPU | Iterations | UserCounters... |
---|---|---|---|---|
BM_LoadHelloWorld/0 | 11.2 ms | 0.047 ms | 1000 | bytes_per_second=596.362Ki/s |
BM_LoadHelloWorld/128 | 11.1 ms | 0.043 ms | 1000 | bytes_per_second=3.60729Mi/s |
BM_LoadHelloWorld/512 | 11.2 ms | 0.048 ms | 1000 | bytes_per_second=10.9137Mi/s |
BM_LoadHelloWorld/1024 | 11.2 ms | 0.045 ms | 1000 | bytes_per_second=22.4478Mi/s |
BM_LoadHelloWorld/10000 | 11.4 ms | 0.074 ms | 1000 | bytes_per_second=129.585Mi/s |
BM_LoadHelloWorld/20000 | 11.5 ms | 0.085 ms | 1000 | bytes_per_second=223.786Mi/s |
BM_LoadHelloWorld/50000 | 11.9 ms | 0.147 ms | 1000 | bytes_per_second=325.068Mi/s |
BM_LoadHelloWorld/100000 | 12.7 ms | 0.274 ms | 1000 | bytes_per_second=348.233Mi/s |
BM_LoadHelloWorld/200000 | 14.1 ms | 0.497 ms | 1374 | bytes_per_second=384.041Mi/s |
BM_LoadHelloWorld/500000 | 18.8 ms | 1.26 ms | 493 | bytes_per_second=378.279Mi/s |
BM_ExecuteHelloWorld | 0.905 ms | 0.027 ms | 10000 | items_per_second=37.3452k/s |
BM_ExecuteHelloWorldCallback | 0.982 ms | 0.028 ms | 10000 | items_per_second=35.8002k/s |
We would like to help you achieve an accurate measurement. If you are interested in collaborating on further measurements, please let us know what we may be able to help with.
Hello @peiwenhu,
Thanks for the information. I will come back to you with more information about the workloads.
About compiling data-plane-shared-libraries locally: sorry for this question, but I have never used Google-specific tooling like Bazel before.
If I do this:
cd data-plane-shared-libraries
git checkout 89e8cf07e233779e92915fa6fbcd854f648e327c
What command do I need to execute to compile this? How do I need to change the workspace file to actually pull my local sources?
Hello!
What command do I need to execute to compile this?
To run the benchmarks, you can simply use the provided script:
scripts/run-benchmarks
(To run your own benchmarks, you will need to modify kv_server_udf_benchmark_test to include the code you want to benchmark; see the sketch below.)
If you run at HEAD of data-plane-shared-libraries, the following command can be used:
scripts/run-benchmarks --target //src/roma/benchmark:kv_server_udf_benchmark_test --benchmark_time_unit ms
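To make the modification concrete, here is a minimal Google Benchmark sketch of the kind of case one could add to kv_server_udf_benchmark_test. LoadMyUdf and ExecuteMyUdf are hypothetical placeholders, not actual Roma APIs; you would replace their bodies with the load/execute code already used in the existing benchmarks.

```cpp
#include <benchmark/benchmark.h>

#include <string>

// Hypothetical stand-ins for the Roma load/execute calls used in the existing
// benchmark; replace the bodies with the actual Roma code you want to measure.
static void LoadMyUdf(const std::string& /*js_code*/) { /* load the UDF here */ }

static std::string ExecuteMyUdf(const std::string& input) {
  /* execute the UDF here */
  return input;
}

static void BM_ExecuteMyUdf(benchmark::State& state) {
  // Load the UDF once, outside the timed loop.
  LoadMyUdf("function HandleRequest(input) { return input; }");
  const std::string input(state.range(0), 'x');  // payload of the requested size
  for (auto _ : state) {
    benchmark::DoNotOptimize(ExecuteMyUdf(input));
  }
  state.SetBytesProcessed(state.iterations() * state.range(0));
}
// Sweep the payload size, similar to the BM_LoadHelloWorld cases above.
BENCHMARK(BM_ExecuteMyUdf)->Arg(128)->Arg(1024)->Arg(100000);

// Omit this if the file you add the case to already defines a benchmark main.
BENCHMARK_MAIN();
```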
How do I need to change the workspace file to actually pull my local sources?
For running the benchmarks, I don't think you have to modify your K/V server workspace. In general, though, the local_repository Bazel rule can be used to point the workspace at a local checkout instead of the released sources.
Let us know if any more information is needed from our side.
Thanks!
Thanks. I was able to execute the benchmarks like this:
./builders/tools/bazel-debian run //scp/cc/roma/benchmark/test:kv_server_udf_benchmark_test -- --benchmark_out=/src/workspace/dist/benchmarks/kv_server_udf_benchmark.json --benchmark_out_format=json --benchmark_time_unit=ms
./builders/tools/bazel-debian run //scp/cc/roma/benchmark/test:benchmark_suite_test -- --benchmark_out=/src/workspace/dist/benchmarks/benchmark_suite_test.json --benchmark_out_format=json --benchmark_time_unit=ms
I get the same results of ~1ms for executing an empty JS function.
I also ran the multi-threaded Roma workloads, which give results of ~1000 requests per second.
An additional question on that: I never managed to get dropped requests, even with a worker count of 1 and a queue size of 1. Is this expected?
test_configuration.workers = 1;
test_configuration.inputs_type = InputsType::kSimpleString;
test_configuration.input_payload_in_byte = 500000;
test_configuration.queue_size = 1;
test_configuration.batch_size = 1;
test_configuration.request_threads = 30;
test_configuration.requests_per_thread = 1000;
These tests seem to be run from the same machine, which means the client can impact the server and vice versa. We probably have to run the full server and drive it with a web load injector like Gatling to get representative results.
An additional question on that: I never managed to get dropped requests, even with a worker count of 1 and a queue size of 1. Is this expected?
In these benchmarks we wait for each request's response individually, so this behaviour is expected.
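To illustrate what waiting for each response individually looks like, here is a rough C++ sketch of that request pattern, mirroring the configuration above; Dispatch() is a hypothetical placeholder, not the actual Roma API. Each request thread has at most one request in flight at a time, because it blocks until its response arrives before issuing the next one.

```cpp
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the asynchronous execute call used in the
// benchmark; here it completes immediately so the sketch is self-contained.
static void Dispatch(const std::string& input,
                     std::function<void(std::string)> callback) {
  callback(input);
}

// Each request thread sends one request, blocks until its response arrives,
// and only then issues the next request.
static void RequestLoop(int requests_per_thread, const std::string& payload) {
  for (int i = 0; i < requests_per_thread; ++i) {
    std::promise<std::string> response;
    Dispatch(payload, [&response](std::string result) {
      response.set_value(std::move(result));
    });
    response.get_future().wait();  // wait before sending the next request
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 30; ++t) {  // request_threads = 30
    threads.emplace_back(RequestLoop, 1000, std::string(500000, 'x'));
  }
  for (auto& t : threads) t.join();
}
```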
@peiwenhu @lx3-g @a-shruti
We have done several web load tests with Gatling (ramp-up, then constant load up to 100k QPS, over several steps of 120 seconds each).
We deployed KV server version 0.16 on our own infrastructure, in a container on one instance with 8 cores (16 threads) and 16 GB of memory, with the following components:
JavaScript UDF code:
We then implemented the same basic logic in a vanilla C# ASP.NET server and executed the same web load test for comparison.
We have also deployed and benchmarked the provided C++ => WASM sample file from here (file size: 100 KB).
We also tried out the provided Microsoft templates to compile C# to WASM (.NET 9 required, file size: 30 MB, contains the C# runtime):
dotnet workload install wasi-experimental
dotnet new wasiconsole -o Ans43
dotnet build -c Release
The specification has some explanations on how JS and WASM workloads would be handled by the UDF execution engine:
https://github.com/privacysandbox/protected-auction-services-docs/blob/main/bidding_auction_services_system_design.md#adtech-code-execution-engine
https://github.com/privacysandbox/data-plane-shared-libraries/tree/main/scp/cc/roma
This design looks interesting, and I'm trying to find out whether it would be able to handle workloads with thousands of QPS per instance and 10 ms latency. In particular, I'm wondering how it would work with managed languages like C# or Java compiled to WASM.
From what I understand, there will be N pre-allocated workers, each able to handle single-threaded workloads.
The doc mentions this part about JS:
What exactly does recreating the context imply in terms of performance?
My understanding is that compiling C#/Java to WASM works like a self-contained executable, which means the runtime needs to be embedded inside the WASM file. If the runtime and garbage collector were initialized from scratch for each request, the overhead would very probably be prohibitive for the workloads mentioned above.
Can you provide more information on exactly how JS and WASM (Java, C#) workloads would be handled by the UDF execution engine, and whether this could handle the workloads mentioned above?