h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add distributed performance test for NBHM/DKV #14684

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

It would be nice to know how many keys we can insert, delete, modify, etc. per second for a given combination of hardware, load, memory fill rate, and so on.

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: This is one of many bottom-up tests that we really ought to have. I'm sure [~accountid:557058:6790afb9-728d-4398-a865-75fa28f915df], who has been advocating for more platform tests, has lots more ideas about what else to test.

exalate-issue-sync[bot] commented 1 year ago

Bill Gallmeister commented: Navdeep, I think these issues are related to the performance/accuracy tests and harness.

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: According to [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223]:

Eric started writing some micro-benchmarks of NBHM, comparing it to ConcurrentHashMap. The tests are in the h2o-3 branch eric_test_perf and are currently single-JVM only. These benchmarks need to be made multi-system, and eventually run on empty nodes with a dedicated network. He used Maven to run the tests.

Michal ported one of these tests to our standard Gradle infrastructure in branch EXP_micro_bench.

These tests use JMH (the Java Microbenchmark Harness), a benchmarking framework from OpenJDK.
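To make the kind of measurement discussed here concrete, the following is a minimal sketch of a throughput test using only the JDK. It is not the code from either branch and it does not use JMH; the class name, method name, key-space size, and thread counts are all hypothetical. It runs against `ConcurrentHashMap`, and NBHM could presumably be swapped in, assuming it implements `ConcurrentMap`. A real benchmark would rely on JMH for warmup, forking, and dead-code protection rather than the crude equivalents below.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical class name; a plain-JDK stand-in for a JMH benchmark.
final class MapThroughputSketch {

    /**
     * Runs reader threads (get) and writer threads (put/remove) against the map
     * for roughly durationMillis and returns the combined operations per second.
     */
    static double opsPerSecond(ConcurrentMap<Integer, Integer> map, int keySpace,
                               int readers, int writers, long durationMillis) {
        LongAdder ops = new LongAdder();
        LongAdder hits = new LongAdder();   // consumes get() results so they are not dead code
        AtomicBoolean stop = new AtomicBoolean(false);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[readers + writers];

        for (int t = 0; t < threads.length; t++) {
            final boolean writer = t >= readers;
            final int seed = (t + 1) * 0x9E3779B9;  // odd multiplier keeps the seed non-zero
            threads[t] = new Thread(() -> {
                int x = seed;
                long localOps = 0, localHits = 0;
                try {
                    start.await();            // all threads begin together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                while (!stop.get()) {
                    x ^= x << 13; x ^= x >>> 17; x ^= x << 5;   // xorshift32 PRNG
                    int key = (x & 0x7fffffff) % keySpace;
                    if (!writer) {
                        if (map.get(key) != null) localHits++;
                    } else if (x < 0) {       // sign bit is not part of the key
                        map.remove(key);
                    } else {
                        map.put(key, x);
                    }
                    localOps++;
                }
                ops.add(localOps);
                hits.add(localHits);
            });
            threads[t].start();
        }

        long begin = System.nanoTime();
        start.countDown();
        try {
            Thread.sleep(durationMillis);
            stop.set(true);
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            stop.set(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        double seconds = (System.nanoTime() - begin) / 1e9;
        return ops.sum() / seconds;
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, Integer> map = new ConcurrentHashMap<>();
        opsPerSecond(map, 1 << 16, 4, 1, 500);              // warmup pass, result discarded
        double rate = opsPerSecond(map, 1 << 16, 4, 1, 1000);
        System.out.printf("4 readers + 1 writer: %.0f ops/s%n", rate);
    }
}
```

Varying `readers`, `writers`, and `keySpace` gives a rough first look at contention and resize behavior on a single JVM. The numbers are only indicative, since the timing window includes thread start-up and join overhead.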

He proposes collecting the system data using this telemetry server: grafana+graphite (https://github.com/kamon-io/docker-grafana-graphite)

Steps:

  1. Look at these two branches and understand the pieces. Be able to run the existing tests in each branch.

  2. Port Eric's other NBHM tests to EXP_micro_bench.

  3. Wire up H2O to grafana+graphite for collecting test telemetry.

  4. Get the tests running multi-machine. Use Steam for launching the clusters?

  5. Figure out how to accurately measure our performance.

  6. Answer the open questions: Do we need warmup? Does speed degrade over time? How does NBHM perform during a resize? With many readers and one writer? With many writers? Under various levels of load? Under network congestion (use the Linux equivalent of dummynet, for example tc with netem)?

  7. Possibly wire up Zipkin to measure distributed performance metrics.
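As an illustration of step 3, here is a minimal sketch of pushing one benchmark result to Graphite over its plaintext protocol, in which each sample is a line of the form `<metric path> <value> <epoch seconds>`. The class name and the metric path are hypothetical, and the host and port are assumptions: 2003 is the usual default for Carbon's plaintext listener, but the actual deployment may differ. This is not how H2O is wired to any telemetry system today.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; not part of H2O.
final class GraphiteSketch {

    /** Formats one sample as a Graphite plaintext line: "<path> <value> <epochSeconds>\n". */
    static String formatLine(String path, double value, long epochSeconds) {
        // Locale.ROOT guarantees a '.' decimal separator regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%s %.3f %d\n", path, value, epochSeconds);
    }

    /** Opens a TCP connection, writes the already formatted line, and closes the connection. */
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line);
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        long now = System.currentTimeMillis() / 1000L;
        // Hypothetical metric path and value, for illustration only.
        String line = formatLine("h2o.bench.nbhm.put.ops_per_sec", 1234.5, now);
        System.out.print(line);
        if (args.length == 2) {                      // e.g. java GraphiteSketch localhost 2003
            send(args[0], Integer.parseInt(args[1]), line);
        }
    }
}
```

Keeping formatting separate from sending makes the line format testable without a running Graphite instance. A real harness would likely batch samples over one connection instead of reconnecting per sample.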

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1720
Assignee: Surekha Jadhwani
Reporter: Arno Candel
State: In Progress
Fix Version: N/A
Attachments: N/A
Development PRs: N/A