h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O is slower than expected in its network performance (and overall perf) when the network is busy due to others (it doesn't seem to get a fair share of bw, or is running in a degraded mode) #14453

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I have a way of creating network bw from N to N nodes simultaneously using iperf. I can dial in the type (UDP or TCP) and the bandwidth that I want to sustain.
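For context, a minimal sketch of the kind of iperf invocation this refers to (iperf2 syntax; the host name, rate, and duration are illustrative, not taken from the actual setup):

***

# on the receiving node (hypothetical host)
iperf -s -u

# on each sending node: sustain ~800 Mbit/s of UDP toward the receiver for an hour
iperf -c mr-0xe2 -u -b 800M -t 3600

# the TCP variant: drop -u and -b and let TCP fill the link
iperf -c mr-0xe2 -t 3600

***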

This issue is really best targeted with multi-machine testing. Multi-JVM on a single node could stress localhost traffic alongside other traffic, but since localhost bw is not fixed (it can be large) and we tune the MTU on localhost because of another h2o gradle test problem, it would be iffy to try to get data from multi-JVM tests.

The only multi-node testing we do is h2o on hadoop, so that's how h2o is running here.

Currently it's in this jenkins job (which may be modified back to its normal behavior):

http://mr-0xb1:8080/view/MedLargeUbuntu/job/hdp2.2_hadoop2.6_nightly/

It's unclear what "other network bandwidth" we should be testing against. Today it's "0". Or we can force a predicted drop rate (anything > 0.5% drop in UDP seems to cause too big a slowdown for testing), so that's a ballpark for where h2o is sensitive to OS drop rates.

Ideally, if the available bw on the network is some multiple of what h2o needs, the increased latencies due to network busyness shouldn't degrade h2o performance too much.

I've done packet loss tests with iptables that show a small % loss in UDP causes a dramatic slowdown in h2o (this was just targeting the gradle build junit tests), i.e. the difference between 0.5% loss and 0% loss was a 50% slowdown.
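For reference, a hedged sketch of the kind of iptables rule that forces a fixed random drop rate (the 0.5% probability matches the case above; applying it on INPUT for all traffic is an assumption):

***

# drop ~0.5% of incoming packets at random, using the statistic match module
iptables -A INPUT -m statistic --mode random --probability 0.005 -j DROP

# delete the same rule when the test is done
iptables -D INPUT -m statistic --mode random --probability 0.005 -j DROP

***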

This test tried to explore what bw h2o gets (and what latencies) when it's dealing with a busy network. The network is 100% busy with low cpu needs, so we're not cpu-limited here.

I can adjust the test based on what the goals are. At the very least, this shows that h2o's network performance can get severely degraded. Understanding the full set of cases that degrade h2o, and by how much, can be done, but we need to define the goal. We currently test h2o essentially on an idle network, i.e. the only traffic is due to h2o (or other h2o clouds on the same network). We don't have any performance checking for those multi-h2o cases (or even really know whether the worst case is hit).

h2o is stuck in a GBM model build, or at least slow:

Job Type: Model
Key: GBMModel__b447a42069a021e351476978cd544711
Description: GBM
Status: RUNNING
Run Time: 00:35:12.256
Progress: 29%

Scoring the model.

Here's the h2o stdout for the network test while the network is pegged and the GBM above is running, run from 172.17.2.235:

You can see the performance is low

So there's a fairness question and a congestion question, i.e. when the network is busy, what does h2o get? And if there are drops, does h2o reduce its throughput even more, so that it stays in a low-performance mode, or...?

The network test while busy (at least it completed!):

http://172.17.2.235:54321/flow/index.html#

06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: Network Test (Launched from /172.17.2.235:54321):
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: Destination 1 bytes 1024 bytes 1048576 bytes
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: all - collective bcast/reduce 286.437 msec, 34 B/S 271.032 msec, 36.9 KB/S 12.648 sec, 809.6 KB/S
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.231:54321 7.676 msec, 260 B/S 4.704 msec, 425.2 KB/S 660.926 msec, 3.0 MB/S
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.232:54321 26.616 msec, 75 B/S 83.915 msec, 23.8 KB/S 359.092 msec, 5.6 MB/S
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.233:54321 6.392 msec, 312 B/S 2.785 msec, 718.0 KB/S 1.390 sec, 1.4 MB/S
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.234:54321 5.714 msec, 349 B/S 4.942 msec, 404.7 KB/S 1.434 sec, 1.4 MB/S
06-20 20:04:08.993 172.17.2.235:54321 10128 # Session INFO: self /172.17.2.235:54321 81 usec, 24.0 KB/S 37 usec, 52.5 MB/S 26 usec, 73.14 GB/S
06-20 20:04:09.302 172.17.2.235:54321 10128 # Session INFO: Method: GET , URI: /3/NetworkTest, route: /3/NetworkTest, parms: {}
06-20 20:04:11.899 172.17.2.235:54321 10128 FJ-0-9 INFO: 14. tree was built in 00:03:41.554 (Wall: 20-Jun 20:04:11.899)
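For reference, the test above is the one behind the /3/NetworkTest REST endpoint (visible in the "Method: GET , URI: /3/NetworkTest" line); assuming the same node, it can also be triggered outside Flow with a plain GET:

***

curl http://172.17.2.235:54321/3/NetworkTest

***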

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: here's another run of the network test, same situation

O: Totals 239293 14760707 0.4938 = 7,406,464 / 15,000,000
06-20 20:19:53.948 172.17.2.235:54321 10128 FJ-0-9 INFO: Total of 7406464 errors on 15000000 rows
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: Network Test (Launched from /172.17.2.235:54321):
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: Destination 1 bytes 1024 bytes 1048576 bytes
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: all - collective bcast/reduce 264.604 msec, 37 B/S 346.221 msec, 28.9 KB/S 12.722 sec, 804.9 KB/S
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.231:54321 6.287 msec, 318 B/S 6.533 msec, 306.1 KB/S 821.454 msec, 2.4 MB/S
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.232:54321 5.467 msec, 365 B/S 5.119 msec, 390.7 KB/S 1.274 sec, 1.6 MB/S
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.233:54321 8.721 msec, 229 B/S 213.703 msec, 9.4 KB/S 1.751 sec, 1.1 MB/S
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: remote /172.17.2.234:54321 4.518 msec, 442 B/S 4.687 msec, 426.6 KB/S 641.125 msec, 3.1 MB/S
06-20 20:22:55.055 172.17.2.235:54321 10128 # Session INFO: self /172.17.2.235:54321 25 usec, 76.2 KB/S 20 usec, 94.6 MB/S 40 usec, 47.71 GB/S
06-20 20:23:16.975 172.17.2.235:54321 10128 FJ-0-9 INFO: 19. tree was built in 00:02:49.831 (Wall: 20-Jun 20:23:16.975)
06-20 20:23:17.505 172.17.2.235:54321 10128 FJ-0-9 INFO: ==============================================================

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: here's the code I added to the jenkins nightly job, in case someone removes it or I want to add it elsewhere:

***

# either enable this section or the above test section
if [[ 1 -eq 1 ]]; then

    # kbn 6/19/15
    # now it's this!
    cd $WORKSPACE/h2o-test-integ/tests/hdfs-bigdata
    cd $WORKSPACE/h2o-test-integ/tests/hdfs

    CLOUD=$CLOUD_IP:$CLOUD_PORT

    # have to copy some of these files to hdfs
    # is it deleting keys at the end of each test? hopefully things don't spill
    function DO_TEST() {
        ../../../scripts/run.py --wipeall --usecloud $CLOUD --test "$1"
    }

    # light the network burner and stand back
    # it should go out if this job dies
    # run it in the background.
    sshpass --- ssh 0xdiag@mr-0xe1 /home/0xdiag/loop_iperf_burn.sh &
    sshpass --- ssh 0xdiag@mr-0xe2 /home/0xdiag/loop_iperf_burn.sh &
    sshpass --- ssh 0xdiag@mr-0xe3 /home/0xdiag/loop_iperf_burn.sh &
    sshpass --- ssh 0xdiag@mr-0xe4 /home/0xdiag/loop_iperf_burn.sh &
    sshpass --- ssh 0xdiag@mr-0xe5 /home/0xdiag/loop_iperf_burn.sh &

    # monitoring bw:
    # even better is iftop interactive on all nodes, or speedometer -t eth0
    # see list on mr-0x1 thru mr-0x10 /root/show_network_bw.sh
    # or at http://www.binarytides.com/linux-commands-monitor-network/

    # wait to be sure things are going
    sleep 5

    DO_TEST runit_DL_186KRows_3.2KCols_xlarge.R
    DO_TEST runit_DL_1MRows_2.2KCols_xlarge.R
    DO_TEST runit_DL_airlines_billion_xlarge.R
    DO_TEST runit_GBM_15MRows_2.2KCols_xlarge.R
    DO_TEST runit_GBM_186KRows_3.2KCols_xlarge.R
    DO_TEST runit_GBM_1MRows_2.2KCols_xlarge.R
    DO_TEST runit_GBM_376KRows_6KCols_xlarge.R
    DO_TEST runit_GBM_AUTO_airlines_billion_xlarge.R
    DO_TEST runit_GBM_Bernoulli_airlines_billion_xlarge.R
    DO_TEST runit_GBM_Multinomial_airlines_billion_xlarge.R
    DO_TEST runit_GLM_15MRows_2.2KCols_xlarge.R
    DO_TEST runit_GLM_186KRows_3.2KCols_xlarge.R
    DO_TEST runit_GLM_1MRows_2.2KCols_xlarge.R
    DO_TEST runit_GLM_376KRows_6KCols_xlarge.R
    DO_TEST runit_GLM_IRLSM_airlines_billion_xlarge.R
    DO_TEST runit_GLM_LBFGS_airlines_billion_xlarge.R
    DO_TEST runit_hadoop_airlines_xlarge.R
    DO_TEST runit_RF_15MRows_2.2KCols_xlarge.R
    DO_TEST runit_RF_186KRows_3.2KCols_xlarge.R
    DO_TEST runit_RF_1MRows_2.2KCols_xlarge.R
    DO_TEST runit_RF_airlines_billion_xlarge.R

fi

***
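The loop_iperf_burn.sh burner script itself isn't included here; purely as a hedged illustration, a loop like the following would produce the sustained background traffic described (the peer host, rate, and duration are assumptions, not the real script's contents):

***

#!/bin/bash
# keep re-launching a fixed-rate UDP iperf stream toward a peer until this job is killed
while true; do
    iperf -c mr-0xe2 -u -b 900M -t 60
    sleep 1
done

***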

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: Tried another case: the h2o network test when the network is about 60%-70% busy due to others. Better; h2o seems to get only the unused bw?

Sustained 75MB/sec of traffic not due to h2o on all links (peak is maybe 117MB/sec on 1GbE).

Ran the network test while doing a GBM model build.

Much better than before with the more loaded network. I could probably do a graph showing network busyness vs. h2o network test results to remote nodes.

About 40MB/sec to each remote here.

So that's good; it means h2o can grab the remaining bandwidth and get to peak GbE rates (at least for this network test).

Network Test
destination                     1_bytes               1024_bytes               1048576_bytes
all - collective bcast/reduce   1.334 sec, 7 B/S      341.445 msec, 29.3 KB/S  641.132 msec, 15.6 MB/S
remote /172.17.2.231:54321      4.677 msec, 427 B/S   605 usec, 3.2 MB/S       50.052 msec, 40.0 MB/S
remote /172.17.2.232:54321      39.868 msec, 50 B/S   273 usec, 7.2 MB/S       49.145 msec, 40.7 MB/S
remote /172.17.2.233:54321      3.290 msec, 607 B/S   313 usec, 6.2 MB/S       50.555 msec, 39.6 MB/S
remote /172.17.2.234:54321      3.845 msec, 520 B/S   520 usec, 3.8 MB/S       55.396 msec, 36.1 MB/S
self /172.17.2.235:54321        55 usec, 35.1 KB/S    31 usec, 62.7 MB/S       31 usec, 61.99 GB/S

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: I have packet loss enabled now. This one is curious to me. The bw numbers seem good.

But the "all - collective bcast/reduce" result seems low: it takes 1.450 sec, and I've seen it take just 641 msec in a loaded-bw case. Does the packet loss hurt that case more?

Network Test
destination                     1_bytes                1024_bytes               1048576_bytes
all - collective bcast/reduce   571.594 msec, 17 B/S   98.046 msec, 102.0 KB/S  1.450 sec, 6.9 MB/S
remote /172.17.2.231:54321      6.012 msec, 332 B/S    278 usec, 7.0 MB/S       20.135 msec, 99.3 MB/S
remote /172.17.2.232:54321      4.417 msec, 452 B/S    237 usec, 8.2 MB/S       18.610 msec, 107.5 MB/S
remote /172.17.2.233:54321      3.301 msec, 605 B/S    253 usec, 7.7 MB/S       19.009 msec, 105.2 MB/S
remote /172.17.2.234:54321      2.807 msec, 712 B/S    1.496 msec, 1.3 MB/S     19.110 msec, 104.7 MB/S
self /172.17.2.235:54321        29 usec, 66.4 KB/S     27 usec, 72.2 MB/S       28 usec, 68.80 GB/S

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: I was just thinking about how most people's multi-machine h2o testing is on the hdp2.1 cluster, when you talk about multiple clouds on the same machines stressing bandwidth.

The problem is: h2o never stresses that network, even with multiple clouds,

because it's a 10G network, and h2o can't do more than about 200 MB/sec of network bw, which is only around 1.6 Gbits/sec, nowhere near saturating the 10G links.

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: I changed the udp datagram size down to 1470 bytes with -l 1470, and kept enough bandwidth to use all available (117MB/sec).
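A hedged sketch of the corresponding iperf client invocation (the peer host and rate are assumptions; -u selects UDP, -l 1470 sets the datagram size, -b sets the offered rate):

***

# UDP stream of 1470-byte datagrams at roughly GbE line rate (illustrative peer and rate)
iperf -c mr-0xe2 -u -l 1470 -b 950M -t 3600

***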

This is the h2o bw test then:

Network Test
destination                     1_bytes                1024_bytes               1048576_bytes
all - collective bcast/reduce   800.198 msec, 12 B/S   128.367 msec, 77.9 KB/S  4.391 sec, 2.3 MB/S
remote /172.17.2.231:54321      17.386 msec, 115 B/S   4.578 msec, 436.8 KB/S   192.186 msec, 10.4 MB/S
remote /172.17.2.232:54321      83.772 msec, 23 B/S    2.940 msec, 680.2 KB/S   383.072 msec, 5.2 MB/S
remote /172.17.2.233:54321      3.254 msec, 614 B/S    4.701 msec, 425.4 KB/S   401.580 msec, 5.0 MB/S
remote /172.17.2.234:54321      3.871 msec, 516 B/S    3.635 msec, 550.2 KB/S   709.427 msec, 2.8 MB/S
self /172.17.2.235:54321        27 usec, 70.7 KB/S     50 usec, 38.8 MB/S       37 usec, 51.61 GB/S

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1482
Assignee: New H2O Bugs
Reporter: Kevin Normoyle
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A