danking opened this issue 1 year ago
I used the gsutil storage bandwidth tool and confirmed we get 1.2 Gibit/s upload and download speeds from within a 1-core job with 10 Gi of storage. Adding more cores didn't change anything.
I ran a test job with the copy tool on a 10 Gi random file and matched 1.2 Gibit/s.
I'm wondering if the problem is actually workload-dependent, driven by the number of jobs and number of files. The GCS best practices state that a bucket's initial capacity is 5,000 read requests per second, including list operations, until the bucket has had time to scale up its capacity.
https://cloud.google.com/storage/docs/request-rate#best-practices
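As a rough sanity check on the request-rate hypothesis, a batch's aggregate read-request rate can be compared against that 5,000 requests/second initial per-bucket capacity. The job counts and per-job request figures below are illustrative assumptions, not measurements:

```python
# Estimate a batch's aggregate GCS read-request rate and compare it to the
# documented initial per-bucket capacity of 5,000 read requests/second.
# All workload numbers here are illustrative assumptions.
INITIAL_READ_QPS = 5_000  # per-bucket starting capacity per GCS best practices


def exceeds_initial_capacity(n_jobs: int, requests_per_job: int, window_s: float) -> bool:
    """True if n_jobs, each issuing requests_per_job reads within window_s
    seconds, would exceed the bucket's initial read-request capacity."""
    qps = n_jobs * requests_per_job / window_s
    return qps > INITIAL_READ_QPS


# A single 1-core job copying one large file is nowhere near the limit...
print(exceeds_initial_capacity(1, 10, 60))        # False
# ...but thousands of jobs listing and opening many small files could be.
print(exceeds_initial_capacity(5_000, 120, 60))   # True
```

If the large-file copy below still tops out at 1.2 Gibit/s, though, the request-rate limit can't be the whole story, since that workload issues only a handful of requests.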
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 43.1 6.4 40.9 50.9
Delete 1 KiB 5 44.2 12.7 42.5 58.1
Delete 100 KiB 5 44.7 10.4 42.8 56.3
Delete 1 MiB 5 41.5 3.7 40.2 45.7
Download 0 B 5 74.6 7.9 73.2 84.0
Download 1 KiB 5 84.3 15.9 80.6 103.4
Download 100 KiB 5 81.9 16.0 82.7 99.6
Download 1 MiB 5 90.6 6.5 94.5 96.8
Metadata 0 B 5 23.6 2.7 23.6 26.3
Metadata 1 KiB 5 25.5 2.1 26.9 27.4
Metadata 100 KiB 5 26.2 3.6 27.3 29.9
Metadata 1 MiB 5 24.0 3.7 23.3 28.4
Upload 0 B 5 98.1 16.6 95.5 117.9
Upload 1 KiB 5 116.7 21.8 115.5 142.1
Upload 100 KiB 5 116.5 17.8 115.1 135.1
Upload 1 MiB 5 168.2 18.5 179.6 185.6
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied 5 512 MiB file(s) for a total transfer size of 2.5 GiB.
Write throughput: 977.7 Mibit/s.
Parallelism strategy: both
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied 5 512 MiB file(s) for a total transfer size of 2.5 GiB.
Read throughput: 1.11 Gibit/s.
Parallelism strategy: both
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
172.21.46.11
Temporary Directory:
/tmp
Bucket URI:
gs://hail-jigold/
gsutil Version:
5.24
boto Version:
2.49.0
Measurement time:
2023-06-05 03:25:16 PM
Running on GCE:
True
GCE Instance:
Bucket location:
US-CENTRAL1
Bucket storage class:
REGIONAL
Google Server:
Google Server IP Addresses:
142.250.128.128
142.251.6.128
108.177.112.128
74.125.124.128
172.217.212.128
172.217.214.128
172.253.119.128
108.177.111.128
142.250.1.128
108.177.121.128
142.250.103.128
108.177.120.128
142.250.159.128
142.251.120.128
142.251.161.128
74.125.126.128
Google Server Hostnames:
ib-in-f128.1e100.net
ic-in-f128.1e100.net
jo-in-f128.1e100.net
jp-in-f128.1e100.net
jq-in-f128.1e100.net
jr-in-f128.1e100.net
jt-in-f128.1e100.net
jv-in-f128.1e100.net
jw-in-f128.1e100.net
jx-in-f128.1e100.net
jy-in-f128.1e100.net
jz-in-f128.1e100.net
ie-in-f128.1e100.net
if-in-f128.1e100.net
ig-in-f128.1e100.net
ik-in-f128.1e100.net
Google DNS thinks your IP is:
CPU Count:
16
CPU Load Average:
[32.39, 33.2, 19.0]
Total Memory:
57.5 GiB
Free Memory:
38.41 GiB
TCP segment counts not available because "netstat" was not found during test runs
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
loop0 0 0 0 0 0 0
loop1 0 0 0 0 0 0
loop3 0 0 0 0 0 0
loop4 0 0 0 0 0 0
loop5 0 0 0 0 0 0
nvme0n1 4385 4694 581857280 1743810560 6453 527129
sda1 0 544 0 3731456 0 429
sda14 0 0 0 0 0 0
sda15 0 0 0 0 0 0
TCP /proc values:
tcp_timestamps = 1
tcp_sack = 1
tcp_window_scaling = 1
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
1.5
Latencies connecting to Google Storage server IPs (ms):
74.125.126.128 = 1.1
------------------------------------------------------------------------------
In-Process HTTP Statistics
------------------------------------------------------------------------------
Total HTTP requests made: 149
HTTP 5xx errors: 0
HTTP connections broken: 0
Availability: 100%
Output file written to '/tmp/output.json'.
@jigold
I ran a test job with the copy tool on a 10 Gi random file and matched 1.2 Gibit / second.
Does this mean something like:
j = b.new_job()
j.image('hailgenetics/hail:0.2.118')
j.command('python3 -m hailtop.copy ...')
Or did you use a read_input? I'm curious if something about how we configure the input container could affect this. I doubt it, but wanted to confirm.
https://batch.hail.is/batches/7504201/jobs/1
import hailtop.batch as hb

backend = hb.ServiceBackend('hail')
for i in range(1):
    b = hb.Batch(backend=backend)
    for k in range(1):
        input = b.read_input('gs://hail-jigold/resources/data/random_10Gi.txt')
        j = b.new_job(name='bandwidth_copy_tool_1_10_gi_storage')
        j.image('google/cloud-sdk:slim')
        j.cpu(1)
        j.command(
            f'''
            ls {input}
            ''')
        j.storage('10Gi')
    b.run(wait=False)
backend.close()
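Back of the envelope, 1.2 Gibit/s on a 10 Gi file implies a transfer time of roughly 67 seconds. A small helper (hypothetical, not part of the test job) converts a timed transfer back into the same units, which is handy for cross-checking figures like the ones above:

```python
def gibit_per_s(n_bytes: int, seconds: float) -> float:
    """Convert a timed transfer into binary gibibits per second."""
    return n_bytes * 8 / 2**30 / seconds


# 10 GiB copied in ~67 seconds lands at about 1.2 Gibit/s, matching the
# observed number above.
print(round(gibit_per_s(10 * 2**30, 66.7), 2))  # 1.2
```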
What happened?
Reported by Ben Weisburd: https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/batch.20OOM.20on.20input/near/352268417
An n1-standard-16 should receive as much as 20 Gbps / 1.8M packets per second on its external IP, per Google's documentation. This suggests Hail is operating at roughly 6% of peak efficiency.
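Comparing the two figures requires converting between binary (Gibit) and decimal (Gbps) prefixes; a minimal unit-conversion sketch makes the gap explicit:

```python
# Convert the observed copy throughput (binary gibibits/s) and the NIC's
# advertised peak (decimal gigabits/s) to plain bits/s and compare them.
observed_gibit_s = 1.2                      # measured upload/download speed
observed_bit_s = observed_gibit_s * 2**30   # Gibit -> bits (binary prefix)

peak_gbit_s = 20.0                          # n1-standard-16 egress cap (decimal)
peak_bit_s = peak_gbit_s * 10**9            # Gbps -> bits (decimal prefix)

fraction = observed_bit_s / peak_bit_s
print(f'{fraction:.1%}')  # 6.4%
```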
Version
batch
Relevant log output
No response