skyplane-project / skyplane

🔥 Blazing fast bulk data transfers between any cloud 🔥
https://skyplane.org
Apache License 2.0
1.09k stars 62 forks source link

[bug] Transfer fails with crash #785

Open ephemer opened 1 year ago

ephemer commented 1 year ago

Describe the bug I am trying to transfer about 100GB from gs:// to s3://

After finally working around permissions issues (see #784) I am now experiencing a crash at about 1/3 the way through (34GB transferred).

To Reproduce

  1. skyplane cp --reuse-gateways --recursive gs://note_detection_data/targets s3://note-detection-data/targets --confirm

Expected behavior The transfer completes without error.

Screenshots

                                                  GiB   MB/s       ⠦ Transfer progress (2819 of 169349 chunks) ━╸━━━ 34.0… 226… 0:04:…
                                                  GiB   MB/s
Compression saved 97.35% of egress fees
Traceback (most recent call last):
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/cli/cli_impl/cp
_replicate.py", line 330, in launch_replication_job
    stats = rc.monitor_transfer(
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/replicate/repli
cator_client.py", line 589, in monitor_transfer
    do_parallel(lambda i: i.run_command("echo 1"),
self.bound_nodes.values(), n=-1)
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/utils/fn.py",
line 57, in do_parallel
    args, result = future.result()
  File
"/opt/homebrew/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framewo
rk/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line
439, in result
    return self.__get_result()
  File
"/opt/homebrew/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framewo
rk/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line
391, in __get_result
    raise self._exception
  File
"/opt/homebrew/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framewo
rk/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line
58, in run
    result = self.fn(*self.args, **self.kwargs)
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/utils/fn.py",
line 46, in wrapped_fn
    raise e
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/utils/fn.py",
line 43, in wrapped_fn
    return args, func(args)
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/replicate/repli
cator_client.py", line 589, in <lambda>
    do_parallel(lambda i: i.run_command("echo 1"),
self.bound_nodes.values(), n=-1)
  File
"/opt/homebrew/lib/python3.9/site-packages/skyplane/compute/server.
py", line 222, in run_command
    _, stdout, stderr = client.exec_command(command)
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/client.py",
line 540, in exec_command
    chan.exec_command(command)
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/channel.py",
line 70, in _check
    return func(self, *args, **kwds)
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/channel.py",
line 255, in exec_command
    self._wait_for_event()
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/channel.py",
line 1224, in _wait_for_event
    raise e
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/transport.py",
line 2109, in run
    ptype, m = self.packetizer.read_message()
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/packet.py",
line 453, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File
"/opt/homebrew/lib/python3.9/site-packages/paramiko/packet.py",
line 303, in read_all
    raise EOFError()
EOFError

/opt/homebrew/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Transfer client log

11:40:11 [WARN]  Using BBR, make sure you indend to!
11:40:11 [WARN]  Using BBR, make sure you indend to!
11:40:14 [DEBUG] Starting gateway aws:eu-west-1:i-0bb9e29fe0d491c36, host: 3.250.229.154 docker pull in 1.0454459190368652
11:40:14 [DEBUG] Starting gateway aws:eu-west-1:i-0bb9e29fe0d491c36, host: 3.250.229.154: Starting gateway container
11:40:15 [INFO]  Starting gateway aws:eu-west-1:i-0bb9e29fe0d491c36, host: 3.250.229.154: sudo docker run -d --log-driver=local --log-opt max-file=16 --ipc=host --network=host --ulimit nofile=1048576 --mount type=tmpfs,dst=/skyplane,tmpfs-size=$(($(free -b  | head -n2 | tail -n1 | awk '{print $2}')/2)) -v /tmp/config:/pkg/data/config -v /tmp/e2ee_key:/pkg/data/e2ee_key --env SKYPLANE_IS_GATEWAY=1 --env SKYPLANE_CONFIG=/pkg/data/config --env AWS_METADATA_SERVICE_NUM_ATTEMPTS=4 --env AWS_METADATA_SERVICE_TIMEOUT=10 --env AWS_DEFAULT_REGION=eu-west-1 --env E2EE_KEY_FILE=/pkg/data/e2ee_key --name skyplane_gateway public.ecr.aws/s6m1p0n8/skyplane:0.2.1 /bin/bash -c "/etc/init.d/stunnel4 start && python -u /pkg/skyplane/gateway/gateway_daemon.py --chunk-dir /skyplane/chunks --outgoing-ports '{}' --region aws:eu-west-1 --use-compression  --disable-tls"
11:40:15 [DEBUG] Starting gateway aws:eu-west-1:i-0bb9e29fe0d491c36, host: 3.250.229.154: Gateway started b88427e0f478bc93be063574b6c931f9a02bb78fec20db50345e419287affa87
11:40:16 [DEBUG] Bound remote port aws:eu-west-1:i-0bb9e29fe0d491c36:8888 to localhost:52305
11:40:16 [DEBUG] aws:eu-west-1:i-0bb9e29fe0d491c36 log_viewer_url = http://127.0.0.1:52305/container/b88427e0f478
11:40:17 [DEBUG] Bound remote port aws:eu-west-1:i-0bb9e29fe0d491c36:8081 to localhost:52307
11:40:17 [DEBUG] aws:eu-west-1:i-0bb9e29fe0d491c36 gateway_api_url = http://127.0.0.1:52307
11:40:17 [DEBUG] [wait_for] Waiting for gateway aws:eu-west-1:i-0bb9e29fe0d491c36 to start fn=<function Server.start_gateway.<locals>.is_api_ready at 0x171bb4790> completed in 0.17s
11:40:18 [DEBUG] Starting gateway gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, host: 34.83.215.222 docker pull in 0.6846699714660645
11:40:18 [DEBUG] Starting gateway gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, host: 34.83.215.222: Starting gateway container
11:40:24 [INFO]  Starting gateway gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, host: 34.83.215.222: sudo docker run -d --log-driver=local --log-opt max-file=16 --ipc=host --network=host --ulimit nofile=1048576 --mount type=tmpfs,dst=/skyplane,tmpfs-size=$(($(free -b  | head -n2 | tail -n1 | awk '{print $2}')/2)) -v /tmp/config:/pkg/data/config -v /tmp/service_account_key.json:/pkg/data/service_account_key.json -v /tmp/e2ee_key:/pkg/data/e2ee_key --env SKYPLANE_IS_GATEWAY=1 --env SKYPLANE_CONFIG=/pkg/data/config --env GCP_SERVICE_ACCOUNT_FILE=/pkg/data/service_account_key.json --env E2EE_KEY_FILE=/pkg/data/e2ee_key --name skyplane_gateway public.ecr.aws/s6m1p0n8/skyplane:0.2.1 /bin/bash -c "/etc/init.d/stunnel4 start && python -u /pkg/skyplane/gateway/gateway_daemon.py --chunk-dir /skyplane/chunks --outgoing-ports '{\"3.250.229.154\": 32}' --region gcp:us-west1-b --use-compression  --disable-tls"
11:40:25 [DEBUG] Starting gateway gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, host: 34.83.215.222: Gateway started 817afb69a003bff7409d6b8ce4ed57b383b289f48cf32f1fe19a2eef0564483e
11:40:26 [DEBUG] Bound remote port gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144:8888 to localhost:52312
11:40:26 [DEBUG] gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144 log_viewer_url = http://127.0.0.1:52312/container/817afb69a003
11:40:28 [DEBUG] Bound remote port gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144:8081 to localhost:52314
11:40:28 [DEBUG] gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144 gateway_api_url = http://127.0.0.1:52314
11:40:29 [DEBUG] [wait_for] Waiting for gateway gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144 to start fn=<function Server.start_gateway.<locals>.is_api_ready at 0x171bb4f70> completed in 0.66s
11:40:29 [INFO]  Log URLs for gcp:us-west1-b:skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144 (gcp:us-west1-b:0)
11:40:29 [INFO]     Log viewer: http://127.0.0.1:52312/container/817afb69a003
11:40:29 [INFO]     API: http://127.0.0.1:52314
11:40:29 [INFO]  Log URLs for aws:eu-west-1:i-0bb9e29fe0d491c36 (aws:eu-west-1:0)
11:40:29 [INFO]     Log viewer: http://127.0.0.1:52305/container/b88427e0f478
11:40:29 [INFO]     API: http://127.0.0.1:52307
11:40:29 [DEBUG] initiate_multipart_transfers: 0.00s
11:40:29 [INFO]  Batch 0 size: 100954957834 with 169349 chunks
11:40:29 [DEBUG] Batch 0 size: 100954957834 with 169349 chunks
11:40:54 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 152965 remaining
11:41:14 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 136581 remaining
11:41:23 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 120197 remaining
11:41:40 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 103813 remaining
11:41:55 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 87429 remaining
11:42:04 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 71045 remaining
11:42:13 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 54661 remaining
11:42:23 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 38277 remaining
11:42:32 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 21893 remaining
11:42:43 [DEBUG] Sent 16384 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 5509 remaining
11:42:46 [DEBUG] Sent 5509 chunk requests to skyplane-gcp-410fdbe8814c4fb9b02f089f18d86144, 0 remaining
11:42:46 [DEBUG] Building chunk requests: 137.18s
11:42:54 [DEBUG] 79.497776s: 141/169349 chunks, 3758483366/100954957834 bytes, 0.35 Gbit/s
11:43:11 [DEBUG] 133.47798s: 1025/169349 chunks, 19123536866/100954957834 bytes, 1.07 Gbit/s
11:43:16 [DEBUG] 133.47798s: 1066/169349 chunks, 19685827394/100954957834 bytes, 1.10 Gbit/s
11:43:21 [DEBUG] 133.47798s: 1066/169349 chunks, 19685827394/100954957834 bytes, 1.10 Gbit/s
11:43:41 [DEBUG] 176.272467s: 1098/169349 chunks, 20105185875/100954957834 bytes, 0.85 Gbit/s
11:43:47 [DEBUG] 189.603239s: 2309/169349 chunks, 32414395306/100954957834 bytes, 1.27 Gbit/s
11:44:01 [DEBUG] 194.817813s: 2649/169349 chunks, 35151552984/100954957834 bytes, 1.34 Gbit/s
11:44:06 [DEBUG] 197.158828s: 2819/169349 chunks, 36461482816/100954957834 bytes, 1.38 Gbit/s
11:44:12 [DEBUG] 197.158828s: 2819/169349 chunks, 36461482816/100954957834 bytes, 1.38 Gbit/s
11:44:17 [DEBUG] 197.158828s: 2819/169349 chunks, 36461482816/100954957834 bytes, 1.38 Gbit/s
11:44:23 [DEBUG] 226.153999s: 2819/169349 chunks, 36461482816/100954957834 bytes, 1.20 Gbit/s
11:44:29 [DEBUG] 231.960078s: 2819/169349 chunks, 36461482816/100954957834 bytes, 1.17 Gbit/s
11:44:29 [ERROR] Error running <lambda>: 
11:44:31 [INFO]  Total compressed bytes sent: 0.90GB
11:44:31 [INFO]  Total uncompressed bytes sent: 33.96GB
11:44:31 [INFO]  Compression ratio: 0.026481664771359582

Environment info (please complete the following information):

Additional context I had to downgrade the default GCP instance size to n2-standard-8 (down from 32) because GCP won't raise our quota. Maybe this is an OOM error?

ephemer commented 1 year ago

I tried this again and it seemed to work 🤷🏼‍♂️

sarahwooders commented 1 year ago

Thanks for reporting this! I'm going to re-open it since it looks like a concurrency bug.