skyplane-project / skyplane

🔥 Blazing fast bulk data transfers between any cloud 🔥
https://skyplane.org
Apache License 2.0

[bug] No space left on device (Receiver AWS Instance) #777

Open hibanacreatives opened 1 year ago

hibanacreatives commented 1 year ago

Describe the bug: Did my best to check the issues and docs but didn't see this come up.

I'm transferring a single large (~2TB) file from GCP to AWS. It gets about 1% done before it stops, deprovisions, and displays the following errors:

❌ AWSServer(region_tag=aws:us-west-2, instance_id=i-0d1fdaf3689f0563d) encountered error:
          Traceback (most recent call last)
          File "/pkg/skyplane/gateway/gateway_receiver.py", line 93, in server_worker
            self.recv_chunks(ssl_conn, addr)
          File "/pkg/skyplane/gateway/gateway_receiver.py", line 181, in recv_chunks
            f.write(to_write)
          OSError: [Errno 28] No space left on device

The receiver instance seems to be running out of storage space. It's the default m5.8xlarge instance type. I'm not sure how much storage is allocated to it, and I didn't see a configuration option for that. This tracks with my observation that higher -n settings get farther along, since the design should split the load across the instances.
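To illustrate the failure mode described above, here is a minimal Python sketch of a write that checks free disk space first and raises the same errno (28, ENOSPC) before any partial write happens. The helper `write_with_space_check` and its `reserve_bytes` parameter are hypothetical names made up for this example; this is not how Skyplane's gateway_receiver.py is implemented.

```python
import errno
import os
import shutil
import tempfile


def write_with_space_check(path: str, data: bytes, reserve_bytes: int = 0) -> None:
    """Write data to path, failing early if the target disk looks too full.

    Hypothetical helper for illustration only; not part of Skyplane.
    """
    directory = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(directory).free
    if free < len(data) + reserve_bytes:
        # Same errno the receiver reported, raised before writing anything.
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC), path)
    with open(path, "wb") as f:
        f.write(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "chunk.bin")
        write_with_space_check(target, b"x" * 1024)
        print("wrote", os.path.getsize(target), "bytes")
```

Running `df -h` on the receiver VM would give the same free-space figure that `shutil.disk_usage` reads here.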

Multipart was enabled during the runs.

Are there any best practices for handling single large files? I suspect that's near the root of the problem.

Thanks for the help and really cool project!

To Reproduce: Steps to reproduce the behavior (please include the full Skyplane command you ran):

Expected behavior: The transfer should complete without errors and my file should show up in AWS S3.

Transfer client log: In the log output from Skyplane, please upload the debug log from the CLI. You can find the path to the file in the log output:

$ skyplane cp ...
...
Storing debug information for transfer in /tmp/skyplane/transfer_logs/...
...

Environment info (please complete the following information):

Additional context: I've made attempts with varying numbers of instances but had the same result.

Skyplane Config:

(harmonic-rnd) kikou@healthwallet-dev01 ~/LocalCode/harmonic-rnd $ skyplane config list
autoconfirm = False
bbr = True
compress = True
encrypt_e2e = True
encrypt_socket_tls = False
verify_checksums = True
multipart_enabled = True
multipart_min_threshold_mb = 128
multipart_min_size_mb = 5
multipart_chunk_size_mb = 64
multipart_max_chunks = 9990
num_connections = 32
max_instances = 1
autoshutdown_minutes = 15
aws_use_spot_instances = True
azure_use_spot_instances = False
gcp_use_spot_instances = False
aws_instance_class = m5.8xlarge
azure_instance_class = Standard_D32_v5
gcp_instance_class = n2-standard-32
gcp_use_premium_network = True
usage_stats = True
gcp_service_account_name = skyplane-manual
requester_pays = False
native_cmd_enabled = True
native_cmd_threshold_gb = 2
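For a sense of scale, the multipart settings above can be checked against the ~2TB file with some back-of-envelope arithmetic. This sketch assumes "~2TB" means 2 * 10^12 bytes, that the `_mb` settings are in MiB, and that data is split evenly across instances; none of these are confirmed in the thread, and how Skyplane itself reconciles the chunk size with `multipart_max_chunks` is not stated here.

```python
FILE_SIZE_BYTES = 2 * 10**12  # "~2TB" from the report, taken as decimal TB (assumption)
MIB = 1024**2                 # assumes the *_mb settings mean MiB
CHUNK_SIZE_MB = 64            # multipart_chunk_size_mb from the config above
MAX_CHUNKS = 9990             # multipart_max_chunks from the config above


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def chunk_count(file_size: int, chunk_mb: int) -> int:
    """Number of chunks needed at a given chunk size."""
    return ceil_div(file_size, chunk_mb * MIB)


def min_chunk_mb(file_size: int, max_chunks: int) -> int:
    """Smallest whole-MiB chunk size that keeps the chunk count within max_chunks."""
    return ceil_div(file_size, max_chunks * MIB)


if __name__ == "__main__":
    print(chunk_count(FILE_SIZE_BYTES, CHUNK_SIZE_MB))  # 29803
    print(min_chunk_mb(FILE_SIZE_BYTES, MAX_CHUNKS))    # 191
    for n in (1, 5):
        # Even split across n instances is an assumption.
        print(n, "instance(s):", FILE_SIZE_BYTES // n // 10**9, "GB each")
```

Under these assumptions, 64 MiB chunks would need roughly three times the configured chunk limit. Whether that has anything to do with the disk filling up is not established in this thread.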

I'm thinking of maybe trying a storage-focused instance type instead of the default.

sarahwooders commented 1 year ago

Thanks for reporting this issue @hibanacreatives! Could you also please attach the client.log file? The file is printed at the start of the transfer like:

Logging to: /tmp/skyplane/transfer_logs/20230313_172455-c9bc9280/client.log

Also, did Skyplane download any gateway logs? Their paths look like /tmp/skyplane/transfer_logs/20230313_172455/gateway_aws:us-east-1:i-0e776422b8c43582e.stdout.
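If it helps with locating those files, the following Python sketch lists client.log and any gateway_*.stdout files in the most recent transfer directory. It assumes the directory layout matches the example paths above and that directory names begin with a timestamp, so that lexical order is chronological. `find_transfer_logs` is a hypothetical helper, not a Skyplane command.

```python
from pathlib import Path
from typing import List, NamedTuple, Optional


class TransferLogs(NamedTuple):
    run_dir: Path
    client_logs: List[Path]
    gateway_logs: List[Path]


def find_transfer_logs(root: str = "/tmp/skyplane/transfer_logs") -> Optional[TransferLogs]:
    """Return the log files from the newest run directory, or None if none exist."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # Assumes names start with a timestamp (e.g. 20230313_172455-...),
    # so the lexically last directory is the newest run.
    runs = sorted(p for p in root_path.iterdir() if p.is_dir())
    if not runs:
        return None
    latest = runs[-1]
    return TransferLogs(
        run_dir=latest,
        client_logs=sorted(latest.glob("client.log")),
        gateway_logs=sorted(latest.glob("gateway_*.stdout")),
    )


if __name__ == "__main__":
    logs = find_transfer_logs()
    if logs is None:
        print("No transfer logs found.")
    else:
        print("Run directory:", logs.run_dir)
        for path in logs.client_logs + logs.gateway_logs:
            print(" ", path)
```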

I think this is a bug in Skyplane so I need to look into it further. A temporary workaround might be to use more VMs - how many did you try?

hibanacreatives commented 1 year ago

Thanks for the reply @sarahwooders. Totally meant to upload the log with the report, oops. Here ya go: client.log

No gateway logs found.

Tried up to -n5 but then started hitting some quota limits.

Thanks and let me know if I can help further.

sarahwooders commented 1 year ago

Thanks for the client log! We're working on a potential fix right now so will keep you posted with that.

sarahwooders commented 1 year ago

Hi @hibanacreatives - I'm actually having some trouble reproducing the error. Could you please try upgrading skyplane with pip install --upgrade skyplane, and then re-run the command with the --debug flag? That should download the gateway logs, and it would be great if you could share those with me.

hibanacreatives commented 1 year ago

@sarahwooders Ohh! Will do. I was on 0.2.1 and I see there's a 0.3.0. I'll give that a go a little bit later tonight and let you know. Thanks for taking the time to poke at it.
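A quick way to confirm which version is installed in the active environment, before and after upgrading, is to ask Python's package metadata. This is a generic sketch using the standard library; `installed_version` is a name made up for the example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("skyplane:", installed_version("skyplane"))
```

`pip show skyplane` reports the same information from the shell.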

hibanacreatives commented 1 year ago

I tried again and bumped into a different set of errors. The gateway allocation is timing out. I'm going to try again with a fresh environment and will post some debug info.

sarahwooders commented 1 year ago

Ah ok - please post the client.log files and the gateway logs from --debug mode when you get the chance!

hibanacreatives commented 1 year ago

debug_files.zip Files attached. I noticed a ModuleNotFound error for 'typer' in the gateway logs, although I confirmed the module is installed in my venv. Does that mean the gateway environment is somehow missing that module?
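One way to tell a local problem from a gateway-side one is to check whether the module can be imported by a specific interpreter. The sketch below is a generic standard-library check, with `module_available` as a made-up helper name. It only describes the environment it runs in, so a passing result in a local venv says nothing about the gateway's environment.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Print the interpreter path to make clear which environment was checked.
    print("interpreter:", sys.executable)
    print("typer importable:", module_available("typer"))
```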

Please let me know what else I can do to help diagnose. Thanks for your time.

sarahwooders commented 1 year ago

Sorry for the delayed response - we just fixed the bug on the gateways. Could you please upgrade Skyplane and try again? Really appreciate your help with debugging this issue.

sarahwooders commented 1 year ago

@hibanacreatives were you able to resolve this issue?

hibanacreatives commented 1 year ago

Hi! Sorry for the delayed response. I worked around my issue, but very happy to help debug. It still didn't work iirc. I'll have some time to generate debug info this weekend.

Thanks for the ping

sarahwooders commented 1 year ago

@hibanacreatives yes would really appreciate getting some of the logs so we can fix this for future users!