skyplane-project / skyplane

🔥 Blazing fast bulk data transfers between any cloud 🔥
https://skyplane.org
Apache License 2.0
1k stars 58 forks

[SKY-270] [bug] Leaked instances for Ctrl-C of transfers #885

Open sarahwooders opened 1 year ago

sarahwooders commented 1 year ago

Describe the bug: There is consistently one leaked VM after a transfer is cancelled with Ctrl-C.

To Reproduce: Run the transfer `skyplane cp -r gs://skyplane-big-test-bucket/OPT-cloudflare/ s3://test-us-east-1-7711e4ae/`. During dispatch, press Ctrl-C to exit the transfer.
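One way to reason about the leak is to look at how cleanup is tied to the interrupt. The following is a minimal sketch of the general pattern, not Skyplane's actual code: `provision`, `transfer`, and `deprovision` are hypothetical callables standing in for whatever the client really does. It shows teardown placed in a `finally` block so that it runs on Ctrl-C, and a per-VM `try` so that one failed teardown does not skip the remaining VMs.

```python
from typing import Callable, List


def run_with_cleanup(
    provision: Callable[[], List[str]],
    transfer: Callable[[List[str]], None],
    deprovision: Callable[[str], None],
) -> bool:
    """Run a transfer and always attempt to tear down every provisioned VM.

    All three callables are hypothetical stand-ins, not Skyplane APIs.
    Returns True if the transfer finished, False if it was interrupted.
    """
    vms: List[str] = []
    try:
        vms = provision()
        transfer(vms)
        return True
    except KeyboardInterrupt:
        # Ctrl-C lands here; cleanup still happens in the finally block.
        return False
    finally:
        for vm in vms:
            try:
                deprovision(vm)
            except Exception as exc:
                # Keep going so a single failure does not leak the other VMs.
                print(f"failed to deprovision {vm}: {exc}")
```

Note that this sketch only knows about VMs once `provision()` has returned. An interrupt that arrives while provisioning is still in progress would need the VM list to be recorded incrementally.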

Transfer client log

Logging to: /tmp/skyplane/transfer_logs/20230623_145734-bd9ae325/client.log
Using Skyplane version 0.3.2
Will transfer objects from gcp:us-central1-a to aws:us-east-1
14:57:36 [WARN]  Quota limit file not found for aws:us-east-1. Try running `skyplane init --reinit-aws` to load the quota information
  VMs to provision: 1x aws:us-east-1, 1x gcp:us-central1-a
  Estimated egress cost: $0.12/GB
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-0.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-0.pt (15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-1.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-1.pt (15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-2.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-2.pt (15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-3.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-3.pt (15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-4.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-4.pt (15.34GB)
  ...
Transfer starting
14:57:41 [WARN]  Quota limit file not found for aws:us-east-1. Try running `skyplane init --reinit-aws` to load the quota information
✓ Provisioning VMs (2/2) in 37.14s
⠼ Authorizing gateways with firewalls ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/2 0:00:01
14:58:41 [WARN]  :us-east-1 Error adding IPs to security group, since it already exits: An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists
✓ Starting gateway container on VMs (2/2) in 28.52s
⠹ Transfer progress aws:us-east-1 ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/122.7 GiB 482.5 MB/s 0:04:15^C
Transfer cancelled by user. Copying gateway logs and exiting.
⠇ Transfer progress aws:us-east-1 ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.1/122.7 GiB 473.2 MB/s 0:04:14
15:00:00 [ERROR] Error running <lambda>, GCPServer(region_tag=gcp:us-central1-a, instance_name=skyplane-gcp-de24eada): 'NoneType' object has no attribute 'open_session'
15:00:00 [ERROR] Error running <lambda>, AWSServer(region_tag=aws:us-east-1, instance_id=i-0861627e6ae3b80f1): 'NoneType' object has no attribute 'open_session'
Exception in thread Thread-35:
Traceback (most recent call last):
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 181, in monitor_single_dst_helper
    self.monitor_transfer(dst_region)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/imports.py", line 33, in wrapped
    return fn(*modules_imported, *args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 278, in monitor_transfer
    do_parallel(lambda i: i.run_command("echo 1"), self.dataplane.bound_nodes.values(), n=8)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 57, in do_parallel
    args, result = future.result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 43, in wrapped_fn
    return args, func(args)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 278, in <lambda>
    do_parallel(lambda i: i.run_command("echo 1"), self.dataplane.bound_nodes.values(), n=8)
  File "/Users/sarahwooders/repos/skyplane/skyplane/compute/server.py", line 241, in run_command
    _, stdout, stderr = client.exec_command(command)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.10/site-packages/paramiko/client.py", line 560, in exec_command
    chan = self._transport.open_session(timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'open_session'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 216, in run
    raise e
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 214, in run
    results.append(future.result())
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 194, in monitor_single_dst_helper
    UsageClient.log_exception(
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 147, in log_exception
    stats = client.make_error(
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 304, in make_error
    dest_regions = [tag.split(":")[1] for tag in dest_region_tags]
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 304, in <listcomp>
    dest_regions = [tag.split(":")[1] for tag in dest_region_tags]
IndexError: list index out of range
⠇ Transfer progress aws:us-east-1 ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.1/122.7 GiB 473.2 MB/s 0:04:14
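The log above shows two secondary failures after the Ctrl-C. First, `exec_command` was called on an SSH client whose transport was `None`; why it was `None` is not shown, but one possible cause is that the client had already been closed. Second, `make_error` indexed `tag.split(":")[1]` on a tag that contained no `:`. The sketch below shows one way to guard against each. The helper names `ssh_is_usable`, `region_of`, and `regions_of` are hypothetical and are not part of Skyplane. The first helper assumes a client exposing paramiko's `SSHClient.get_transport()`.

```python
from typing import List, Optional


def ssh_is_usable(client) -> bool:
    """Return True only if the client still has an active transport.

    `client` is assumed to behave like paramiko.SSHClient, whose
    get_transport() returns None when there is no open connection.
    """
    transport = client.get_transport()
    return transport is not None and transport.is_active()


def region_of(tag: str) -> Optional[str]:
    """Extract the region from a 'provider:region' tag, or None if malformed."""
    _provider, sep, region = tag.partition(":")
    return region if sep and region else None


def regions_of(tags: List[str]) -> List[str]:
    # Skip malformed tags instead of raising IndexError during error reporting.
    return [r for r in (region_of(t) for t in tags) if r is not None]
```

With a check like `ssh_is_usable`, a monitoring loop could skip its `echo 1` health check once the connection is gone. With `regions_of`, a malformed tag would no longer raise a second exception while the first one is being reported.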

Environment info (please complete the following information):

SKY-270