tilde-lab / yascheduler

Yet another cloud computing scheduler for the high-throughput cloud scientific simulations
https://mpds.io/search/ab%20initio%20calculations
MIT License

Fail to handle connection issues #106

Closed blokhin closed 1 year ago

blokhin commented 1 year ago

The -v (and -o) option of yastatus lists excerpts from the output logs on each active machine. The problem is that the status sometimes changes right during the listing, so the machine is no longer available. Errors like the one below then occur (this should be easy to handle).

..................................................ID2456 aiida-33160 at root@65.109.143.81:hetzner:data/tasks/20221231_044118_2456
INFO:backoff:Backing off create(...) for 0.8s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 0.5s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 1.9s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 2.5s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 4.8s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 6.3s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 4.5s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 4.2s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 1.9s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 3.7s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
INFO:backoff:Backing off create(...) for 1.1s (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
ERROR:backoff:Giving up create(...) after 12 tries (OSError: [Errno 113] Connect call failed ('65.109.143.81', 22))
Traceback (most recent call last):
  File "/usr/local/bin/yastatus", line 8, in <module>
    sys.exit(check_status())
  File "/usr/local/lib/python3.9/dist-packages/yascheduler/utils.py", line 237, in check_status
    asyncio.run(_check_status())
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/dist-packages/yascheduler/utils.py", line 148, in _check_status
    machine = await RemoteMachine.create(
  File "/usr/local/lib/python3.9/dist-packages/backoff/_async.py", line 151, in retry
    ret = await target(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/yascheduler/remote_machine/remote_machine.py", line 192, in create
    conn = await asyncssh.connection.connect(
  File "/usr/local/lib/python3.9/dist-packages/asyncssh/connection.py", line 7834, in connect
    return await asyncio.wait_for(
  File "/usr/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/usr/local/lib/python3.9/dist-packages/asyncssh/connection.py", line 437, in _connect
    _, session = await loop.create_connection(
  File "/usr/lib/python3.9/asyncio/base_events.py", line 1056, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.9/asyncio/base_events.py", line 1041, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.9/asyncio/base_events.py", line 955, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.9/asyncio/selector_events.py", line 502, in sock_connect
    return await fut
  File "/usr/lib/python3.9/asyncio/selector_events.py", line 537, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
OSError: [Errno 113] Connect call failed ('65.109.143.81', 22)

blokhin commented 1 year ago

It turned out that the machine 65.109.143.81 suddenly became inaccessible. The status changes are irrelevant.
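
One possible way to keep the listing from crashing on an unreachable machine is sketched below. It is not the actual yascheduler code: `collect_excerpts`, `fake_fetch` and `demo` are hypothetical names, and `fake_fetch` only stands in for the real SSH call. The idea is to catch `OSError` (Errno 113 is `EHOSTUNREACH` on Linux) per machine, once the retries are exhausted, and report the host as unreachable instead of letting the exception propagate out of `asyncio.run`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def collect_excerpts(
    hosts: Iterable[str],
    fetch_excerpt: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Fetch a log excerpt per host; report unreachable hosts instead of raising."""
    results: Dict[str, str] = {}
    for host in hosts:
        try:
            results[host] = await fetch_excerpt(host)
        except (OSError, asyncio.TimeoutError) as exc:
            # A failed connect (e.g. Errno 113) ends up here, so one dead
            # machine does not abort the listing for the remaining ones.
            results[host] = f"unreachable: {exc}"
    return results


async def fake_fetch(host: str) -> str:
    # Hypothetical stand-in for the SSH call; this one host always fails.
    if host == "65.109.143.81":
        raise OSError(113, f"Connect call failed ('{host}', 22)")
    return "last log lines..."


def demo() -> Dict[str, str]:
    # "10.0.0.1" is a made-up address used only for illustration.
    return asyncio.run(collect_excerpts(["10.0.0.1", "65.109.143.81"], fake_fetch))
```

With something like this, the listing would print a short "unreachable" note for the dead machine and continue with the others. Whether the unreachable host should also be marked in the database is a separate decision and is left out here.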