mila-iqia / milatools

Tools to connect to and interact with the Mila cluster
MIT License

[v0.1.2] Intermittent connection errors ("ssh-copy-id appears to have failed", "An error occurred while trying to establish a connection with mila") #102

Closed lebrice closed 6 months ago

lebrice commented 7 months ago

Intermittent errors during mila init and other commands:

'ssh-copy-id mila' appears to have failed!
 ERROR: An error happened while trying to establish a connection with mila
        -The cluster might be under maintenance
           Check #mila-cluster for updates on the state of the cluster
        -Check the status of your connection to the cluster by ssh'ing onto it.
        -Retry connecting with mila
        -Try to exclude the node with -x mila parameter

For example:

 $ mila -vvv init
(...)
Checking connection to compute nodes
 (...)
[02/14/24 14:19:21] ERROR    2024-02-14 14:19:21,979 - ERROR - Exception (client): Error reading SSH protocol banner                                      transport.py:1893
                    ERROR    2024-02-14 14:19:21,985 - ERROR - Traceback (most recent call last):                                                         transport.py:1891
                    ERROR    2024-02-14 14:19:21,988 - ERROR -   File                                                                                     transport.py:1891
                             "/home/fabrice/miniconda3/envs/milatools/lib/python3.11/site-packages/paramiko/transport.py", line 2292, in _check_banner                     
                    ERROR    2024-02-14 14:19:21,991 - ERROR -     buf = self.packetizer.readline(timeout)                                                transport.py:1891
                    ERROR    2024-02-14 14:19:21,993 - ERROR -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                transport.py:1891
                    ERROR    2024-02-14 14:19:21,994 - ERROR -   File                                                                                     transport.py:1891
                             "/home/fabrice/miniconda3/envs/milatools/lib/python3.11/site-packages/paramiko/packet.py", line 374, in readline                              
                    ERROR    2024-02-14 14:19:21,996 - ERROR -     buf += self._read_timeout(timeout)                                                     transport.py:1891
                    ERROR    2024-02-14 14:19:21,998 - ERROR -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                     transport.py:1891
                    ERROR    2024-02-14 14:19:21,999 - ERROR -   File                                                                                     transport.py:1891
                             "/home/fabrice/miniconda3/envs/milatools/lib/python3.11/site-packages/paramiko/packet.py", line 611, in _read_timeout                         
[02/14/24 14:19:22] ERROR    2024-02-14 14:19:22,001 - ERROR -     raise socket.timeout()                                                                 transport.py:1891
                    ERROR    2024-02-14 14:19:22,002 - ERROR - TimeoutError                                                                               transport.py:1891
                    ERROR    2024-02-14 14:19:22,004 - ERROR -                                                                                            transport.py:1891
                    ERROR    2024-02-14 14:19:22,005 - ERROR - During handling of the above exception, another exception occurred:                        transport.py:1891
                    ERROR    2024-02-14 14:19:22,007 - ERROR -                                                                                            transport.py:1891
                    ERROR    2024-02-14 14:19:22,008 - ERROR - Traceback (most recent call last):                                                         transport.py:1891
                    ERROR    2024-02-14 14:19:22,009 - ERROR -   File                                                                                     transport.py:1891
                             "/home/fabrice/miniconda3/envs/milatools/lib/python3.11/site-packages/paramiko/transport.py", line 2113, in run                               
                    ERROR    2024-02-14 14:19:22,010 - ERROR -     self._check_banner()                                                                   transport.py:1891
                    ERROR    2024-02-14 14:19:22,011 - ERROR -   File                                                                                     transport.py:1891
                             "/home/fabrice/miniconda3/envs/milatools/lib/python3.11/site-packages/paramiko/transport.py", line 2296, in _check_banner                     
                    ERROR    2024-02-14 14:19:22,012 - ERROR -     raise SSHException(                                                                    transport.py:1891
                    ERROR    2024-02-14 14:19:22,013 - ERROR - paramiko.ssh_exception.SSHException: Error reading SSH protocol banner                     transport.py:1891
                    ERROR    2024-02-14 14:19:22,014 - ERROR -                                                                                            transport.py:1891
ERROR: An error happened while trying to establish a connection with mila
        -The cluster might be under maintenance
           Check #mila-cluster for updates on the state of the cluster
        -Check the status of your connection to the cluster by ssh'ing onto it.
        -Retry connecting with mila
        -Try to exclude the node with -x mila parameter

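All of these failures seem to be caused by paramiko's "Error reading SSH protocol banner" exception: in the traceback above, reading the server's banner times out (TimeoutError) and paramiko re-raises it as an SSHException.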
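Since the failure is intermittent, one possible mitigation is to retry the connection a few times with a growing delay. The snippet below is only a minimal sketch of that idea, not the fix that was applied in milatools: `retry_connection` and its parameters are hypothetical names, and it uses only the standard library so it can be tried without a cluster account.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_connection(
    connect: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, retrying only on `retry_on` errors.

    Hypothetical helper (not part of milatools or paramiko). The delay
    doubles after every failed attempt; the last error is re-raised once
    `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

With paramiko, `connect` could be a small function that calls `SSHClient.connect(...)` and `retry_on` could be `(paramiko.SSHException,)`. `SSHClient.connect` also accepts a `banner_timeout` argument, so raising that value may be worth trying as well. Whether either change actually removes the error on the Mila cluster is an assumption that would need to be checked.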