Open romilbhardwaj opened 2 months ago
I'm seeing the same thing with skypilot==0.6.1 when trying to start a cluster on Lambda Labs. After the error, the node(s) are indeed running in Lambda. Also, running "sky down" does not tear down the node(s).
@sfrolich - What GPU are you using? Do you see this for GPUs other than V100?
Was using A10:1 in this particular test. Given where the error occurs, I presume it affects all Lambda VM types, but I could be wrong.
I found in provision.log that there was a connection reset error before the "Failed get obtain private IP from node" error. I looked at the Lambda Labs Firewall page and someone in my org had removed the SSH:22 rule. Once I put it back, this error went away.
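For anyone else hitting this, a quick way to rule out the firewall before digging through logs is to check whether TCP port 22 on the instance is reachable from your machine. This is only a minimal sketch using the Python standard library, not anything from SkyPilot itself; the function name `ssh_port_reachable` and the placeholder IP in the usage comment are made up for illustration.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability,
    not whether SSH authentication would succeed.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, connection reset, timeouts, and DNS failures.
        return False


# Example usage (replace the placeholder with the instance's public IP
# shown in the Lambda Labs console):
#   print(ssh_port_reachable("203.0.113.10"))
```

If this returns False for a running instance, a missing SSH:22 firewall rule like the one described above is one possible cause, though an instance that is still booting would look the same.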
Also puts SkyPilot in a bad state, where "sky down" also fails with the same error. Have to manually terminate the instance and clean up local SkyPilot state.db.
sky -c: c7ee6a22c2fbb912bbe8f83b2931062e6b4e4490-dirty