skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars 513 forks

[AWS] SSH issue when a large number of nodes are used in a cluster #4305

Open Michaelvll opened 1 week ago

Michaelvll commented 1 week ago

When a user launches a cluster with a large number of nodes, a small fraction of the nodes sometimes cannot be reached over SSH. Manually stopping the affected instance in the console and relaunching fixes it.

EC2 Instance Connect is unable to connect to your instance. Ensure your instance network settings are configured correctly for EC2 Instance Connect. For more information, see EC2 Instance Connect Prerequisites at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-prerequisites.html.
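To narrow down which nodes are affected before stopping them by hand, a quick parallel probe of the SSH port can help. The following is a minimal sketch, not part of SkyPilot: the helper names `ssh_port_open` and `find_unreachable` are hypothetical, the IP addresses in the usage comment are placeholders, and an open TCP port only suggests that sshd is listening; it does not prove that key-based login works.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def find_unreachable(hosts, port=22, timeout=5.0, max_workers=32):
    """Probe all hosts in parallel; return those whose port is not reachable."""
    hosts = list(hosts)
    if not hosts:
        return []
    workers = min(max_workers, len(hosts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda h: ssh_port_open(h, port, timeout), hosts))
    # Preserve the input order so results map back to the node list.
    return [h for h, ok in zip(hosts, results) if not ok]


# Hypothetical usage (replace with the cluster's node IPs):
# bad_nodes = find_unreachable(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
# print("Nodes with no SSH port reachable:", bad_nodes)
```

The probe runs in a thread pool so that a few timing-out nodes do not serialize the check across a large cluster.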

Version & Commit info: