rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

EC2-MNMG instructions do not connect to workers within NVIDIA AWS instances #308

Open taureandyernv opened 9 months ago

taureandyernv commented 9 months ago

Following the EC2-MNMG cluster setup instructions creates the worker instances (and they can be shut down properly), but the workers do not connect to the cluster.

https://docs.rapids.ai/deployment/nightly/cloud/aws/ec2-multi/#cluster-setup

This may be a network issue with NVIDIA security groups, or an issue with dask-cloudprovider, which reportedly needs more resources. Found during 23.12 deployment testing. cc @aravenel @jacobtomlinson

Proposed solutions:

  1. Test with a more permissive security group, closer to what our customers or an average user would have.
  2. If the issue persists after step 1, allocate resources to fix any bugs in dask-cloudprovider.
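
To help tell a security group problem apart from a dask-cloudprovider bug, a quick reachability check can be run from the client machine (or from a worker instance) against the scheduler's address. This is only a minimal sketch, not part of the deployment docs: the helper names `port_is_reachable` and `check_dask_ports` are made up for illustration, and it assumes the cluster uses Dask's default ports (8786 for the scheduler, 8787 for the dashboard). Workers listen on randomly chosen ports by default, so a passing check here does not rule out blocked worker traffic.

```python
import socket

# Dask's default ports; adjust if the cluster was started with custom ones.
DASK_PORTS = {"scheduler": 8786, "dashboard": 8787}


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_dask_ports(host: str, ports: dict = DASK_PORTS, timeout: float = 3.0) -> dict:
    """Map each named port to whether it accepted a TCP connection."""
    return {
        name: port_is_reachable(host, port, timeout)
        for name, port in ports.items()
    }
```

Calling `check_dask_ports("<scheduler-ip>")` with the scheduler instance's address returns a dictionary such as `{"scheduler": ..., "dashboard": ...}`. If the ports are unreachable under the NVIDIA security group but reachable under a more permissive one, that points to proposed solution 1 rather than to a bug in dask-cloudprovider.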