Closed by gilvikra 1 year ago
@gilvikra can you say more about why you have duplicate entries?
My initial impression is that I'm not 100% sure whether this is desirable behavior, since we currently tend towards transparently passing the value through to AWS.
This is a serious bug. Please fix it or provide a workaround. For distributed training across multiple instances we use as many network interfaces as possible: 4 for p4, 8 for trn1.32xlarge. (For reference, the corresponding CloudFormation template for Trainium: https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml ) Right now the config pasted below fails with this error:
Not all subnet IDs found: ('subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f')
worker_trn:
  min_workers: 2
  max_workers: 16
  # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
  # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
  # You can also set custom resources.
  # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
  # resources: {"CPU": 1, "GPU": 1, "custom": 5}
  resources: {"custom_trn_vcpu": 128, "custom_trn_nc": 32}
  #resources: {"custom_trn_vcpu": 8, "custom_trn_nc": 2}
  node_config:
    InstanceType: trn1.32xlarge
    #InstanceType: trn1.2xlarge
    ImageId: ami-0c8e2149bcc9ba840
    UserData: "\n#!/bin/bash \n\ntouch /home/ec2-user/TRN1_MC\nprintf '[neuron]\nname=Neuron YUM Repository\nbaseurl=https://yum.repos.neuron.amazonaws.com\nenabled=1\nmetadata_expire=0\n' > /etc/yum.repos.d/neuron.repo\nrpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB\n\nyum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y\n\nyum update -y\n\nyum install git -y\n\nyum remove aws-neuron-dkms -y\nyum remove aws-neuronx-dkms -y\nyum remove aws-neuronx-oci-hook -y\nyum remove aws-neuronx-runtime-lib -y\nyum remove aws-neuronx-collectives -y\nyum install aws-neuronx-dkms-2.* -y\nyum install aws-neuronx-oci-hook-2.* -y\nyum install aws-neuronx-runtime-lib-2.* -y\nyum install aws-neuronx-collectives-2.* -y\n\n\nyum remove aws-neuron-tools -y\nyum remove aws-neuronx-tools -y\nyum install aws-neuronx-tools-2.* -y\n\nexport PATH=/opt/aws/neuron/bin:$PATH\n\nyum install -y nvme-cli\n\nyum install -y mesa-libGL\nyum install -y htop\nyum install -y amazon-cloudwatch-agent\n/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a start -m ec2 -c default\n\n\namazon-linux-extras install lustre python3.8 -y\nwget https://bootstrap.pypa.io/get-pip.py\nupdate-alternatives --install /usr/bin/python python /usr/bin/python3.8 1\npython3.8 get-pip.py\n\n\npython3.8 -m pip config set global.extra-index-url 'https://pip.repos.neuron.amazonaws.com'\n\npython3.8 -m pip install neuronx-cc==2.* torch-neuronx torchvision torchmetrics\n\ntouch /home/ec2-user/TRN1_SETUP_DONE\ntouch /home/ec2-user/USER_DATA_SETUP_DONE\n"
    IamInstanceProfile:
      Arn: arn:aws:iam::146036223160:instance-profile/ray-autoscaler-v1
    KeyName: ray-alpha-us-east-1-key-pair
    EbsOptimized: True
    Placement:
      GroupName: A9VSPhotonUsEast1b
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#network-cards
    NetworkInterfaces:
      - AssociatePublicIpAddress: True
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 0
        NetworkCardIndex: 0
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 1
        NetworkCardIndex: 1
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 2
        NetworkCardIndex: 2
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 3
        NetworkCardIndex: 3
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 4
        NetworkCardIndex: 4
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 5
        NetworkCardIndex: 5
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 6
        NetworkCardIndex: 6
      - AssociatePublicIpAddress: False
        DeleteOnTermination: True
        InterfaceType: efa
        SubnetId: subnet-094edaca2285f2d5f
        Groups: [sg-08d7bdf993ee48e52]
        DeviceIndex: 7
        NetworkCardIndex: 7
    BlockDeviceMappings:
      # The root device is xvda for AL2. Ubuntu cannot mount more than 2 TB as the root volume by default: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-ubuntu-convert-mbr-to-gpt/, https://www.dolthub.com/blog/2022-05-02-use-more-than-2TB-ubuntu-ec2/
      #- DeviceName: /dev/sdb
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: 1000
          VolumeType: gp3
          #VolumeType: io2
          # can go up to 64000
          #Iops: 15000
Hmm @pdames would be good to get your thoughts here
What happened + What you expected to happen
If multiple network interfaces with the same subnet ID are specified in a node config, I get this error: "Not all subnet IDs found: {}"
Example node config:
The code causing the problem is in ray/python/ray/autoscaler/_private/aws/config.py
I think the list of subnet IDs should be converted to a set before the lengths are compared, so that repeated IDs are counted only once.
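To illustrate the suggestion, here is a minimal standalone sketch. It is not the code from config.py: it assumes the check compares the number of subnets returned by the EC2 lookup with the number of IDs requested, and the function names `dedupe_subnet_ids` and `check_all_subnets_found` are hypothetical. The `found` list stands in for the lookup result, which I would expect to contain each subnet once no matter how often it was requested.

```python
from typing import Iterable, Tuple


def dedupe_subnet_ids(subnet_ids: Iterable[str]) -> Tuple[str, ...]:
    """Return the unique subnet IDs, preserving first-seen order."""
    # dict.fromkeys keeps insertion order, which a plain set() does not.
    return tuple(dict.fromkeys(subnet_ids))


def check_all_subnets_found(
    requested_ids: Iterable[str], found_ids: Iterable[str]
) -> None:
    """Raise ValueError if a requested subnet ID is absent from found_ids.

    found_ids is a stand-in for the IDs returned by the EC2 lookup
    (hypothetical; no AWS call is made in this sketch).
    """
    found = set(found_ids)
    missing = [s for s in dedupe_subnet_ids(requested_ids) if s not in found]
    if missing:
        raise ValueError(f"Not all subnet IDs found: {missing}")


# Eight EFA interfaces that all share one subnet, as in the node config.
requested = ["subnet-094edaca2285f2d5f"] * 8
# Assumed lookup result: the subnet appears once.
found = ["subnet-094edaca2285f2d5f"]

# A plain length comparison sees 1 != 8 and would report a failure.
naive_ok = len(found) == len(requested)

# Comparing against the de-duplicated IDs accepts the same input.
check_all_subnets_found(requested, found)
```

With de-duplication the check still catches a subnet ID that is genuinely missing, because membership is tested per unique ID instead of by count.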
Versions / Dependencies
ray 2.3.0
Reproduction script
The relevant AWS YAML config snippet is already provided.
Issue Severity
High: It blocks me from completing my task.