gilvikra commented 1 year ago

What happened + What you expected to happen

If multiple network interfaces with the same subnet ID are specified in a node config, I get this error: "Not all subnet IDs found: {}"

Example node config:

    node_config:
      InstanceType: trn1.32xlarge
      ImageId: ami-042b39567497b9285
      UserData: "\n#!/bin/bash\nyum install -y htop\nyum install -y amazon-cloudwatch-agent\n/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a start  -m ec2 -c default\n"
      IamInstanceProfile:
        Arn: arn:aws:iam::146036223160:instance-profile/ray-autoscaler-v1
      EbsOptimized: True
      Placement:
        GroupName: A9VSPhoton1
      NetworkInterfaces:
        - AssociatePublicIpAddress: False
          DeleteOnTermination: True
          InterfaceType: efa
          SubnetId: subnet-094edaca2285f2d5f
          Groups: [sg-0b7b434da6b0c24c2]
          DeviceIndex: 0
          NetworkCardIndex: 0
        - AssociatePublicIpAddress: False
          DeleteOnTermination: True
          InterfaceType: efa
          SubnetId: subnet-094edaca2285f2d5f
          Groups: [sg-0b7b434da6b0c24c2]
          DeviceIndex: 1
          NetworkCardIndex: 1

The code causing the problem is in ray/python/ray/autoscaler/_private/aws/config.py:

    @lru_cache()
    def _get_subnets_or_die(ec2, subnet_ids: Tuple[str]):
        subnets = list(
            ec2.subnets.filter(Filters=[{"Name": "subnet-id", "Values": list(subnet_ids)}])
        )

        # TODO: better error message
        cli_logger.doassert(
            len(subnets) == len(subnet_ids), "Not all subnet IDs found: {}", subnet_ids
        )
        assert len(subnets) == len(subnet_ids), "Subnet ID not found: {}".format(subnet_ids)
        return subnets

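The filter returns each matching subnet only once, so with duplicate IDs len(subnets) is smaller than len(subnet_ids) and the check fails even though every subnet exists. I think we should convert the IDs to a set before comparing lengths.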
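A minimal sketch of that idea, not the actual Ray patch. The function name `get_subnets_or_die_dedup` is hypothetical, and it assumes `ec2` is a boto3 EC2 service resource (or anything exposing `subnets.filter(...)` that returns objects with an `id` attribute). It deduplicates the IDs before querying and reports only the IDs that are genuinely missing:

```python
from functools import lru_cache
from typing import Tuple


@lru_cache()
def get_subnets_or_die_dedup(ec2, subnet_ids: Tuple[str, ...]):
    """Hypothetical variant of _get_subnets_or_die that tolerates duplicate IDs."""
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique_ids = tuple(dict.fromkeys(subnet_ids))

    subnets = list(
        ec2.subnets.filter(
            Filters=[{"Name": "subnet-id", "Values": list(unique_ids)}]
        )
    )

    # Compare against the unique IDs, and name the ones that were not returned.
    found = {subnet.id for subnet in subnets}
    missing = [sid for sid in unique_ids if sid not in found]
    assert not missing, "Subnet IDs not found: {}".format(missing)
    return subnets
```

With eight interfaces on the same subnet, this queries for one ID and expects one subnet back, so the check passes. An ID that does not exist still trips the assertion.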

Versions / Dependencies

ray 2.3.0

Reproduction script

The relevant AWS YAML config snippet is provided above.

Issue Severity

High: It blocks me from completing my task.

wuisawesome commented 1 year ago

@gilvikra can you say more about why you have duplicate entries?

My initial impression is that I'm not sure this is desirable behavior, since we currently tend to pass the value through to AWS transparently.

gilvikra commented 1 year ago

This is a serious bug. Please fix it or provide a workaround. For distributed training across multiple instances we use as many network interfaces as possible: 4 for p4, 8 for trn1.32xlarge. (For reference, the corresponding CloudFormation template for Trainium: https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml) Right now the config pasted below fails with this error:

Not all subnet IDs found: ('subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f', 'subnet-094edaca2285f2d5f')

worker_trn:
# To experiment with autoscaling, set min_workers to 0.
min_workers: 2
max_workers: 16
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {"custom_trn_vcpu": 128, "custom_trn_nc": 32}
#resources: {"custom_trn_vcpu": 8, "custom_trn_nc": 2}
node_config:
  InstanceType: trn1.32xlarge
  #InstanceType: trn1.2xlarge
  ImageId: ami-0c8e2149bcc9ba840
  UserData: "\n#!/bin/bash \n\ntouch /home/ec2-user/TRN1_MC\nprintf '[neuron]\nname=Neuron YUM Repository\nbaseurl=https://yum.repos.neuron.amazonaws.com\nenabled=1\nmetadata_expire=0\n' > /etc/yum.repos.d/neuron.repo\nrpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB\n\nyum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y\n\nyum update -y\n\nyum install git -y\n\nyum remove aws-neuron-dkms -y\nyum remove aws-neuronx-dkms -y\nyum remove aws-neuronx-oci-hook -y\nyum remove aws-neuronx-runtime-lib -y\nyum remove aws-neuronx-collectives -y\nyum install aws-neuronx-dkms-2.*  -y\nyum install aws-neuronx-oci-hook-2.*  -y\nyum install aws-neuronx-runtime-lib-2.*  -y\nyum install aws-neuronx-collectives-2.*  -y\n\n\nyum remove aws-neuron-tools  -y\nyum remove aws-neuronx-tools  -y\nyum install aws-neuronx-tools-2.*  -y\n\nexport PATH=/opt/aws/neuron/bin:$PATH\n\nyum  install -y  nvme-cli\n\nyum install -y mesa-libGL\nyum install -y htop\nyum install -y amazon-cloudwatch-agent\n/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a start  -m ec2 -c default\n\n\namazon-linux-extras install lustre python3.8 -y\nwget https://bootstrap.pypa.io/get-pip.py\nupdate-alternatives --install /usr/bin/python python /usr/bin/python3.8 1\npython3.8 get-pip.py\n\n\npython3.8 -m pip config set global.extra-index-url 'https://pip.repos.neuron.amazonaws.com'\n\npython3.8 -m pip install neuronx-cc==2.* torch-neuronx torchvision torchmetrics\n\ntouch /home/ec2-user/TRN1_SETUP_DONE\ntouch /home/ec2-user/USER_DATA_SETUP_DONE\n"  
  IamInstanceProfile:
    Arn: arn:aws:iam::146036223160:instance-profile/ray-autoscaler-v1
  KeyName: ray-alpha-us-east-1-key-pair
  EbsOptimized: True
  Placement:
    GroupName: A9VSPhotonUsEast1b
  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#network-cards    
  NetworkInterfaces:
    - AssociatePublicIpAddress: True
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 0
      NetworkCardIndex: 0
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 1 
      NetworkCardIndex: 1 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 2 
      NetworkCardIndex: 2 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 3 
      NetworkCardIndex: 3 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 4 
      NetworkCardIndex: 4 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 5 
      NetworkCardIndex: 5 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 6 
      NetworkCardIndex: 6 
    - AssociatePublicIpAddress: False 
      DeleteOnTermination: True
      InterfaceType: efa
      SubnetId: subnet-094edaca2285f2d5f
      Groups: [sg-08d7bdf993ee48e52]
      DeviceIndex: 7 
      NetworkCardIndex: 7 
  BlockDeviceMappings:
    # The root device is xvda for AL2. Ubuntu cannot mount more than 2TB by default as the root volume, see https://aws.amazon.com/premiumsupport/knowledge-center/ec2-ubuntu-convert-mbr-to-gpt/ and https://www.dolthub.com/blog/2022-05-02-use-more-than-2TB-ubuntu-ec2/
    #- DeviceName: /dev/sdb
    - DeviceName: /dev/xvda
      Ebs:
        VolumeSize: 1000
        VolumeType: gp3
        #VolumeType: io2
        # can go up to 64000
        #Iops: 15000
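To make the shape of the NetworkInterfaces list above easier to see, here is a small sketch that builds the same eight entries in Python and shows why the length check trips. The helper name `build_efa_interfaces` is hypothetical and not part of Ray. It only mirrors the YAML above: one EFA interface per network card, all on the same subnet, with a public IP only on the first.

```python
from typing import Any, Dict, List


def build_efa_interfaces(
    subnet_id: str, security_group: str, count: int
) -> List[Dict[str, Any]]:
    """Build `count` EFA interface entries that all share one subnet."""
    return [
        {
            # Only the first interface gets a public IP, as in the config above.
            "AssociatePublicIpAddress": index == 0,
            "DeleteOnTermination": True,
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": [security_group],
            "DeviceIndex": index,
            "NetworkCardIndex": index,
        }
        for index in range(count)
    ]


interfaces = build_efa_interfaces(
    "subnet-094edaca2285f2d5f", "sg-08d7bdf993ee48e52", count=8
)

# The IDs collected from the interfaces contain the same subnet eight times,
# while a subnet lookup can return that subnet only once.
subnet_ids = tuple(entry["SubnetId"] for entry in interfaces)
print(len(subnet_ids), len(set(subnet_ids)))
```

The two printed numbers differ (eight entries, one unique subnet), which is the mismatch behind the "Not all subnet IDs found" error.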

wuisawesome commented 1 year ago

Hmm @pdames would be good to get your thoughts here

ray-project / ray

[aws][autoscaler] Allow duplicate entries in NetworkInterfaces #33586
