ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Cluster, AWS]: Necessary EBS size seems to be surprisingly big #38728

Closed · MarselScheer closed this issue 1 year ago

MarselScheer commented 1 year ago

Description

I set up a cluster on AWS following Launching Ray Clusters on AWS, using this version of the config: AWS-example-full.yaml. The disk size in the config is specified as 140 GB, and I was very surprised to see that the freshly created head node already used 129 GB of those 140 GB. The Docker image is around 23 GB, which means that more than 100 GB were already used by something else. This is especially striking when compared to GCP-example-full.yaml, where the configuration reserves only 50 GB for the head node, which seems much more reasonable (though I did not test whether the GCP config actually deploys a functional cluster).

Maybe everything that is necessary for the nodes really does add up to more than 100 GB, but then it should be noted in the AWS section, because users like me will otherwise think they did something wrong and spend time investigating it :-). In general, I would also be interested in things one could try in order to free up the disk space.
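For reference, a minimal sketch of how one could check what actually occupies the volume on the head node, assuming shell access to the instance (for example via ray attach with the cluster config, or plain SSH); the commands below are only illustrative:

# Open a shell on the head node of the running cluster.
ray attach AWS-example-full.yaml

# Inside that shell: overall usage of the root volume.
df -h /

# Inside that shell: largest top-level directories on the root filesystem.
sudo du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -20

# Inside that shell: space taken by Docker images, containers, volumes and build cache.
docker system df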

Link

https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html

chappidim commented 1 year ago

That's likely because of the AMI (a Deep Learning AMI) that comes baked with a number of ML frameworks (please see the AMI description below). We can always switch to a CPU setup and use some latest plain AMI (based on the OS).

aws ec2 describe-images --image-ids ami-0387d929287ab193e --region us-west-2
{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2022-06-09T17:20:48.000Z",
            "ImageId": "ami-0387d929287ab193e",
            "ImageLocation": "amazon/Deep Learning AMI (Ubuntu 18.04) Version 61.0",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "898082745236",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 3000,
                        "SnapshotId": "snap-04db122cb9d19ae16",
                        "VolumeSize": 140,
                        "VolumeType": "gp3",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "Description": "MXNet-1.9, TensorFlow-2.7, PyTorch-1.11, Neuron, & others. NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker & EFA support. For fully managed experience, check: https://aws.amazon.com/sagemaker",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "ImageOwnerAlias": "amazon",
            "Name": "Deep Learning AMI (Ubuntu 18.04) Version 61.0",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm"
        }
    ]
}
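For example, something along these lines should pick up the newest plain Ubuntu 22.04 AMI; the owner ID is Canonical's public account and the name filter is just one possible pattern, so adjust both as needed. The resulting ImageId can then be set as the ImageId in the node_config section of the cluster config:

# Newest plain Ubuntu 22.04 AMI published by Canonical (owner 099720109477) in us-west-2.
aws ec2 describe-images \
    --region us-west-2 \
    --owners 099720109477 \
    --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
              "Name=state,Values=available" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text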
MarselScheer commented 1 year ago

Thanks!

Based on your hint, I tried to use the Ray documentation for guidance on finding an AMI. However, I was not really successful with that. Anyway, just for the record, here is how I found a suitable AMI for my purpose (small EBS volume, but with CUDA 11 support):

aws ec2 describe-images --region us-west-2 --filters "Name=name,Values=*Deep Learning AMI GPU CUDA 11*"

and then I just skimmed through the available AMIs. I don't know whether this is a good approach and whether it will always yield AMIs that work for a Ray cluster, but at least I was able to deploy a functional Ray cluster using the first AMI (ami-0e274207681ab9030) that I picked.
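To make the skimming a bit easier, the same filter can also print each image's root-volume size, so candidates with a small EBS requirement stand out; the --query expression below is just one way to slice the output:

# List CUDA 11 Deep Learning AMIs together with the size (GiB) of their root EBS volume.
aws ec2 describe-images \
    --region us-west-2 \
    --filters "Name=name,Values=*Deep Learning AMI GPU CUDA 11*" \
    --query 'sort_by(Images, &CreationDate)[].{Id: ImageId, Name: Name, RootGiB: BlockDeviceMappings[0].Ebs.VolumeSize}' \
    --output table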