Closed: MarselScheer closed this issue 1 year ago
That's likely because of the AMI (Deep Learning AMI), which comes baked with a number of ML frameworks (please see the AMI description). We could always switch to CPU and use a more recent AMI (based on the OS).
aws ec2 describe-images --image-ids ami-0387d929287ab193e --region us-west-2
{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2022-06-09T17:20:48.000Z",
            "ImageId": "ami-0387d929287ab193e",
            "ImageLocation": "amazon/Deep Learning AMI (Ubuntu 18.04) Version 61.0",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "898082745236",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 3000,
                        "SnapshotId": "snap-04db122cb9d19ae16",
                        "VolumeSize": 140,
                        "VolumeType": "gp3",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "Description": "MXNet-1.9, TensorFlow-2.7, PyTorch-1.11, Neuron, & others. NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker & EFA support. For fully managed experience, check: https://aws.amazon.com/sagemaker",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "ImageOwnerAlias": "amazon",
            "Name": "Deep Learning AMI (Ubuntu 18.04) Version 61.0",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm"
        }
    ]
}
Thanks!
Based on your hint, I tried to use the Ray documentation for guidance on finding an AMI. However, I was not really successful with that. Anyway, just for the record, here is how I found a suitable AMI for my purpose (small EBS volume but CUDA 11 support):
aws ec2 describe-images --region us-west-2 --filters "Name=name,Values=*Deep Learning AMI GPU CUDA 11*"
and then I just skimmed through the available AMIs. I don't know whether that is a good approach and will always yield AMIs that can be used for a Ray cluster, but at least I was able to deploy a Ray cluster using the first AMI (ami-0e274207681ab9030) that I picked.
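Instead of skimming manually, the `describe-images` output can be sorted by `CreationDate` to pick the newest match. A minimal sketch of that selection step, using hypothetical image entries in place of the real AWS response:

```python
# Sketch: given the parsed JSON from `aws ec2 describe-images ... --filters
# "Name=name,Values=*Deep Learning AMI GPU CUDA 11*"`, pick the newest AMI.
# The entries below are hypothetical placeholders, not real AMI IDs.
images = [
    {"ImageId": "ami-aaa", "CreationDate": "2022-01-15T00:00:00.000Z"},
    {"ImageId": "ami-bbb", "CreationDate": "2022-06-09T17:20:48.000Z"},
    {"ImageId": "ami-ccc", "CreationDate": "2021-11-02T00:00:00.000Z"},
]

# ISO-8601 timestamps with a fixed UTC suffix sort correctly as plain strings,
# so max() over CreationDate yields the most recently published image.
newest = max(images, key=lambda img: img["CreationDate"])
print(newest["ImageId"])  # -> ami-bbb
```

The same idea works directly on the CLI by piping through `jq` or by passing `--query` to sort server-side, but the string-comparison trick above is the core of it.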
Description
I set up a cluster on AWS following Launching Ray Clusters on AWS, using this version of the config AWS-example-full.yaml. The disk size in the config is specified as 140 GB, and I was very surprised to see that the freshly created head node already used 129 GB of the 140 GB. The Docker image is around 23 GB, which means that over 100 GB were already in use on top of that. This stands out especially if one peeks into the config GCP-example-full.yaml, where the configuration reserves only 50 GB for the head node, which seems more reasonable (though I did not test whether the GCP config really deploys a functional cluster).
Maybe everything that is necessary for the nodes really does take more than 100 GB, but then it should be noted in the AWS section, because users like me may think that they did something wrong and spend time investigating it :-). In general, I would also be interested in things one could try in order to free up the disk space.
Link
https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html