Open dudeperf3ct opened 2 months ago
Hi @dudeperf3ct, if you are running out of space when the image is pulled, you likely need to increase the size of the EBS root volume attached to the instance. You can refer to this guide on how to do that.
I'd recommend using the block device mapping to provision a larger root volume for any future EKS cluster deployments.
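As a rough illustration of the block device mapping suggestion, the sketch below builds the `BlockDeviceMappings` structure that an EC2 launch template accepts. The helper name `root_volume_mapping`, the 200 GiB size, and the `/dev/xvda` root device name are assumptions for illustration; the actual root device name depends on the AMI your node group uses.

```python
import json


def root_volume_mapping(size_gib, device_name="/dev/xvda", volume_type="gp3"):
    """Build a BlockDeviceMappings list for an EC2 launch template.

    device_name is an assumption: check the root device name of your AMI.
    """
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return [
        {
            "DeviceName": device_name,
            "Ebs": {
                "VolumeSize": size_gib,  # size in GiB
                "VolumeType": volume_type,
                "DeleteOnTermination": True,
            },
        }
    ]


if __name__ == "__main__":
    # Print the mapping so it can be copied into a launch template definition.
    print(json.dumps(root_volume_mapping(200), indent=2))
```

The resulting list can be placed under the launch template data used by the node group; how it is wired in depends on the tool used to create the cluster.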
@askulkarni2 The EKS instance started with 150GB of disk space. Podman keeps pulling the image from ECR in what looks like an infinite loop, and that is what makes it run out of space.
Attaching a screenshot of logs in raylet.err. Some of the layers are being pulled multiple times. Only one container is specified in serveConfigV2 above. I expected the application to start once podman pulls all layers but instead it keeps pulling the same container from ECR.
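To put numbers on the repeated pulls rather than reading them off a screenshot, a small script can count how often each layer shows up in the log text. This is only a sketch: it assumes podman writes lines of the form `Copying blob <digest>` (optionally prefixed with `sha256:`) into raylet.err, and the function names are made up for illustration. A single pull may also print the same blob more than once, so treat the counts as a hint, not proof.

```python
import re
from collections import Counter

# Assumed log format: "Copying blob sha256:<hex>" or "Copying blob <short hex> done".
BLOB_RE = re.compile(r"Copying blob (?:sha256:)?([0-9a-f]{12,64})")


def count_blob_pulls(log_text):
    """Count occurrences of each layer digest, keyed by its first 12 hex chars."""
    return Counter(m.group(1)[:12] for m in BLOB_RE.finditer(log_text))


def repeated_blobs(log_text, threshold=2):
    """Return the digests that appear at least `threshold` times."""
    counts = count_blob_pulls(log_text)
    return {digest: n for digest, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    import sys

    # Usage: python count_pulls.py /tmp/ray/session_latest/logs/raylet.err
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for digest, n in sorted(repeated_blobs(fh.read()).items()):
            print(f"{digest}: pulled {n} times")
```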
@zcin when you have a sec, could you help look at this?
What happened + What you expected to happen
I am trying to run the experimental feature of running multiple applications in different containers on EKS.
I will include the exact steps in the Reproduction script section. After deploying the application on EKS,
True
actually messes with the authorization
Guide: https://docs.ray.io/en/latest/serve/advanced-guides/multi-app-container.html
Versions / Dependencies
Ray - 2.11.0
Python - 3.10.14
Official docker image: rayproject/ray:latest-py310-cpu
Reproduction script
This reproduction script is specific to AWS. Two resources are required for this - ECR and EKS.
Create two repositories on ECR - translatorapp and customrayimage
Use the following Dockerfiles to build and push to ECR
translator.Dockerfile: Use an example Ray application shown here.
custom_ray.Dockerfile: Since podman is required for this experimental feature, we add it as a dependency and create a custom Ray image.
Two things need to be configured here
a. Replacing <aws-ecr-password> with the output of aws ecr get-login-password --region <your-aws-region>
b. Replacing <accid>.dkr.ecr.<aws-region>.amazonaws.com with your private URL for ECR.
Build and push both images to ECR. I used podman for this.
Create an EKS cluster (I used an m7i.xlarge instance for testing this).
Install the kuberay operator
Run the following serve_config.yaml on the EKS cluster (kubectl apply -f serve_config.yaml).
serve_config.yaml: This configuration file deploys only one container for now, but we can easily extend serveConfigV2 to add multiple containers.
I also added the following to the Ray head and worker group spec, but with these added, podman was not able to pull images from ECR.
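Separately from the head and worker spec change, and going back to the point about extending serveConfigV2: the sketch below shows roughly what a multi-container `applications` list could look like, generated from Python. The `runtime_env.container` keys (`image`, `worker_path`) follow my reading of the linked guide for Ray 2.11 and should be checked against it. The second application (`secondapp`), both `import_path` values, and the `worker_path` default are hypothetical placeholders.

```python
import json

# Assumption: location of default_worker.py inside a py310 rayproject/ray image.
WORKER_PATH = (
    "/home/ray/anaconda3/lib/python3.10/site-packages/"
    "ray/_private/workers/default_worker.py"
)
REGISTRY = "<accid>.dkr.ecr.<aws-region>.amazonaws.com"  # placeholder from the steps above


def container_app(name, import_path, image, worker_path=WORKER_PATH):
    """Describe one Serve application that runs in its own container image."""
    return {
        "name": name,
        "import_path": import_path,
        "route_prefix": f"/{name}",
        "runtime_env": {
            "container": {"image": image, "worker_path": worker_path},
        },
    }


def serve_config(apps):
    """Wrap applications into a config dict, rejecting duplicate names."""
    names = [app["name"] for app in apps]
    if len(set(names)) != len(names):
        raise ValueError("application names must be unique")
    return {"applications": apps}


if __name__ == "__main__":
    config = serve_config([
        # import_path values are hypothetical module:variable names.
        container_app("translatorapp", "translator:app", f"{REGISTRY}/translatorapp:latest"),
        container_app("secondapp", "second:app", f"{REGISTRY}/secondapp:latest"),
    ])
    print(json.dumps(config, indent=2))
```

The printed JSON would still need to be adapted into the serveConfigV2 field of the RayService manifest by hand.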
Issue Severity
Low: It annoys or frustrates me.