I found the main problem behind this symptom.
The OOM error occurs inside Docker on an inf2 instance (specifically, the Docker OOM only occurs on EC2 instances without the Hugging Face AMI).
When I use the Hugging Face AMI with an inf2.xlarge instance, my SageMaker Docker environment works very well, without any error.
But when I build and run the Docker image on the plain AWS Ubuntu AMI, it shows me an OOM error inside the container.
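A quick way to check whether a cgroup memory limit on the container (rather than host memory) explains the Docker-only OOM, as a minimal sketch run inside the container, covering both cgroup v2 and v1 paths:

```python
# Diagnostic sketch: read the container's cgroup memory limit and compare it
# with total host memory. "max" (cgroup v2) means no limit is set.
import os

def container_memory_limit_bytes():
    for path in ("/sys/fs/cgroup/memory.max",                      # cgroup v2
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):   # cgroup v1
        if os.path.exists(path):
            value = open(path).read().strip()
            return None if value == "max" else int(value)
    return None

limit = container_memory_limit_bytes()
host_total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"container limit: {limit}, host total: {host_total}")
```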
System Info
Who can help?
@dacorvo @JingyaHuang
Information
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
https://www.philschmid.de/inferentia2-stable-diffusion-xl
I followed this example with the newest optimum-neuron version, and it failed.
I also set the requirements to match the local environment.
My goal is to compile an SDXL Turbo model and then create a SageMaker endpoint with it, but it failed with these CloudWatch logs.
There is no error log, but the workers go down.
I have no idea how to handle this.
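For reference, this is the compilation step, following the blog post (a minimal sketch; the Turbo checkpoint id and the input shapes are my choices):

```python
# Minimal compile sketch following the blog post; model id and shapes are my choices.
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/sdxl-turbo"   # the SDXL Turbo checkpoint I target
input_shapes = {
    "batch_size": 1,
    "height": 512,                    # Turbo's native resolution; 1024 for SDXL base
    "width": 512,
    "num_images_per_prompt": 1,
}

# export=True compiles the model for Inferentia2 while loading
pipeline = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id, export=True, **input_shapes
)
pipeline.save_pretrained("sdxl_turbo_neuron/")  # artifacts to upload to S3 for SageMaker
```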
I also failed the test with the official SDXL model that the example uses, 'stabilityai/stable-diffusion-xl-base-1.0'.
An SDXL model compiled in the Neuron SDK 2.15 environment works well in SageMaker. But when I upgrade to 2.16 or 2.17, or compile an SDXL Turbo model even in the 2.15 environment, loading the model fails in SageMaker.
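Since the failure tracks the Neuron SDK version used at compile time, a mismatch between the compile-time and serve-time compiler versions seems likely. A quick check, assuming optimum-neuron records the compiler version under a "neuron" key in the exported config.json (the unet/ artifact path below is hypothetical):

```python
# Sketch: compare the compiler version recorded in the artifacts with the one
# installed in the serving image. The "neuron" config key and the unet/ path
# are assumptions about the exported layout.
import json
from importlib.metadata import version

with open("sdxl_turbo_neuron/unet/config.json") as f:
    neuron_meta = json.load(f).get("neuron", {})

print("compiled with:", neuron_meta.get("compiler_version"))
print("serving image has neuronx-cc:", version("neuronx-cc"))
```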
Expected behavior
Surely it has to work the same as in the EC2 environment.
In the EC2 environment it works very well (I can load the compiled model and run the predict function).
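For comparison, this is what works on the EC2 instance (a minimal sketch; the local path and prompt are mine):

```python
# Loading the precompiled pipeline and running a prediction locally on EC2.
from optimum.neuron import NeuronStableDiffusionXLPipeline

pipeline = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_turbo_neuron/")
image = pipeline(prompt="a photo of an astronaut riding a horse").images[0]
image.save("out.png")
```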