huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

SDXL compiled model load fails in SageMaker only (it works well in an EC2 instance) #540

Closed Suprhimp closed 5 months ago

Suprhimp commented 6 months ago

System Info

optimum-neuron versions 0.0.16 through 0.0.20: the model load fails in every case.
neuronx version 2.16 / 2.17
I compiled the model with optimum-neuron on an AWS inf2 instance.

Who can help?

@dacorvo @JingyaHuang

Information

Tasks

Reproduction (minimal, reproducible, runnable)

https://www.philschmid.de/inferentia2-stable-diffusion-xl

I followed this example with the newest optimum-neuron version, and it failed.

I also set the requirements to match the local (compilation) environment.
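
For context, the compilation step from the example looks roughly like this (a sketch; the model id, input shapes, and compiler arguments shown here are illustrative, not necessarily the exact values I used):

```python
# Sketch of the Neuron compilation step on the inf2 instance, following the example
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # the same failure happens with the SDXL Turbo model
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}

pipeline = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,          # compile for Neuron instead of loading precompiled artifacts
    **compiler_args,
    **input_shapes,
)
pipeline.save_pretrained("sdxl_neuron/")  # artifacts are later packaged for the SageMaker endpoint
```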

My goal is to compile an SDXL Turbo model and then create a SageMaker endpoint with it, but it fails with the CloudWatch logs below.


2024-03-29T13:55:59.045+09:00 | 2024-03-29T04:55:57,004 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading only U-Net into both Neuron Cores...
2024-03-29T13:56:03.815+09:00 | 2024-03-29T04:55:58,881 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:33828 "GET /ping HTTP/1.1" 200 18
...
2024-03-29T13:56:54.923+09:00 | 2024-03-29T04:56:53,820 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:37450 "GET /ping HTTP/1.1" 200 1
2024-03-29T13:56:58.931+09:00 | 2024-03-29T04:56:54,916 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
...
2024-03-29T13:57:21.492+09:00 | 2024-03-29T04:57:21,381 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-7f1e57a2 Worker disconnected. WORKER_STARTED
2024-03-29T13:57:21.492+09:00 | 2024-03-29T04:57:21,384 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2024-03-29T13:57:22.494+09:00 | 2024-03-29T04:57:21,385 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-7f1e57a2 in 1 seconds.
2024-03-29T13:57:22.494+09:00 | 2024-03-29T04:57:22,385 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000

There is no real error in the logs, but the workers keep going down.

I have no idea how to handle this.
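
For context, the endpoint loads the model through a custom inference.py roughly like this (a sketch along the lines of the example; the pre/post-processing here is simplified):

```python
# inference.py (sketch; response encoding and error handling are simplified)
import base64
from io import BytesIO

from optimum.neuron import NeuronStableDiffusionXLPipeline


def model_fn(model_dir):
    # Loading the precompiled Neuron artifacts; this is the step that never
    # completes before the MMS worker dies
    return NeuronStableDiffusionXLPipeline.from_pretrained(model_dir)


def predict_fn(data, pipeline):
    prompt = data.pop("inputs", data)
    image = pipeline(prompt=prompt).images[0]
    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    return {"generated_image": base64.b64encode(buffer.getvalue()).decode()}
```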

The test also fails with the official SDXL model that the example uses, 'stabilityai/stable-diffusion-xl-base-1.0'.

An SDXL model compiled in the neuronx 2.15 environment works well in SageMaker. But when I upgrade to 2.16 or 2.17, or compile the SDXL Turbo model even in the 2.15 environment, the model load fails in SageMaker.

Expected behavior

It should work the same way as in the EC2 environment.

In the EC2 environment it works very well (I can load the compiled model and run the predict function).
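
This is roughly the check that succeeds on the EC2 inf2 instance (a sketch; the path and prompt are placeholders):

```python
# Load the compiled artifacts from disk and run a prediction directly on the instance
from optimum.neuron import NeuronStableDiffusionXLPipeline

pipeline = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_neuron/")
image = pipeline(prompt="a photo of an astronaut riding a horse").images[0]
image.save("test.png")
```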

Suprhimp commented 6 months ago

I found the main cause of this symptom.

An OOM error occurs in the Docker container on the inf2 instance (specifically, the Docker OOM only happens on an EC2 instance without the Hugging Face AMI).

When I use the Hugging Face AMI with an inf2.xlarge instance, my SageMaker Docker environment works very well, without any error.

But when I build and run the Docker container on the plain AWS Ubuntu AMI, I get an OOM error inside the container.
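
One way to confirm that the container was killed by the OOM killer rather than by the model load itself (a sketch using the docker Python SDK; the container name is a placeholder):

```python
# Check the container state after the worker dies (pip install docker)
import docker

client = docker.from_env()
container = client.containers.get("sagemaker-container")  # placeholder name
container.reload()  # refresh the cached container state
state = container.attrs["State"]
print("OOMKilled:", state["OOMKilled"], "ExitCode:", state["ExitCode"])
```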