huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

SDXL compiled model load fails in SageMaker only (it works well in an EC2 instance) #540

Closed Suprhimp closed 5 months ago

Suprhimp commented 6 months ago

System Info

optimum-neuron versions 0.0.16 through 0.0.20: the model load fails in every case.
neuronx version 2.16 / 2.17
I compiled the model with optimum-neuron on an AWS inf2 instance.

Who can help?

@dacorvo @JingyaHuang

Information

Tasks

Reproduction (minimal, reproducible, runnable)

https://www.philschmid.de/inferentia2-stable-diffusion-xl

I followed this example with the newest optimum-neuron version, and it failed.

I also set the requirements to match the local (compilation) environment.
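
For context, the compilation step from the example looks roughly like this (a sketch; the model id, input shapes, and compiler arguments shown here are illustrative, not necessarily the exact values I used):

```python
# Sketch of the Neuron compilation step on the inf2 instance, following the example
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # the same failure happens with the SDXL Turbo model
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}

pipeline = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,          # compile for Neuron instead of loading precompiled artifacts
    **compiler_args,
    **input_shapes,
)
pipeline.save_pretrained("sdxl_neuron/")  # artifacts are later packaged for the SageMaker endpoint
```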

My goal is to compile an SDXL Turbo model and then create a SageMaker endpoint with it, but it fails with the CloudWatch logs below.


2024-03-29T13:55:59.045+09:00 | 2024-03-29T04:55:57,004 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading only U-Net into both Neuron Cores...
2024-03-29T13:56:03.815+09:00 | 2024-03-29T04:55:58,881 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:33828 "GET /ping HTTP/1.1" 200 18
...
2024-03-29T13:56:54.923+09:00 | 2024-03-29T04:56:53,820 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:37450 "GET /ping HTTP/1.1" 200 1
2024-03-29T13:56:58.931+09:00 | 2024-03-29T04:56:54,916 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
...
2024-03-29T13:57:21.492+09:00 | 2024-03-29T04:57:21,381 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-7f1e57a2 Worker disconnected. WORKER_STARTED
2024-03-29T13:57:21.492+09:00 | 2024-03-29T04:57:21,384 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2024-03-29T13:57:22.494+09:00 | 2024-03-29T04:57:21,385 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-7f1e57a2 in 1 seconds.
2024-03-29T13:57:22.494+09:00 | 2024-03-29T04:57:22,385 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000

There is no real error in the logs, but the workers keep going down.

I have no idea how to handle this.
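
For context, the endpoint loads the model through a custom inference.py roughly like this (a sketch along the lines of the example; the pre/post-processing here is simplified):

```python
# inference.py (sketch; response encoding and error handling are simplified)
import base64
from io import BytesIO

from optimum.neuron import NeuronStableDiffusionXLPipeline


def model_fn(model_dir):
    # Loading the precompiled Neuron artifacts; this is the step that never
    # completes before the MMS worker dies
    return NeuronStableDiffusionXLPipeline.from_pretrained(model_dir)


def predict_fn(data, pipeline):
    prompt = data.pop("inputs", data)
    image = pipeline(prompt=prompt).images[0]
    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    return {"generated_image": base64.b64encode(buffer.getvalue()).decode()}
```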

The test also fails with the official SDXL model that the example uses, 'stabilityai/stable-diffusion-xl-base-1.0'.

An SDXL model compiled in the neuronx 2.15 environment works well in SageMaker. But when I upgrade to 2.16 or 2.17, or compile the SDXL Turbo model even in the 2.15 environment, the model load fails in SageMaker.

Expected behavior

It should work the same way as in the EC2 environment.

In the EC2 environment it works very well (I can load the compiled model and run the predict function).
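
This is roughly the check that succeeds on the EC2 inf2 instance (a sketch; the path and prompt are placeholders):

```python
# Load the compiled artifacts from disk and run a prediction directly on the instance
from optimum.neuron import NeuronStableDiffusionXLPipeline

pipeline = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_neuron/")
image = pipeline(prompt="a photo of an astronaut riding a horse").images[0]
image.save("test.png")
```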

Suprhimp commented 6 months ago

I found the main cause of this symptom.

An OOM error occurs in the Docker container on the inf2 instance (specifically, the Docker OOM only happens on an EC2 instance without the Hugging Face AMI).

When I use the Hugging Face AMI with an inf2.xlarge instance, my SageMaker Docker environment works very well, without any error.

But when I build and run the Docker container on the plain AWS Ubuntu AMI, I get an OOM error inside the container.
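
One way to confirm that the container was killed by the OOM killer rather than by the model load itself (a sketch using the docker Python SDK; the container name is a placeholder):

```python
# Check the container state after the worker dies (pip install docker)
import docker

client = docker.from_env()
container = client.containers.get("sagemaker-container")  # placeholder name
container.reload()  # refresh the cached container state
state = container.attrs["State"]
print("OOMKilled:", state["OOMKilled"], "ExitCode:", state["ExitCode"])
```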