pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Torchserve not starting for diffusers example #3345

Open dummyuser-123 opened 1 week ago

dummyuser-123 commented 1 week ago

🐛 Describe the bug

I was running the diffusers example with TorchServe, following the tutorial linked in the README file. However, I am not able to start TorchServe even after following all the instructions correctly.

Error logs

(env) D:\Text-to-Image\only cartoon torchserve\diffusers>torchserve --start --ts-config config.properties --disable-token-auth --enable-model-api

(env) D:\Text-to-Image\only cartoon torchserve\diffusers>WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-10T11:31:05,290 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": CreateProcess error=2, The system cannot find the file specified
2024-10-10T11:31:05,290 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-10T11:31:05,321 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-10T11:31:05,352 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml
2024-10-10T11:31:05,446 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages
Current directory: D:\Text-to-Image\only cartoon torchserve\diffusers
Temp directory: C:\Users\Win\AppData\Local\Temp
Metrics config path: D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 12
Max heap size: 4056 M
Python executable: D:\Text-to-Image\only cartoon torchserve\env\Scripts\python.exe
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: D:\Text-to-Image\only cartoon torchserve\diffusers
Initial Models: all
Log dir: D:\Text-to-Image\only cartoon torchserve\diffusers\logs
Metrics dir: D:\Text-to-Image\only cartoon torchserve\diffusers\logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 655350000
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: true
Workflow Store: D:\Text-to-Image\only cartoon torchserve\diffusers
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-10T11:31:05,462 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-10-10T11:31:05,462 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: Diffusion_model
2024-10-10T11:31:05,483 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir C:\Users\Win\AppData\Local\Temp\models\9075562fcbd8442ab843e59a5b06db52
2024-10-10T11:31:05,483 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: D:\Text-to-Image\only cartoon torchserve\diffusers\Diffusion_model
java.nio.file.FileSystemException: C:\Users\Win\AppData\Local\Temp\models\9075562fcbd8442ab843e59a5b06db52\model: A required privilege is not held by the client
        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) ~[?:?]
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) ~[?:?]
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108) ~[?:?]
        at sun.nio.fs.WindowsFileSystemProvider.createSymbolicLink(WindowsFileSystemProvider.java:604) ~[?:?]
        at java.nio.file.Files.createSymbolicLink(Files.java:1070) ~[?:?]
        at org.pytorch.serve.archive.utils.ZipUtils.createSymbolicDir(ZipUtils.java:159) ~[model-server.jar:?]
        at org.pytorch.serve.archive.model.ModelArchive.downloadModel(ModelArchive.java:94) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:185) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:143) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:205) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:399) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:124) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.main(ModelServer.java:105) [model-server.jar:?]
2024-10-10T11:31:05,483 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: logs
2024-10-10T11:31:05,483 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir C:\Users\Win\AppData\Local\Temp\models\42efbe4c41a846ab82c229b237ce0d97
2024-10-10T11:31:05,483 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: D:\Text-to-Image\only cartoon torchserve\diffusers\logs
java.nio.file.FileSystemException: C:\Users\Win\AppData\Local\Temp\models\42efbe4c41a846ab82c229b237ce0d97\logs: A required privilege is not held by the client
        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) ~[?:?]
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) ~[?:?]
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108) ~[?:?]
        at sun.nio.fs.WindowsFileSystemProvider.createSymbolicLink(WindowsFileSystemProvider.java:604) ~[?:?]
        at java.nio.file.Files.createSymbolicLink(Files.java:1070) ~[?:?]
        at org.pytorch.serve.archive.utils.ZipUtils.createSymbolicDir(ZipUtils.java:159) ~[model-server.jar:?]
        at org.pytorch.serve.archive.model.ModelArchive.downloadModel(ModelArchive.java:98) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:185) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:143) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:205) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:399) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:124) [model-server.jar:?]
        at org.pytorch.serve.ModelServer.main(ModelServer.java:105) [model-server.jar:?]
2024-10-10T11:31:05,483 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: stable-diffusion.mar
2024-10-10T11:31:38,874 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model stable-diffusion
2024-10-10T11:31:38,874 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model stable-diffusion
2024-10-10T11:32:06,070 [INFO ] main org.pytorch.serve.wlm.ModelManager - Installed custom pip packages for model stable-diffusion
2024-10-10T11:32:06,070 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model stable-diffusion loaded.
2024-10-10T11:32:06,070 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: stable-diffusion, count: 1
2024-10-10T11:32:06,070 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: NioServerSocketChannel.
2024-10-10T11:32:06,070 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\only cartoon torchserve\env\Scripts\python.exe, D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T11:32:06,118 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-10-10T11:32:06,118 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: NioServerSocketChannel.
2024-10-10T11:32:06,118 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-10-10T11:32:06,118 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: NioServerSocketChannel.
2024-10-10T11:32:06,118 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Listening on addr:port: 127.0.0.1:9000
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Successfully loaded D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml.
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - [PID]6988
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Torch worker started.
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Python runtime: 3.10.6
2024-10-10T11:32:07,613 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change null -> WORKER_STARTED
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2024-10-10T11:32:07,613 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2024-10-10T11:32:07,628 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1728540127628
2024-10-10T11:32:07,628 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728540127628
2024-10-10T11:32:07,644 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - model_name: stable-diffusion, batchSize: 1
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Diffusers version 0.6.0
2024-10-10T11:32:08,883 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - initialized function called
2024-10-10T11:32:38,017 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Diffusion model Extracted successfully
2024-10-10T11:32:38,017 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Backend worker process died.
2024-10-10T11:32:38,017 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-10-10T11:32:38,017 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py", line 301, in <module>
2024-10-10T11:32:38,017 [INFO ] nioEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-10-10T11:32:38,017 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - worker.run_server()
2024-10-10T11:32:38,027 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py", line 268, in run_server
2024-10-10T11:32:38,027 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-10-10T11:32:38,027 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-10-10T11:32:38,027 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., startupTimeout:360sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1770) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:234) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
2024-10-10T11:32:38,027 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py", line 196, in handle_connection
2024-10-10T11:32:38,028 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-10-10T11:32:38,028 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py", line 133, in load_model
2024-10-10T11:32:38,028 [WARN ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: stable-diffusion, error: Worker died.
2024-10-10T11:32:38,028 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-10-10T11:32:38,029 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "D:\Text-to-Image\only cartoon torchserve\env\lib\site-packages\ts\model_loader.py", line 143, in load
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1728540158029
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "C:\Users\Win\AppData\Local\Temp\models\7104be7eba2d4971bcbc3dcc27f2b599\stable_diffusion_handler.py", line 48, in initialize
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - self.pipe = DiffusionPipeline.from_pretrained(model_dir + "/model")
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "C:\Users\Win\AppData\Local\Temp\models\7104be7eba2d4971bcbc3dcc27f2b599\diffusers\pipeline_utils.py", line 403, in from_pretrained
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - config_dict = cls.get_config_dict(cached_folder)
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - File "C:\Users\Win\AppData\Local\Temp\models\7104be7eba2d4971bcbc3dcc27f2b599\diffusers\configuration_utils.py", line 217, in get_config_dict
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - raise EnvironmentError(
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - OSError: Error no file named model_index.json found in directory C:\Users\Win\AppData\Local\Temp\models\7104be7eba2d4971bcbc3dcc27f2b599/model.
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-stable-diffusion_1.0-stdout
2024-10-10T11:32:38,029 [INFO ] W-9000-stable-diffusion_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-stable-diffusion_1.0-stderr
2024-10-10T11:32:39,034 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\only cartoon torchserve\env\Scripts\python.exe, D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T11:32:40,367 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Listening on addr:port: 127.0.0.1:9000
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Successfully loaded D:\Text-to-Image\only cartoon torchserve\env\Lib\site-packages/ts/configs/metrics.yaml.
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - [PID]14112
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Torch worker started.
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Python runtime: 3.10.6
2024-10-10T11:32:40,372 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2024-10-10T11:32:40,372 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1728540160372
2024-10-10T11:32:40,372 [INFO ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728540160372
2024-10-10T11:32:40,399 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - model_name: stable-diffusion, batchSize: 1
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - Diffusers version 0.6.0
2024-10-10T11:32:41,293 [INFO ] W-9000-stable-diffusion_1.0-stdout MODEL_LOG - initialized function called
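
The decisive failure above is the OSError from DiffusionPipeline.from_pretrained: the handler calls DiffusionPipeline.from_pretrained(model_dir + "/model") and expects a "model" folder containing model_index.json inside the extracted archive. That failing call can be reproduced outside TorchServe with a few lines; this is only a sketch, and the path is a placeholder for wherever the packaged model folder was saved or unzipped:

import os
from diffusers import DiffusionPipeline

# Placeholder path: point this at the folder that ends up as "<model_dir>/model" in the archive.
model_root = r"D:\Text-to-Image\only cartoon torchserve\diffusers\Diffusion_model"

# from_pretrained() requires model_index.json at the top of this folder,
# which is exactly what the OSError in the worker log complains about.
print(os.path.isfile(os.path.join(model_root, "model_index.json")))

# If the file is present, this mirrors the handler's load call and should succeed locally.
pipe = DiffusionPipeline.from_pretrained(model_root)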

Installation instructions

Yes, I have installed TorchServe from source.

Model Packaging

I have followed this link to package the model.
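
For context, the packaging step in that example essentially downloads the diffusers pipeline and saves it into a local model folder, which is then bundled into the .mar. A minimal sketch of that step follows; the model id and output path are assumptions, not necessarily the exact values used in the tutorial:

# Sketch of the "download / save the model" packaging step (paths and model id are placeholders).
from diffusers import DiffusionPipeline

# Download the pipeline from the Hugging Face Hub (requires network access and,
# for some checkpoints, an auth token).
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Save all pipeline components locally; this writes model_index.json plus the
# component subfolders that DiffusionPipeline expects to find when loading.
pipe.save_pretrained("./Diffusion_model")

Whatever folder save_pretrained writes has to end up inside the archive as the "model" directory, since the handler later calls DiffusionPipeline.from_pretrained(model_dir + "/model"), as the traceback in the error log above shows.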

config.properties

inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
install_py_dep_per_model=true
load_models=all
max_response_size=655350000
disable_system_metrics=true
model_store=D:/Text-to-Image/only cartoon torchserve/diffusers
default_startup_timeout=360

Versions


Environment headers

Torchserve branch:

torchserve==0.12.0 torch-model-archiver==0.12.0

Python version: 3.10 (64-bit runtime)
Python executable: D:\Text-to-Image\only cartoon torchserve\env\Scripts\python.exe

Versions of relevant python libraries:
numpy==2.1.2
torch==2.4.1+cu118
torch-model-archiver==0.12.0
torchserve==0.12.0
torch==2.4.1+cu118
Warning: torchtext not present ..
Warning: torchvision not present ..
Warning: torchaudio not present ..

Java Version:

OS: Microsoft Windows 11 Pro
GCC version: N/A
Clang version: N/A
CMake version: version 3.27.9

Is CUDA available: Yes
CUDA runtime version: 11.8.89
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3050
Nvidia driver version: 537.58
cuDNN version: None

Repro instructions

I have followed the instructions from the link and got the error while running step 4.

Possible Solution

No response

dummyuser-123 commented 1 week ago

After running the command prompt as administrator, I am getting this error now:


(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>torchserve --start --ts-config config.properties --disable-token-auth  --enable-model-api

(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-10T17:47:49,728 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": CreateProcess error=2, The system cannot find the file specified
2024-10-10T17:47:49,744 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-10T17:47:49,759 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-10T17:47:49,791 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml
2024-10-10T17:47:49,901 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages
Current directory: D:\Text-to-Image\torchserve_diffusers
Temp directory: C:\Users\Win\AppData\Local\Temp
Metrics config path: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 12
Max heap size: 4056 M
Python executable: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: D:\Text-to-Image\torchserve_diffusers
Initial Models: all
Log dir: D:\Text-to-Image\torchserve_diffusers\logs
Metrics dir: D:\Text-to-Image\torchserve_diffusers\logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 655350000
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: true
Workflow Store: D:\Text-to-Image\torchserve_diffusers
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-10T17:47:49,901 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-10-10T17:47:49,916 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: Diffusion_model
2024-10-10T17:47:49,932 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42
2024-10-10T17:47:49,932 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42\Diffusion_model
2024-10-10T17:47:49,932 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive version is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2024-10-10T17:47:49,932 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive createdOn is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2024-10-10T17:47:49,932 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Diffusion_model
2024-10-10T17:47:49,932 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Diffusion_model
2024-10-10T17:47:49,932 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Diffusion_model loaded.
2024-10-10T17:47:49,932 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Diffusion_model, count: 1
2024-10-10T17:47:49,932 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: logs
2024-10-10T17:47:49,932 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267
2024-10-10T17:47:49,932 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267\logs
2024-10-10T17:47:49,932 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive version is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2024-10-10T17:47:49,932 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:49,948 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive createdOn is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2024-10-10T17:47:49,948 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model logs
2024-10-10T17:47:49,948 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model logs
2024-10-10T17:47:49,948 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model logs loaded.
2024-10-10T17:47:49,948 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: logs, count: 1
2024-10-10T17:47:49,948 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: stable-diffusion.mar
2024-10-10T17:47:49,948 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9001, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:49,948 [ERROR] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267\logs"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 8 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 8 more
2024-10-10T17:47:49,945 [ERROR] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42\Diffusion_model"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 8 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 8 more
2024-10-10T17:47:49,956 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-logs_1.0 State change null -> WORKER_STOPPED
2024-10-10T17:47:49,957 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-Diffusion_model_1.0 State change null -> WORKER_STOPPED
2024-10-10T17:47:49,961 [INFO ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1728562669961
2024-10-10T17:47:49,966 [INFO ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1728562669966
2024-10-10T17:47:49,967 [INFO ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2024-10-10T17:47:49,967 [INFO ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-10-10T17:47:50,968 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:50,968 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9001, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:50,971 [ERROR] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42\Diffusion_model"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:50,971 [ERROR] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267\logs"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:50,971 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-Diffusion_model_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:50,972 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-logs_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:50,973 [WARN ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:50,973 [WARN ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:50,974 [INFO ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-10-10T17:47:50,974 [INFO ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2024-10-10T17:47:51,978 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:51,978 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9001, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:51,982 [ERROR] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267\logs"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:51,982 [ERROR] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42\Diffusion_model"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:51,983 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-logs_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:51,984 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-Diffusion_model_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:51,984 [WARN ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:51,985 [WARN ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:51,985 [INFO ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 2 seconds.
2024-10-10T17:47:51,991 [INFO ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds.
2024-10-10T17:47:53,997 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9001, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:53,997 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-10T17:47:53,997 [ERROR] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\217d664a8a064550b24c55d580ac5267\logs"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:53,997 [ERROR] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Failed start worker process
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:210) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:375) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:192) ~[model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) [?:?]
Caused by: java.io.IOException: Cannot run program "D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe" (in directory "C:\Users\Win\AppData\Local\Temp\models\084876218b504c9fb5ebc93827d2ca42\Diffusion_model"): CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1170) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid
        at java.lang.ProcessImpl.create(Native Method) ~[?:?]
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:500) ~[?:?]
        at java.lang.ProcessImpl.start(ProcessImpl.java:159) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1126) ~[?:?]
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1089) ~[?:?]
        at java.lang.Runtime.exec(Runtime.java:681) ~[?:?]
        at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:192) ~[model-server.jar:?]
        ... 6 more
2024-10-10T17:47:54,000 [DEBUG] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-logs_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:54,001 [DEBUG] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-Diffusion_model_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-10-10T17:47:54,002 [WARN ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:54,002 [WARN ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-10-10T17:47:54,003 [INFO ] W-9001-logs_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 3 seconds.
2024-10-10T17:47:54,004 [INFO ] W-9000-Diffusion_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds.

(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>

I don't know why this is happening with the diffusers example, because I have cross-checked with the alexnet example given in the repo and it works perfectly.

mreso commented 1 week ago

Hi @dummyuser-123, that's strange. Can you post the output of the successful run with alexnet?

dummyuser-123 commented 6 days ago

Sure, these logs are from the TorchServe side:

(ts_env) D:\Text-to-Image\torchserve>torchserve --start --model-store model_store --models alexnet=alexnet.mar --disable-token-auth  --enable-model-api

(ts_env) D:\Text-to-Image\torchserve>WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-11T09:49:44,142 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-10-11T09:49:44,213 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": CreateProcess error=2, The system cannot find the file specified
2024-10-11T09:49:44,214 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-11T09:49:44,230 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-11T09:49:44,268 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from D:\Text-to-Image\torchserve\ts_env\Lib\site-packages/ts/configs/metrics.yaml
2024-10-11T09:49:44,366 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: D:\Text-to-Image\torchserve\ts_env\Lib\site-packages
Current directory: D:\Text-to-Image\torchserve
Temp directory: C:\Users\Win\AppData\Local\Temp
Metrics config path: D:\Text-to-Image\torchserve\ts_env\Lib\site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 12
Max heap size: 4056 M
Python executable: D:\Text-to-Image\torchserve\ts_env\Scripts\python.exe
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: D:\Text-to-Image\torchserve\model_store
Initial Models: alexnet=alexnet.mar
Log dir: D:\Text-to-Image\torchserve\logs
Metrics dir: D:\Text-to-Image\torchserve\logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: D:\Text-to-Image\torchserve\model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-11T09:49:44,371 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: alexnet.mar
2024-10-11T09:49:47,050 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model alexnet
2024-10-11T09:49:47,050 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model alexnet
2024-10-11T09:49:47,050 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model alexnet loaded.
2024-10-11T09:49:47,051 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: alexnet, count: 1
2024-10-11T09:49:47,058 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: NioServerSocketChannel.
2024-10-11T09:49:47,059 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve\ts_env\Scripts\python.exe, D:\Text-to-Image\torchserve\ts_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve\ts_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-11T09:49:47,102 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-10-11T09:49:47,103 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: NioServerSocketChannel.
2024-10-11T09:49:47,104 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-10-11T09:49:47,104 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: NioServerSocketChannel.
2024-10-11T09:49:47,105 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-10-11T09:49:47,258 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-10-11T09:49:47,754 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,756 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:72.93034362792969|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,757 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:395.81965255737305|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,758 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:84.4|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,758 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:6.04248046875|#Level:Host,DeviceId:0|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,759 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:495.0|#Level:Host,DeviceId:0|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,760 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,760 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:9921.55078125|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,761 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:6301.578125|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:47,761 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:38.8|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620387
2024-10-11T09:49:48,530 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Listening on addr:port: 127.0.0.1:9000
2024-10-11T09:49:48,536 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Successfully loaded D:\Text-to-Image\torchserve\ts_env\Lib\site-packages/ts/configs/metrics.yaml.
2024-10-11T09:49:48,536 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - [PID]15736
2024-10-11T09:49:48,536 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-10-11T09:49:48,537 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.6
2024-10-11T09:49:48,537 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-alexnet_1.0 State change null -> WORKER_STARTED
2024-10-11T09:49:48,539 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2024-10-11T09:49:48,545 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2024-10-11T09:49:48,547 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1728620388547
2024-10-11T09:49:48,548 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728620388548
2024-10-11T09:49:48,568 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - model_name: alexnet, batchSize: 1
2024-10-11T09:49:49,473 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-10-11T09:49:49,474 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-10-11T09:49:49,474 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-10-11T09:49:49,475 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-10-11T09:49:49,683 [WARN ] W-9000-alexnet_1.0-stderr MODEL_LOG - D:\Text-to-Image\torchserve\ts_env\lib\site-packages\ts\torch_handler\base_handler.py:355: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
2024-10-11T09:49:49,684 [WARN ] W-9000-alexnet_1.0-stderr MODEL_LOG -   state_dict = torch.load(model_pt_path, map_location=map_location)
2024-10-11T09:49:50,016 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1468
2024-10-11T09:49:50,017 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-alexnet_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-10-11T09:49:50,018 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:2961.0|#WorkerName:W-9000-alexnet_1.0,Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620390
2024-10-11T09:49:50,019 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620390
2024-10-11T09:50:04,545 [INFO ] nioEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620404
2024-10-11T09:50:04,547 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1728620404547
2024-10-11T09:50:04,547 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728620404547
2024-10-11T09:50:04,548 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Backend received inference at: 1728620404
2024-10-11T09:50:07,895 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:3346.83|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728620407,94ba59d7-cf76-4cf6-9f5b-6588170230d2, pattern=[METRICS]
2024-10-11T09:50:07,896 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Sending response for jobId 94ba59d7-cf76-4cf6-9f5b-6588170230d2
2024-10-11T09:50:07,896 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - HandlerTime.ms:3346.83|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:94ba59d7-cf76-4cf6-9f5b-6588170230d2,timestamp:1728620407
2024-10-11T09:50:07,897 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:3346.83|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728620407,94ba59d7-cf76-4cf6-9f5b-6588170230d2, pattern=[METRICS]
2024-10-11T09:50:07,897 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - PredictionTime.ms:3346.83|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:94ba59d7-cf76-4cf6-9f5b-6588170230d2,timestamp:1728620407
2024-10-11T09:50:07,897 [INFO ] W-9000-alexnet_1.0 ACCESS_LOG - /127.0.0.1:59859 "PUT /predictions/alexnet HTTP/1.1" 200 3353
2024-10-11T09:50:07,898 [INFO ] W-9000-alexnet_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620407
2024-10-11T09:50:07,898 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:3349614.8|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620407
2024-10-11T09:50:07,899 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:107.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620407
2024-10-11T09:50:07,899 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 107000, Backend time ns: 3351905900
2024-10-11T09:50:07,899 [INFO ] W-9000-alexnet_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620407
2024-10-11T09:50:07,900 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 3348
2024-10-11T09:50:07,900 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620407
2024-10-11T09:50:13,893 [INFO ] nioEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620413
2024-10-11T09:50:13,894 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1728620413894
2024-10-11T09:50:13,894 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728620413894
2024-10-11T09:50:13,895 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Backend received inference at: 1728620413
2024-10-11T09:50:13,964 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:69.51|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728620413,7fb7e588-a7eb-409e-9dc8-3b9fd20abde0, pattern=[METRICS]
2024-10-11T09:50:13,964 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Sending response for jobId 7fb7e588-a7eb-409e-9dc8-3b9fd20abde0
2024-10-11T09:50:13,964 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - HandlerTime.ms:69.51|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:7fb7e588-a7eb-409e-9dc8-3b9fd20abde0,timestamp:1728620413
2024-10-11T09:50:13,965 [INFO ] W-9000-alexnet_1.0 ACCESS_LOG - /127.0.0.1:59860 "PUT /predictions/alexnet HTTP/1.1" 200 72
2024-10-11T09:50:13,965 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:70.51|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728620413,7fb7e588-a7eb-409e-9dc8-3b9fd20abde0, pattern=[METRICS]
2024-10-11T09:50:13,965 [INFO ] W-9000-alexnet_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620413
2024-10-11T09:50:13,966 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - PredictionTime.ms:70.51|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:7fb7e588-a7eb-409e-9dc8-3b9fd20abde0,timestamp:1728620413
2024-10-11T09:50:13,966 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:71523.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620413
2024-10-11T09:50:13,967 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:62.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728620413
2024-10-11T09:50:13,967 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 62000, Backend time ns: 73738100
2024-10-11T09:50:13,967 [INFO ] W-9000-alexnet_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620413
2024-10-11T09:50:13,967 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 70
2024-10-11T09:50:13,968 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728620413

And these are from the inference side:

(ts_env) D:\Text-to-Image\torchserve>curl http://127.0.0.1:8080/predictions/alexnet -T kitten.jpg
{
  "tabby": 0.3188355267047882,
  "tiger_cat": 0.2579926550388336,
  "Egyptian_cat": 0.24233946204185486,
  "lynx": 0.1685769110918045,
  "tiger": 0.00650126114487648
}
(ts_env) D:\Text-to-Image\torchserve>curl http://127.0.0.1:8080/predictions/alexnet -T huskey.jpg
{
  "Eskimo_dog": 0.6400406956672668,
  "Siberian_husky": 0.13422252237796783,
  "dogsled": 0.12762515246868134,
  "malamute": 0.06892743706703186,
  "Norwegian_elkhound": 0.018150705844163895
}
dummyuser-123 commented 6 days ago

Sorry, I pressed the close-issue button by mistake.

mreso commented 6 days ago

It seems like you're running the two examples in different environments with different Python executables:

D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe

vs

D:\Text-to-Image\torchserve\ts_env\Scripts\python.exe

Can you try running the alexnet example in the same environment as the diffusers example?

dummyuser-123 commented 6 days ago

Okay, this is from the same environment:

(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>torchserve --start --model-store model_store --models alexnet=alexnet.mar --disable-token-auth  --enable-model-api

(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-11T10:06:26,952 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-10-11T10:06:27,013 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": CreateProcess error=2, The system cannot find the file specified
2024-10-11T10:06:27,015 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-11T10:06:27,033 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-11T10:06:27,076 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml
2024-10-11T10:06:27,178 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages
Current directory: D:\Text-to-Image\torchserve_diffusers
Temp directory: C:\Users\Win\AppData\Local\Temp
Metrics config path: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 12
Max heap size: 4056 M
Python executable: D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: D:\Text-to-Image\torchserve_diffusers\model_store
Initial Models: alexnet=alexnet.mar
Log dir: D:\Text-to-Image\torchserve_diffusers\logs
Metrics dir: D:\Text-to-Image\torchserve_diffusers\logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 655350000
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: true
Workflow Store: D:\Text-to-Image\torchserve_diffusers\model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-11T10:06:27,184 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: alexnet.mar
2024-10-11T10:06:29,872 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model alexnet
2024-10-11T10:06:29,872 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model alexnet
2024-10-11T10:06:29,872 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model alexnet loaded.
2024-10-11T10:06:29,873 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: alexnet, count: 1
2024-10-11T10:06:29,879 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: NioServerSocketChannel.
2024-10-11T10:06:29,880 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Scripts\python.exe, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages\ts\model_service_worker.py, --sock-type, tcp, --port, 9000, --metrics-config, D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml]
2024-10-11T10:06:29,925 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-10-11T10:06:29,925 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: NioServerSocketChannel.
2024-10-11T10:06:29,926 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-10-11T10:06:29,926 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: NioServerSocketChannel.
2024-10-11T10:06:29,927 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-10-11T10:06:31,294 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Listening on addr:port: 127.0.0.1:9000
2024-10-11T10:06:31,299 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Successfully loaded D:\Text-to-Image\torchserve_diffusers\ts_diff_env\Lib\site-packages/ts/configs/metrics.yaml.
2024-10-11T10:06:31,300 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - [PID]10636
2024-10-11T10:06:31,300 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-10-11T10:06:31,300 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.6
2024-10-11T10:06:31,301 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-alexnet_1.0 State change null -> WORKER_STARTED
2024-10-11T10:06:31,303 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2024-10-11T10:06:31,309 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2024-10-11T10:06:31,312 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1728621391312
2024-10-11T10:06:31,313 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728621391313
2024-10-11T10:06:31,337 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - model_name: alexnet, batchSize: 1
2024-10-11T10:06:32,819 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-10-11T10:06:32,819 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-10-11T10:06:32,820 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-10-11T10:06:32,820 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-10-11T10:06:33,043 [WARN ] W-9000-alexnet_1.0-stderr MODEL_LOG - D:\Text-to-Image\torchserve_diffusers\ts_diff_env\lib\site-packages\ts\torch_handler\base_handler.py:355: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
2024-10-11T10:06:33,043 [WARN ] W-9000-alexnet_1.0-stderr MODEL_LOG -   state_dict = torch.load(model_pt_path, map_location=map_location)
2024-10-11T10:06:33,378 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2063
2024-10-11T10:06:33,379 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-alexnet_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-10-11T10:06:33,379 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:3503.0|#WorkerName:W-9000-alexnet_1.0,Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621393
2024-10-11T10:06:33,380 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621393
2024-10-11T10:06:43,807 [INFO ] nioEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621403
2024-10-11T10:06:43,809 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1728621403809
2024-10-11T10:06:43,809 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728621403809
2024-10-11T10:06:43,810 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Backend received inference at: 1728621403
2024-10-11T10:06:50,072 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:6262.75|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728621410,3189e4ff-e587-4750-b26f-8fd6531911ac, pattern=[METRICS]
2024-10-11T10:06:50,073 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Sending response for jobId 3189e4ff-e587-4750-b26f-8fd6531911ac
2024-10-11T10:06:50,074 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - HandlerTime.ms:6262.75|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:3189e4ff-e587-4750-b26f-8fd6531911ac,timestamp:1728621410
2024-10-11T10:06:50,074 [INFO ] W-9000-alexnet_1.0 ACCESS_LOG - /127.0.0.1:60071 "PUT /predictions/alexnet HTTP/1.1" 200 6268
2024-10-11T10:06:50,074 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:6262.75|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728621410,3189e4ff-e587-4750-b26f-8fd6531911ac, pattern=[METRICS]
2024-10-11T10:06:50,075 [INFO ] W-9000-alexnet_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621410
2024-10-11T10:06:50,075 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - PredictionTime.ms:6262.75|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:3189e4ff-e587-4750-b26f-8fd6531911ac,timestamp:1728621410
2024-10-11T10:06:50,075 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:6265728.4|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621410
2024-10-11T10:06:50,076 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:102.5|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621410
2024-10-11T10:06:50,076 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 102500, Backend time ns: 6268402200
2024-10-11T10:06:50,077 [INFO ] W-9000-alexnet_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621410
2024-10-11T10:06:50,077 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 6264
2024-10-11T10:06:50,077 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621410
2024-10-11T10:06:53,259 [INFO ] nioEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621413
2024-10-11T10:06:53,259 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1728621413259
2024-10-11T10:06:53,259 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1728621413259
2024-10-11T10:06:53,260 [INFO ] W-9000-alexnet_1.0-stdout MODEL_LOG - Backend received inference at: 1728621413
2024-10-11T10:06:53,327 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:66.51|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728621413,663d9476-ce81-42c2-b8e7-dc60b2d93624, pattern=[METRICS]
2024-10-11T10:06:53,328 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Sending response for jobId 663d9476-ce81-42c2-b8e7-dc60b2d93624
2024-10-11T10:06:53,328 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - HandlerTime.ms:66.51|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:663d9476-ce81-42c2-b8e7-dc60b2d93624,timestamp:1728621413
2024-10-11T10:06:53,328 [INFO ] W-9000-alexnet_1.0 ACCESS_LOG - /127.0.0.1:60072 "PUT /predictions/alexnet HTTP/1.1" 200 70
2024-10-11T10:06:53,328 [INFO ] W-9000-alexnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:66.51|#ModelName:alexnet,Level:Model|#type:GAUGE|#hostname:DESKTOP-7K8171O,1728621413,663d9476-ce81-42c2-b8e7-dc60b2d93624, pattern=[METRICS]
2024-10-11T10:06:53,329 [INFO ] W-9000-alexnet_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621413
2024-10-11T10:06:53,329 [INFO ] W-9000-alexnet_1.0-stdout MODEL_METRICS - PredictionTime.ms:66.51|#ModelName:alexnet,Level:Model|#hostname:DESKTOP-7K8171O,requestID:663d9476-ce81-42c2-b8e7-dc60b2d93624,timestamp:1728621413
2024-10-11T10:06:53,329 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:68171.0|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621413
2024-10-11T10:06:53,330 [INFO ] W-9000-alexnet_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:56.6|#model_name:alexnet,model_version:default|#hostname:DESKTOP-7K8171O,timestamp:1728621413
2024-10-11T10:06:53,330 [DEBUG] W-9000-alexnet_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 56600, Backend time ns: 70408500
2024-10-11T10:06:53,331 [INFO ] W-9000-alexnet_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621413
2024-10-11T10:06:53,331 [INFO ] W-9000-alexnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 67
2024-10-11T10:06:53,331 [INFO ] W-9000-alexnet_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:DESKTOP-7K8171O,timestamp:1728621413
(ts_env) D:\Text-to-Image\torchserve>curl http://127.0.0.1:8080/predictions/alexnet -T kitten.jpg
{
  "tabby": 0.3188355267047882,
  "tiger_cat": 0.2579926550388336,
  "Egyptian_cat": 0.24233946204185486,
  "lynx": 0.1685769110918045,
  "tiger": 0.00650126114487648
}
(ts_env) D:\Text-to-Image\torchserve>curl http://127.0.0.1:8080/predictions/alexnet -T huskey.jpg
{
  "Eskimo_dog": 0.6400406956672668,
  "Siberian_husky": 0.13422252237796783,
  "dogsled": 0.12762515246868134,
  "malamute": 0.06892743706703186,
  "Norwegian_elkhound": 0.018150705844163895
}
dummyuser-123 commented 6 days ago

Also, I had made the inference call from a different env by mistake, so here is the updated log from the correct one.

(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>curl http://127.0.0.1:8080/predictions/alexnet -T kitten.jpg
{
  "tabby": 0.3188355267047882,
  "tiger_cat": 0.2579926550388336,
  "Egyptian_cat": 0.24233946204185486,
  "lynx": 0.1685769110918045,
  "tiger": 0.00650126114487648
}
(ts_diff_env) D:\Text-to-Image\torchserve_diffusers>curl http://127.0.0.1:8080/predictions/alexnet -T huskey.jpg
{
  "Eskimo_dog": 0.6400406956672668,
  "Siberian_husky": 0.13422252237796783,
  "dogsled": 0.12762515246868134,
  "malamute": 0.06892743706703186,
  "Norwegian_elkhound": 0.018150705844163895
}
dummyuser-123 commented 6 days ago

Also, while debugging the issue yesterday, I noticed this difference between the alexnet and diffusers examples.

When starting the alexnet model, it creates files like these in the temp folder:

[Screenshot 2024-10-10 154415]

But when starting the diffusers model, no such files are created in the temp folder:

[Screenshot 2024-10-11 101336]

I'm mentioning this in case it helps you pinpoint the exact problem.
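(For anyone who wants to reproduce this check, here is a minimal sketch that lists what TorchServe has unpacked into the temp models directory; the "models" subfolder under the temp directory is an assumption based on TorchServe's usual behavior, and the fallback path is only illustrative.)

import os

# TorchServe normally unpacks each loaded MAR into a randomly named
# subfolder under <TEMP>\models; adjust the base path for your machine.
models_tmp = os.path.join(
    os.environ.get("TEMP", r"C:\Users\Win\AppData\Local\Temp"), "models"
)

if os.path.isdir(models_tmp):
    for entry in os.listdir(models_tmp):
        print(entry, "->", os.listdir(os.path.join(models_tmp, entry)))
else:
    print("No temp models directory found at:", models_tmp)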

mreso commented 6 days ago

Can you check your MAR file and post its contents? It's basically a zip file that you can simply decompress.
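For reference, since a .mar archive is a regular zip file, a minimal sketch of one way to list its contents from Python (the archive path is a placeholder; point it at your own file):

import zipfile

# Placeholder path; replace with the actual location of your archive.
mar_path = r"model_store\your_model.mar"

# A .mar file is a standard zip archive, so the standard library can read it.
with zipfile.ZipFile(mar_path) as mar:
    for name in mar.namelist():
        print(name)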

dummyuser-123 commented 19 hours ago

Hey @mreso, sorry for the delay. I tried using a new environment, and it worked for me. I'm not sure exactly what made it work, but I made two changes in the new environment. First, I only installed torchserve, torch-model-archiver, and torch-workflow-archiver (the old environment had additional libraries). Second, I created a custom Stable Diffusion handler that uses .safetensors weights instead of the diffusers format weights. After making these changes, torchserve is working properly.

Thank you for your support and quick response!
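For anyone following along, below is a minimal sketch of the kind of custom handler described above. It assumes a recent diffusers version that provides StableDiffusionPipeline.from_single_file and a checkpoint packaged into the MAR as model.safetensors; the file name, the StableDiffusionHandler class name, and the plain-text prompt format are all assumptions, not the exact code used in this issue.

# stable_diffusion_handler.py -- illustrative sketch only
import io
import os

import torch
from diffusers import StableDiffusionPipeline
from ts.torch_handler.base_handler import BaseHandler


class StableDiffusionHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        use_cuda = torch.cuda.is_available() and properties.get("gpu_id") is not None
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id")) if use_cuda else "cpu"
        )
        dtype = torch.float16 if use_cuda else torch.float32

        # Load directly from a single .safetensors checkpoint packaged in the MAR.
        # "model.safetensors" is an assumed file name.
        self.pipe = StableDiffusionPipeline.from_single_file(
            os.path.join(model_dir, "model.safetensors"),
            torch_dtype=dtype,
        ).to(self.device)
        self.initialized = True

    def preprocess(self, requests):
        # Assumes each request body is a plain-text prompt.
        prompts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            prompts.append(data)
        return prompts

    def inference(self, prompts):
        # The pipeline returns a list of PIL images, one per prompt.
        return self.pipe(prompts).images

    def postprocess(self, images):
        # Return raw PNG bytes, one entry per request.
        outputs = []
        for image in images:
            buf = io.BytesIO()
            image.save(buf, format="PNG")
            outputs.append(buf.getvalue())
        return outputs

A MAR built around such a handler would typically be created with torch-model-archiver, passing the handler file via --handler and the .safetensors checkpoint via --extra-files (or --serialized-file), then served the same way as the alexnet example above.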