yolk-pie-L opened this issue 6 months ago (status: Open)
Hi @yolk-pie-L, I was not able to reproduce this with the 0.9.0 Docker image, and the error log is inconclusive. We just released 0.10.0; could you retry with the new version?
@lxning do you have any idea what could cause the java.lang.InterruptedException: null?
@mreso
This is very strange, because I can run .mar files from other sources, but not the ones I pack myself. I tried a public .mar file (available via gsutil at gs://kfserving-examples/models/torchserve/image_classifier/v1/model-store/mnist.mar), and it runs normally in the image. However, no matter whether it is a diffuser or a transformer model, the ones I pack myself cannot run. So I suspect there is a problem with how the model is packaged.
To pack the model I use another Docker image, huggingface/transformers-cpu:3.4.0, so that I can use a lower version of transformers, because the current version generates a .safetensors file and I don't know how to handle it. Inside huggingface/transformers-cpu:3.4.0 I run:
python Download_Transformer_models.py
torch-model-archiver --model-name BERTSeqClassification --version 1.0 --serialized-file Transformer_model/pytorch_model.bin --handler ./Transformer_handler_generalized.py --extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json"
Then, using pytorch/torchserve:latest, I execute:
torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs
I install additional Python packages in the huggingface/transformers-cpu:3.4.0 image to help pack the model.
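Since the suspicion is that the self-packed archive differs from archives that work, one quick check is to diff the archive contents directly. As far as I can tell, a .mar is a plain zip file with a manifest; in the archives I have unpacked the manifest sits at MAR-INF/MANIFEST.json, but treat that exact path as an assumption. The hypothetical helper below compares a suspect archive against a known-good one such as mnist.mar:

```python
import json
import zipfile

# Where torch-model-archiver puts the manifest inside the archive; in the
# archives I have unpacked it sits under MAR-INF/, but treat the exact
# path as an assumption for your own files.
MANIFEST = "MAR-INF/MANIFEST.json"

def inspect_mar(path):
    """Return (sorted member names, parsed manifest dict or None) for a .mar.

    A .mar is read here as a plain zip archive.
    """
    with zipfile.ZipFile(path) as zf:
        names = sorted(zf.namelist())
        manifest = json.loads(zf.read(MANIFEST)) if MANIFEST in names else None
    return names, manifest

def missing_members(good_mar, suspect_mar):
    """Member names present in a known-good .mar but absent from the suspect one."""
    good_names, _ = inspect_mar(good_mar)
    suspect_names, _ = inspect_mar(suspect_mar)
    return sorted(set(good_names) - set(suspect_names))
```

Pointing missing_members at the working mnist.mar and the self-packed BERTSeqClassification.mar should show whether the serialized file, handler, or one of the extra files failed to land in the archive.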
# pip list
Package Version
---------------------- ----------
absl-py 0.10.0
argon2-cffi 20.1.0
asn1crypto 0.24.0
astunparse 1.6.3
async-generator 1.10
attrs 20.2.0
backcall 0.2.0
bleach 3.2.1
cachetools 4.1.1
certifi 2020.6.20
cffi 1.14.3
chardet 3.0.4
click 7.1.2
coloredlogs 15.0.1
cryptography 2.1.4
dataclasses 0.7
decorator 4.4.2
defusedxml 0.6.0
entrypoints 0.3
enum-compat 0.0.3
filelock 3.0.12
future 0.18.2
gast 0.3.3
google-auth 1.21.3
google-auth-oauthlib 0.4.1
google-pasta 0.2.0
grpcio 1.32.0
h5py 2.10.0
humanfriendly 10.0
idna 2.6
importlib-metadata 2.0.0
ipykernel 5.3.4
ipython 7.16.1
ipython-genutils 0.2.0
ipywidgets 7.5.1
jedi 0.17.2
Jinja2 2.11.2
joblib 0.17.0
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.7
jupyter-console 6.2.0
jupyter-core 4.6.3
jupyterlab-pygments 0.1.1
Keras-Preprocessing 1.1.2
keyring 10.6.0
keyrings.alt 3.0
Markdown 3.2.2
MarkupSafe 1.1.1
mistune 0.8.4
mpmath 1.3.0
nbclient 0.5.0
nbconvert 6.0.6
nbformat 5.0.7
nest-asyncio 1.4.1
notebook 6.1.4
numpy 1.18.5
oauthlib 3.1.0
opt-einsum 3.3.0
optimum 1.1.1
packaging 20.4
pandocfilters 1.4.2
parso 0.7.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.4.0
pip 20.2.3
prometheus-client 0.8.0
prompt-toolkit 3.0.7
protobuf 3.13.0
psutil 5.9.8
ptyprocess 0.6.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
pycrypto 2.6.1
Pygments 2.7.1
pygobject 3.26.1
pyparsing 2.4.7
pyrsistent 0.17.3
python-dateutil 2.8.1
pyxdg 0.25
pyzmq 19.0.2
qtconsole 4.7.7
QtPy 1.9.0
regex 2020.10.15
requests 2.24.0
requests-oauthlib 1.3.0
rsa 4.6
sacremoses 0.0.43
SecretStorage 2.3.1
Send2Trash 1.5.0
sentencepiece 0.1.91
setuptools 50.3.0
six 1.15.0
sympy 1.9
tensorboard 2.3.0
tensorboard-plugin-wit 1.7.0
tensorflow-cpu 2.3.1
tensorflow-estimator 2.3.0
termcolor 1.1.0
terminado 0.9.1
testpath 0.4.4
tokenizers 0.9.2
torch 1.10.2
torch-model-archiver 0.9.0
torchserve 0.9.0
tornado 6.0.4
tqdm 4.50.2
traitlets 4.3.3
transformers 3.4.0
typing-extensions 4.1.1
urllib3 1.25.10
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 1.0.1
wheel 0.30.0
widgetsnbextension 3.5.1
wrapt 1.12.1
zipp 3.2.0
My model is shared here: https://github.com/yolk-pie-L/TorchServeModels. Could you help take a look at it?
Thank you so much!
@yolk-pie-L can you please try the following steps?
1. Put transformers==4.28.1 in a requirements.txt.
2. Repackage, passing the requirements file to the archiver:
torch-model-archiver --model-name BERTSeqClassification --version 1.0 --serialized-file Transformer_model/pytorch_model.bin --handler ./Transformer_handler_generalized.py --extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json" -r requirements.txt
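Another packaging pitfall worth ruling out is a malformed setup_config.json among the --extra-files, since the example handler reads it at load time. A minimal sanity-check sketch; the key names here (model_name, mode, do_lower_case, save_mode, max_length) are taken from the public Huggingface_Transformers example and should be treated as assumptions, not an authoritative schema:

```python
import json

# Keys the example Transformer_handler_generalized.py is commonly shown
# reading from setup_config.json; this list is an assumption, not an
# authoritative schema.
EXPECTED_KEYS = {"model_name", "mode", "do_lower_case", "save_mode", "max_length"}

def check_setup_config(text):
    """Parse setup_config.json text and list expected keys that are missing."""
    cfg = json.loads(text)
    missing = sorted(EXPECTED_KEYS - cfg.keys())
    return cfg, missing

sample = json.dumps({
    "model_name": "bert-base-uncased",
    "mode": "sequence_classification",
    "do_lower_case": True,
    "save_mode": "pretrained",
    "max_length": "150",
})
cfg, missing = check_setup_config(sample)
print("missing keys:", missing)  # -> missing keys: []
```

A json.JSONDecodeError here, or a non-empty missing list, would make the worker die during model load in much the way the logs below show.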
🐛 Describe the bug
I followed the tutorial at https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers
First,
Then,
Finally,
The system cannot start as usual; it produces the error log below, throwing an exception.
I tried curl to check the model.
Error logs
2024-03-14T07:34:24,938 [INFO ] epollEventLoopGroup-5-17 org.pytorch.serve.wlm.WorkerThread - 9015 Worker disconnected. WORKER_STARTED
2024-03-14T07:34:24,938 [INFO ] W-9015-my_tc_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9015.
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
    at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
    at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
    at org.pytorch.serve.wlm.Model.pollBatch(Model.java:367) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:36) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:194) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9015-my_tc_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-14T07:34:24,938 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-14T07:34:24,939 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9015-my_tc_1.0-stderr
2024-03-14T07:34:24,939 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9015-my_tc_1.0-stdout
2024-03-14T07:34:24,939 [INFO ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9015 in 3 seconds.
2024-03-14T07:34:24,946 [INFO ] W-9015-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9015-my_tc_1.0-stdout
2024-03-14T07:34:24,946 [INFO ] W-9015-my_tc_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9015-my_tc_1.0-stderr
2024-03-14T07:34:27,207 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9010, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,489 [DEBUG] W-9012-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9012, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,579 [DEBUG] W-9000-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,669 [DEBUG] W-9011-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9011, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,704 [DEBUG] W-9006-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9006, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,707 [DEBUG] W-9008-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9008, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,734 [DEBUG] W-9017-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9017, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,751 [DEBUG] W-9003-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9003, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,776 [DEBUG] W-9001-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9001, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,804 [DEBUG] W-9005-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9005, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,815 [DEBUG] W-9009-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9009, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,844 [DEBUG] W-9013-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9013, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,848 [DEBUG] W-9004-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9004, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,853 [DEBUG] W-9007-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9007, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,857 [DEBUG] W-9019-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9019, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,871 [DEBUG] W-9002-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9002, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,904 [DEBUG] W-9014-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9014, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,904 [DEBUG] W-9018-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9018, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,927 [DEBUG] W-9016-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9016, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,939 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9015, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:28,642 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9010, pid=8906
2024-03-14T07:34:28,644 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9010
2024-03-14T07:34:28,657 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2024-03-14T07:34:28,658 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - [PID]8906
2024-03-14T07:34:28,658 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-14T07:34:28,659 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9010-my_tc_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-03-14T07:34:28,659 [INFO ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9010
2024-03-14T07:34:28,660 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9010 Worker disconnected. WORKER_STARTED
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:424) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:191) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9010-my_tc_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-14T07:34:28,662 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-14T07:34:28,663 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9010-my_tc_1.0-stderr
2024-03-14T07:34:28,663 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9010-my_tc_1.0-stdout
2024-03-14T07:34:28,664 [INFO ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9010 in 5 seconds.
2024-03-14T07:34:28,692 [ERROR] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - Unknown exception io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2024-03-14T07:34:28,698 [INFO ] W-9010-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9010-my_tc_1.0-stdout
2024-03-14T07:34:28,698 [INFO ] W-9010-my_tc_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9010-my_tc_1.0-stderr
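For what it's worth, the wall of log lines above reduces to the same cycle per worker: start, immediate disconnect, WORKER_STARTED -> WORKER_STOPPED, retry. A small scanner over the frontend log (the line format is assumed from the excerpt above) can make that pattern visible instead of eyeballing it:

```python
import re
from collections import Counter

# Matches frontend log lines of the shape seen above, e.g.:
# 2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - <message>
LINE = re.compile(
    r"^(?P<ts>\S+) \[(?P<level>\w+) ?\] (?P<thread>\S+) \S+ - (?P<msg>.*)$"
)

def summarize(log_text):
    """Count worker state changes and retries in a TorchServe frontend log."""
    events = Counter()
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        msg = m.group("msg")
        if "State change " in msg:
            events["state_change:" + msg.split("State change ")[1]] += 1
        elif msg.startswith("Retry worker"):
            events["retry"] += 1
    return events

sample = """\
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9015-my_tc_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-14T07:34:24,939 [INFO ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9015 in 3 seconds.
"""
print(summarize(sample))
```

Run against the full ts_log.log, a large count of WORKER_STARTED -> WORKER_STOPPED transitions with matching retries, and no successful inference, points the blame at the backend worker dying during model load rather than at the frontend.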
Installation instructions
pip install torchserve
Yes, I am using the Docker image pytorch/torchserve:latest.
Model Packaging
I use transformers==3.4.0 to save the pretrained model into
config.properties
No response
Versions
torchserve==0.9.0
torch-model-archiver==0.9.0
Python version: 3.9 (64-bit runtime)
Python executable: /home/venv/bin/python
Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
psutil==5.9.5
requests==2.31.0
torch==2.1.0+cpu
torch-model-archiver==0.9.0
torch-workflow-archiver==0.2.11
torchaudio==2.1.0+cpu
torchdata==0.7.0
torchserve==0.9.0
torchtext==0.16.0+cpu
torchvision==0.16.0+cpu
wheel==0.40.0
Java Version:
OS: Ubuntu 20.04.6 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: N/A
CMake version: N/A
Environment: library path (LD_/DYLD_):
Repro instructions
As described above
Possible Solution
No response