pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Running segment_anything_fast example locally #3186

Open yousofaly opened 2 weeks ago

yousofaly commented 2 weeks ago

🐛 Describe the bug

I followed the installation instructions in the main README file, then the instructions for running the segment_anything_fast example, and I am encountering an odd error.

Error logs

java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1765) ~[?:?]
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:229) ~[model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1570) [?:?]
2024-06-11T13:46:54,390 [WARN ] W-9006-sam-fast_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: sam-fast, error: Worker died.
2024-06-11T13:46:54,390 [DEBUG] W-9006-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9006-sam-fast_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-06-11T13:46:54,390 [WARN ] W-9006-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-06-11T13:46:54,390 [INFO ] W-9006-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9006 in 55 seconds.
2024-06-11T13:46:54,390 [INFO ] W-9006-sam-fast_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-sam-fast_1.0-stdout
2024-06-11T13:46:54,390 [INFO ] W-9006-sam-fast_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-sam-fast_1.0-stderr
2024-06-11T13:46:54,396 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Listening on addr:port: 127.0.0.1:9009
2024-06-11T13:46:54,401 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Successfully loaded /opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-06-11T13:46:54,401 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - [PID]86770
2024-06-11T13:46:54,401 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Torch worker started.
2024-06-11T13:46:54,401 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Python runtime: 3.10.14
2024-06-11T13:46:54,401 [DEBUG] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9009-sam-fast_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-06-11T13:46:54,402 [INFO ] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9009
2024-06-11T13:46:54,402 [DEBUG] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1718138814402
2024-06-11T13:46:54,402 [INFO ] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1718138814402
2024-06-11T13:46:54,402 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9009).
2024-06-11T13:46:54,403 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - model_name: sam-fast, batchSize: 1
2024-06-11T13:46:54,463 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Backend worker process died.
2024-06-11T13:46:54,463 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_loader.py", line 108, in load
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - module, function_name = self._load_handler_file(handler)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_loader.py", line 153, in _load_handler_file
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - module = importlib.import_module(module_name)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/importlib/__init__.py", line 126, in import_module
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - return _bootstrap._gcd_import(name[level:], package, level)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1050, in _gcd_import
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1027, in _find_and_load
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1006, in _find_and_load_unlocked
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 688, in _load_unlocked
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 883, in exec_module
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 241, in _call_with_frames_removed
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/Users/yousof/Desktop/serve/examples/large_models/segment_anything_fast/model_store/sam-fast/custom_handler.py", line 11, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from segment_anything_fast import SamAutomaticMaskGenerator, sam_model_fast_registry
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/__init__.py", line 7, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from .build_sam import (
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/build_sam.py", line 11, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from .modeling import ImageEncoderViT, MaskDecoder, PromptEncoder, Sam, TwoWayTransformer
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/modeling/__init__.py", line 7, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from .sam import Sam
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/modeling/sam.py", line 13, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from .image_encoder import ImageEncoderViT
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/modeling/image_encoder.py", line 15, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - from segment_anything_fast.flash_4 import _attention_rel_h_rel_w
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/segment_anything_fast/flash_4.py", line 23, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - import triton
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'triton'
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG -
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - During handling of the above exception, another exception occurred:
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG -
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_service_worker.py", line 263, in
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - worker.run_server()
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_service_worker.py", line 231, in run_server
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_service_worker.py", line 194, in handle_connection
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_loader.py", line 110, in load
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - module = self._load_default_handler(handler)
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/site-packages/ts/model_loader.py", line 159, in _load_default_handler
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - module = importlib.import_module(module_name, "ts.torch_handler")
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "/opt/anaconda3/envs/trchsrv/lib/python3.10/importlib/__init__.py", line 126, in import_module
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - return _bootstrap._gcd_import(name[level:], package, level)
2024-06-11T13:46:54,464 [INFO ] nioEventLoopGroup-5-28 org.pytorch.serve.wlm.WorkerThread - 9009 Worker disconnected. WORKER_STARTED
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1050, in _gcd_import
2024-06-11T13:46:54,464 [DEBUG] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1027, in _find_and_load
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 992, in _find_and_load_unlocked
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 241, in _call_with_frames_removed
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1050, in _gcd_import
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1027, in _find_and_load
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - File "", line 1004, in _find_and_load_unlocked
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'ts.torch_handler.custom_handler'
2024-06-11T13:46:54,464 [DEBUG] W-9009-sam-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:300sec
java.lang.InterruptedException: null

Installation instructions

I have also tried following the Docker instructions and encounter the same error. I am on a MacBook, so I have no access to a GPU, but as far as I can tell that shouldn't be an issue, as

./build_image.sh installs a CPU version by default.

Model Packaging

The model was packaged according to the instructions in the sam_fast README.md file.

config.properties

config.properties is unchanged from the original cloned file.

Versions

I am running Python 3.10 and the following torch packages:

pytorch-labs-segment-anything-fast @ git+https://github.com/pytorch-labs/segment-anything-fast.git@3e9c47d2ef18ddf4f179128e8c0f677dd5e989b8
torch==2.2.2
torch-model-archiver @ file:///usr/share/miniconda/envs/setup_conda/conda-bld/torch-model-archiver_1715885178714/work
torch-workflow-archiver @ file:///usr/share/miniconda/envs/setup_conda/conda-bld/torch-workflow-archiver_1715885227278/work
torchao==0.1
torchserve @ file:///usr/share/miniconda/envs/__setup_conda/conda-bld/torchserve_1715885095944/work
torchvision==0.17.2

Repro instructions

git clone https://github.com/pytorch/serve.git
cd serve

Install dependencies

CUDA is optional

python ./ts_scripts/install_dependencies.py --cuda=cu121
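On a CPU-only machine (such as this MacBook) the CUDA flag can presumably be dropped so the script installs the CPU builds instead:

python ./ts_scripts/install_dependencies.py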

Latest release

pip install torchserve torch-model-archiver torch-workflow-archiver

Nightly build

pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly

cd to the example folder examples/large_models/segment_anything_fast

cd examples/large_models/segment_anything_fast

install segment_anything_fast

chmod +x install_segment_anything_fast.sh
source install_segment_anything_fast.sh

download weights

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

If you are not using A100 for inference, turn off the A100 specific optimization using

export SEGMENT_ANYTHING_FAST_USE_FLASH_4=0

generate model archive

mkdir model_store
torch-model-archiver --model-name sam-fast --version 1.0 --handler custom_handler.py --config-file model-config.yaml --archive-format no-archive  --export-path model_store -f
mv sam_vit_h_4b8939.pth model_store/sam-fast/
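With --archive-format no-archive, the archiver should produce a plain directory rather than a .mar file, so after the move the layout is expected to look roughly like this (exact contents may differ):

ls model_store/sam-fast/
# expected (roughly): MAR-INF/  custom_handler.py  model-config.yaml  sam_vit_h_4b8939.pth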

start and run

torchserve --start --ncs --model-store model_store --models sam-fast
python inference.py
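While debugging the worker crashes, the model status can be checked through the management API before running inference (assuming the default management port 8081):

curl http://localhost:8081/models/sam-fast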

Possible Solution

No response

mreso commented 2 weeks ago

Hi @yousofaly, sorry, but the example will not run on a MacBook as the segment_anything_fast fork is specifically optimized for running on GPU (specifically A100s).

In your case it fails because the triton package is missing, which is used to create a custom CUDA kernel that replaces an auto-generated one from torch.compile:

2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - import triton
2024-06-11T13:46:54,464 [INFO ] W-9009-sam-fast_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'triton'
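You can confirm this quickly by checking whether triton is importable in the same environment (as far as I know, triton only ships Linux wheels, so installing it on macOS is not straightforward):

python -c "import triton"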

You can try running the original segment_anything version in the handler, but this might require some modifications to the original example.
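If you want to experiment with that route, a rough, untested sketch (not the official example) would be to install the original package and point the handler's import at it:

pip install git+https://github.com/facebookresearch/segment-anything.git
# then, in custom_handler.py, swap the import to the original package, e.g.
#   from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
# and build the model with sam_model_registry["vit_h"](checkpoint=...) instead of the *_fast variant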