pankajvshrma opened 1 year ago
Do you see metrics in your TorchServe log files (`ts_log.log`, `ts_metrics.log`)?
@frankiedrake yes, I see metrics being logged in the TorchServe log files (`ts_log.log`, `ts_metrics.log`). Sample from `ts_metrics.log`:
```
2022-11-09T10:33:36,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:33:51,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:34:06,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:34:21,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:34:35,383 - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - DiskAvailable.Gigabytes:247.98999786376953|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - DiskUsage.Gigabytes:285.09499740600586|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - DiskUtilization.Percent:53.5|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - GPUMemoryUtilization.Percent:47.659371200277924|#Level:Host,device_id:0|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - GPUMemoryUsed.Megabytes:10975|#Level:Host,device_id:0|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - MemoryAvailable.Megabytes:111133.0078125|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - MemoryUsed.Megabytes:15022.95703125|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:35,384 - MemoryUtilization.Percent:12.8|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075
2022-11-09T10:34:36,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:34:51,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
2022-11-09T10:35:06,163 - Requests2XX.Count:1|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667982096
```
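For reference, each `ts_metrics.log` entry follows TorchServe's StatsD-like format: `Name.Unit:value` followed by `|#`-separated dimension groups. A minimal parsing sketch (the function name and regex are mine, inferred from the sample lines above, not an official TorchServe API):

```python
import re

# One ts_metrics.log entry (timestamp prefix stripped) looks like:
#   CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-...,timestamp:1667990075
METRIC_RE = re.compile(r"(?P<name>\w+)\.(?P<unit>\w+):(?P<value>[\d.]+)\|#(?P<rest>.*)")

def parse_metric(entry: str) -> dict:
    """Split a TorchServe metric entry into name, unit, value and dimensions."""
    m = METRIC_RE.match(entry)
    if m is None:
        raise ValueError(f"unrecognized metric entry: {entry!r}")
    dims = {}
    # Dimension groups are separated by "|#", key:value pairs by ","
    for group in m.group("rest").split("|#"):
        for pair in group.split(","):
            key, _, val = pair.partition(":")
            dims[key] = val
    return {
        "name": m.group("name"),
        "unit": m.group("unit"),
        "value": float(m.group("value")),
        "dimensions": dims,
    }

sample = "CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-12-138-251,timestamp:1667990075"
parsed = parse_metric(sample)
```

This only confirms the log-file side is healthy; it says nothing about the Prometheus `/metrics` endpoint, which is the part failing here.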
I was getting an empty response when I specified a non-existent metric, or when I requested metrics before any inference API request had been made. Maybe the logs emit something unusual when you query the metrics? Did you try binding Prometheus and checking whether the metrics are available there?
@frankiedrake yes, I tried binding with Prometheus but still get no output there. I am not using any custom metrics. Even after a successful API call I see an empty response.
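For anyone else wiring this up, a minimal Prometheus scrape config for the TorchServe metrics endpoint might look like the following sketch (the job name and target are illustrative; `8082` matches the `metrics_address` port in the `config.properties` below):

```yaml
scrape_configs:
  - job_name: "torchserve"          # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ["127.0.0.1:8082"] # TorchServe metrics_address port
```

Note that a correct scrape config will still show no series if TorchServe itself returns an empty `/metrics` body, which is the symptom in this issue.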
Is it a known problem with workflow based APIs?
Hey @pankajvshrma. Right now we only support 3 inference metrics on the Prometheus metrics endpoint. Unless you run an inference, the metrics will be empty (with a 200 OK response). If you have a different expectation, let us know.
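For reference, the three inference metrics TorchServe 0.6.x exposes in Prometheus format are, per its metrics docs, `ts_inference_requests_total`, `ts_inference_latency_microseconds`, and `ts_queue_latency_microseconds`. A small sketch (function name and sample parsing are mine) for checking a saved `/metrics` response body offline, to distinguish "empty body" from "metrics present but named differently":

```python
# Metric families TorchServe 0.6.x should export after an inference
# (names taken from the TorchServe metrics documentation).
EXPECTED = {
    "ts_inference_requests_total",
    "ts_inference_latency_microseconds",
    "ts_queue_latency_microseconds",
}

def missing_inference_metrics(metrics_text: str) -> set:
    """Return the expected metric families absent from a /metrics body."""
    present = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        # A sample line looks like: name{label="v",...} value
        name = line.split("{", 1)[0].split(" ", 1)[0]
        present.add(name)
    return EXPECTED - present

# The symptom in this issue: an empty 200 body is missing all three.
assert missing_inference_metrics("") == EXPECTED
```

If the set returned for a real response is non-empty after a successful inference, the bug is on the TorchServe side rather than in the Prometheus scrape setup.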
Hey @maaquib, the /metrics endpoint is empty even after inference. I was wondering if this is an issue specific to workflows.
@pankajvshrma Seems like a bug with workflow. Will look into this
@maaquib is there any update on this?
I'm encountering the exact same situation in the same environment (torchserve 0.6.0-cpu): even after model inference, the /metrics endpoint returns nothing. I can also provide a Docker / Helm setup to reproduce.
I'm encountering the same issue
I have the same issue using docker image 0.7.1-cpu. Any news?
I found a workaround: metrics become available if you run inference against the individual models of the workflow (at least one of them) rather than against the workflow itself. E.g.

```shell
curl http://127.0.0.1:8080/predictions/dog_breed_wf__dog_breed_classification -T path_to_image/img.jpg
curl http://127.0.0.1:8080/predictions/dog_breed_wf__cat_dog_classification -T path_to_image/img.jpg
```

Then `curl http://127.0.0.1:8082/metrics` returns metrics.
pytorch/torchserve:0.6.1-gpu — I also encountered this problem. Even after calling the model, /metrics still returned an empty response.
🐛 Describe the bug
I have written a custom handler. After starting TorchServe and registering the workflow, `curl http://127.0.0.1:8082/metrics` returns nothing.
Error logs
No errors are logged; `curl http://127.0.0.1:8082/metrics` simply returns an empty response.
Installation instructions
Installed TorchServe from source; not using Docker.
Model Packaging
I use the parallel workflow framework and packaged 4 video models and 3 audio models, with two handler files: one for audio and one for video.
config.properties
```properties
enable_envvars_config=true
max_request_size=65535000
number_of_netty_threads=8
netty_client_threads=8
job_queue_size=1000
default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
```
Versions
Environment headers
Torchserve branch:
torchserve==0.6.0b20221029
torch-model-archiver==0.6.0b20221029
Python version: 3.8 (64-bit runtime)
Python executable: /home/chingari/434/my_env/bin/python
Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
intel-extension-for-pytorch==1.12.300
numpy==1.23.4
nvgpu==0.9.0
psutil==5.9.3
pygit2==1.6.1
pylint==2.6.0
pytest==7.2.0
pytest-cov==4.0.0
pytest-mock==3.10.0
requests==2.28.1
requests-toolbelt==0.10.1
torch==1.12.0+cu113
torch-model-archiver==0.6.0b20221029
torch-workflow-archiver==0.2.4b20221029
torchaudio==0.12.0+cu113
torchserve==0.6.0b20221029
torchtext==0.13.0
torchvision==0.13.0+cu113
transformers==4.11.0
wheel==0.37.1
Java Version:
OS: Ubuntu 20.04.4 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: N/A
Is CUDA available: Yes
CUDA runtime version: 11.6.112
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 510.85.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
Repro instructions
```shell
torchserve --start \
  --model-store tools/deployment/model_store \
  --workflow-store tools/deployment/workflow_store \
  --ncs \
  --ts-config tools/deployment/config.properties \
  --foreground
```
Logs
```
2022-11-09T07:57:21,740 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-11-09T07:57:21,815 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.6.0
TS Home: /home/chingari/434/my_env/lib/python3.8/site-packages
Current directory: /home/chingari/434/Video-Classification/mmaction2
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 32
Max heap size: 30688 M
Python executable: /home/chingari/434/my_env/bin/python
Config file: tools/deployment/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/chingari/434/Video-Classification/mmaction2/tools/deployment/model_store
Initial Models: N/A
Log dir: /home/chingari/434/Video-Classification/mmaction2/logs
Metrics dir: /home/chingari/434/Video-Classification/mmaction2/logs
Netty threads: 8
Netty client threads: 8
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 65535000
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/chingari/434/Video-Classification/mmaction2/tools/deployment/workflow_store
Model config: N/A
2022-11-09T07:57:21,820 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2022-11-09T07:57:21,838 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-11-09T07:57:21,886 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2022-11-09T07:57:21,886 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-11-09T07:57:21,888 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2022-11-09T07:57:21,888 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-11-09T07:57:21,889 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
```
(Each line appeared twice in the original log output; verbatim duplicates removed.)
Possible Solution
I tried making API calls. Even after successful inference calls, the Metrics API response is still empty, although the HTTP status is 200.