pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

package up metric collector independently #1484

Closed msaroufim closed 2 years ago

msaroufim commented 2 years ago

The metric collector is an interesting project in its own right, independent of TorchServe: anyone can use it to get system metrics for a PyTorch inference workload.

We can consider packaging it up independently so people can run it as a utility profiler: https://github.com/pytorch/serve/blob/master/ts/metrics/metric_collector.py
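
To make the "standalone utility" idea concrete, here is a rough sketch of a collector that depends only on the Python standard library. It is not the code in `metric_collector.py`. The function names `format_metric` and `disk_metrics` are hypothetical, the line layout copies the sample output posted in this issue, and the GiB conversion and the used/total percentage are assumptions about how the real values are computed.

```python
import shutil
import socket
import time


def format_metric(name, unit, value, hostname, timestamp):
    """Render one metric in the layout seen in this issue's sample output:
    Name.Unit:value|#Level:Host|#hostname:<host>,<unix time>
    """
    return f"{name}.{unit}:{value}|#Level:Host|#hostname:{hostname},{timestamp}"


def disk_metrics(path="/"):
    """Collect disk metrics for `path` using only the standard library.

    Assumption: gigabytes are computed as bytes / 1024**3 and utilization
    as used / total; the real collector may define these differently.
    """
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    host = socket.gethostname()
    now = int(time.time())
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return [
        format_metric("DiskAvailable", "Gigabytes", usage.free / gib, host, now),
        format_metric("DiskUsage", "Gigabytes", usage.used / gib, host, now),
        format_metric("DiskUtilization", "Percent", percent, host, now),
    ]


if __name__ == "__main__":
    for line in disk_metrics():
        print(line)
```

CPU, memory and GPU metrics are left out on purpose, since they need third-party packages or vendor tools that this sketch does not assume.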

Perhaps we can roll this into our existing #1457 efforts and combine it with ideas from https://github.com/pytorch/benchmark including

msaroufim commented 2 years ago

For example, if you just run

(ray) ubuntu@ip-172-31-63-237:~/serve/ts/metrics$ python3 metric_collector.py --gpu 0

You get the following logs. The last field is the Unix time, which a dashboard provider should be able to handle easily.

CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-63-237,1648517292
DiskAvailable.Gigabytes:110.83368682861328|#Level:Host|#hostname:ip-172-31-63-237,1648517292
DiskUsage.Gigabytes:82.9699821472168|#Level:Host|#hostname:ip-172-31-63-237,1648517292
DiskUtilization.Percent:42.8|#Level:Host|#hostname:ip-172-31-63-237,1648517292
MemoryAvailable.Megabytes:38658.7265625|#Level:Host|#hostname:ip-172-31-63-237,1648517292
MemoryUsed.Megabytes:1951.671875|#Level:Host|#hostname:ip-172-31-63-237,1648517292
MemoryUtilization.Percent:6.0|#Level:Host|#hostname:ip-172-31-63-237,1648517292
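
As an illustration of how a dashboard or script might consume these lines, here is a minimal parser sketch. It assumes every line follows the layout of the sample above (`Name.Unit:value`, then `|#key:value` dimensions, then a comma and the Unix timestamp). `MetricRecord` and `parse_metric_line` are hypothetical names and not part of TorchServe.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MetricRecord:
    name: str                   # e.g. "DiskUtilization"
    unit: str                   # e.g. "Percent"
    value: float
    dimensions: Dict[str, str]  # e.g. {"Level": "Host", "hostname": "..."}
    timestamp: int              # Unix time in seconds


def parse_metric_line(line: str) -> MetricRecord:
    """Parse one log line, assuming the layout shown in the sample output."""
    # The Unix timestamp follows the last comma on the line.
    body, _, ts = line.strip().rpartition(",")
    # The first '|' field is "Name.Unit:value"; the rest are "#key:value" dimensions.
    head, *dims = body.split("|")
    metric, _, value = head.partition(":")
    name, _, unit = metric.partition(".")
    dimensions = {}
    for dim in dims:
        key, _, val = dim.lstrip("#").partition(":")
        dimensions[key] = val
    return MetricRecord(name, unit, float(value), dimensions, int(ts))


if __name__ == "__main__":
    sample = "DiskUtilization.Percent:42.8|#Level:Host|#hostname:ip-172-31-63-237,1648517292"
    print(parse_metric_line(sample))
```

Splitting off the timestamp first keeps the parser independent of how many dimensions a line carries.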