triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

thread control for pytorch backend to fix the issue of PyTorch very slow inference on multi-core CPUs #6896

Open yongbinfeng opened 8 months ago

yongbinfeng commented 8 months ago

Is your feature request related to a problem? Please describe.

Currently the TensorFlow and ONNX backends in Triton support thread controls (here and here). We would like to have a similar feature for the PyTorch backend as well.

This is useful because in several cases we have seen PyTorch inference run very slowly on multi-core CPU machines. On machines with O(100) cores we have even seen a single inference take several minutes, despite the model being small. This might be an internal PyTorch problem, but a temporary workaround is to set the intra-op parallelism to 1.

See examples here and here, and also a previous Triton issue here. In our case we found that setting the number of model instances alone is NOT enough to fix the problem; we need to set both the number of model instances and the intra-op parallelism to 1. We tested this on a few examples and confirmed it fixes the slow PyTorch inference on CPUs.

This can be done by calling at::set_num_threads(1) when loading the models: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html#runtime-api
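For reference, a minimal standalone libtorch sketch of the runtime API in question (the model path and input shape below are placeholders, not taken from the backend):

```cpp
#include <ATen/Parallel.h>
#include <torch/script.h>
#include <vector>

int main() {
  // Limit intra-op parallelism to a single thread before loading/running
  // the model; this is the runtime API referenced above.
  at::set_num_threads(1);

  // Load a TorchScript model (path is a placeholder).
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  // Run a single CPU inference with a dummy input.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));
  at::Tensor output = module.forward(inputs).toTensor();
  return 0;
}
```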

We have an implementation fixing this problem here. If this solution sounds good to you, we can open a pull request for it.

oandreeva-nv commented 8 months ago

Thank you for the request! I've created a feature request ticket for our team: 6210

yongbinfeng commented 8 months ago

> Thank you for the request! I've created a feature request ticket for our team: 6210

Thanks a lot! In our case we have found that adding at::set_num_threads(1) really helps PyTorch inference on many-core CPUs: https://github.com/yongbinfeng/pytorch_backend/blob/withThreadControl/src/libtorch.cc#L489
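Conceptually, the change amounts to applying the thread setting when the model is loaded, optionally gated by a model-config parameter. A rough, hypothetical sketch is below; the parameter name and helper are illustrative only, and the linked branch contains the actual implementation:

```cpp
#include <ATen/Parallel.h>
#include <string>

// Hypothetical helper: apply an intra-op thread count read from the model
// configuration. The parameter and this function are illustrative; they are
// not the actual pytorch_backend code.
void ApplyIntraOpThreadCount(const std::string& intra_op_thread_count)
{
  if (intra_op_thread_count.empty()) {
    return;  // keep PyTorch's default threading behavior
  }
  const int count = std::stoi(intra_op_thread_count);
  if (count > 0) {
    // With one model instance and count == 1, this removes the multi-core
    // CPU slowdown described in the issue.
    at::set_num_threads(count);
  }
}
```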

yongbinfeng commented 8 months ago

We submitted the PR to the pytorch_backend repo: https://github.com/triton-inference-server/pytorch_backend/pull/125