yongbinfeng opened this issue 8 months ago
Thank you for the request! I've created a feature request ticket for our team: 6210
Thanks a lot! In our case we have found that adding at::set_num_threads(1)
really helps PyTorch inference on many-core CPUs: https://github.com/yongbinfeng/pytorch_backend/blob/withThreadControl/src/libtorch.cc#L489
We submitted the PR to the pytorch_backend repo: https://github.com/triton-inference-server/pytorch_backend/pull/125
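For reference, a minimal sketch of the idea (the helper name and signature below are illustrative, not the actual pytorch_backend code): cap ATen's intra-op thread pool right before the TorchScript module is deserialized.

```cpp
// Hypothetical helper for a libtorch-based backend's model-load path;
// names are illustrative, not the real pytorch_backend API.
#include <ATen/Parallel.h>
#include <torch/script.h>
#include <string>

torch::jit::script::Module LoadModelWithThreadControl(
    const std::string& model_path, int intra_op_threads = 1) {
  // at::set_num_threads() limits the intra-op pool used by ATen CPU kernels.
  // Pinning it to 1 avoided the pathological CPU latencies described above.
  at::set_num_threads(intra_op_threads);
  return torch::jit::load(model_path);
}
```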
Is your feature request related to a problem? Please describe.
Currently the TensorFlow and ONNX backends in Triton support thread controls (here and here). We would like to have a similar feature for PyTorch as well.
This is useful because in several cases we have seen PyTorch inference run (super) slowly on multi-core CPU machines. On machines with O(100) cores we have even seen a single inference take several minutes, despite the model being small. This might be due to an internal PyTorch problem, but a temporary workaround seems to be to set the intra-op parallelism to 1.
See examples here, here, and also previously in Triton issues here. In our case we have found that setting the number of model instances is NOT enough to fix the problem; we need to set both the number of model instances and the intra-op parallelism to 1. We tested this with some examples and confirmed that it fixes the slow-inference problem for PyTorch on CPUs.
This can be done by calling
at::set_num_threads(1)
when loading the models (https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html#runtime-api). We have one implementation fixing this problem here. If this solution sounds good to you, we can open a pull request for it.
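To illustrate the effect outside Triton, here is a small standalone libtorch sketch (the model path and input shape are placeholders) that pins the intra-op pool before running a TorchScript model; the same call is what the proposed backend change would make at model-load time.

```cpp
// Standalone reproduction sketch: run a TorchScript model with the
// intra-op thread pool pinned to 1. Model path and input shape are
// placeholders for whatever small CPU model exhibits the slowdown.
#include <ATen/Parallel.h>
#include <torch/script.h>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
  at::set_num_threads(1);  // intra-op pool (the workaround discussed above)
  std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;

  torch::jit::script::Module module = torch::jit::load(argv[1]);
  module.eval();

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));  // placeholder input
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << "output sizes: " << output.sizes() << std::endl;
  return 0;
}
```

at::set_num_interop_threads(1) can additionally limit the inter-op pool, but it has to be called before any inter-op parallel work starts.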