gal-star opened 2 years ago
Hi,
Thanks for the proposal!
This is something we would like to get done in the future (probably using at::parallel_for
from PyTorch), but it might be slightly more difficult to get right.
Indeed, we've tried adding -fopenmp
in torchvision in the past, but we reverted it because linking against the same OpenMP that PyTorch was using required more work, see https://github.com/pytorch/vision/pull/3038 (and discussion in the linked issue).
If you have ideas on how this could be improved so that we don't get the same type of segfaults that torchaudio was facing, we would love to know! Maybe this would involve using CMake to compile the extensions, which would be a lot more work though.
Hi, I am working with Gal. Two questions:
Thanks, Ran.
Hi @RanACohen
1. My current take is that we won't be using TBB, except through PyTorch abstractions.
2. The issue was caused by conflicting dependencies between what PyTorch was using for parallelization and what torchaudio was picking up.
so I take it that if we use at::parallel_for instead of OpenMP, we are good, right?
> so I take it that if we use at::parallel_for instead of OpenMP, we are good, right?
Yes.
But the problem remains that we would somehow need to enable OpenMP (or the like) at compilation time, and the errors that torchaudio was facing would remain.
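For reference, here is a minimal sketch of what the `at::parallel_for` route could look like, assuming the outer per-ROI loop in roi_align_kernel.cpp is the one being parallelized (variable names here are illustrative, not copied from the actual kernel):

```cpp
#include <ATen/Parallel.h>

// Sketch only: run the outer ROI loop through PyTorch's own threading
// abstraction so it uses whatever backend PyTorch was built with
// (OpenMP, native thread pool, ...), avoiding a second OpenMP runtime.
at::parallel_for(0, n_rois, 0, [&](int64_t begin, int64_t end) {
  for (int64_t n = begin; n < end; n++) {
    // ... existing per-ROI pooling work; iterations are independent
    // because each ROI writes only to its own output slice ...
  }
});
```

Passing a grain size of 0 splits the range across all available threads; `at::internal::GRAIN_SIZE` could be used instead to avoid spawning threads for tiny workloads.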
Can you elaborate on how to reproduce the torchaudio issue? We would like to solve this. We need to enable the parallelization in Torch; it is a waste of CPU resources to ignore it.
🚀 The feature
Looking at the implementation of roi_align_kernel, it seems this can be further optimized using OpenMP parallelization:
https://github.com/pytorch/vision/blob/840ad8abd60b76d340ae0bde33e2230fad38e95a/torchvision/csrc/ops/cpu/roi_align_kernel.cpp#L27
Here's what can be done to get a performance boost: add `#pragma omp parallel for` to the kernel loop (line 27), as sketched below.
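A minimal sketch of the change, assuming the outer loop over the ROIs is the one being annotated (names below are illustrative, not copied from the actual kernel):

```cpp
// Sketch only: parallelize the outer per-ROI loop with OpenMP.
// The body stands in for the existing pooling code in
// roi_align_forward_kernel_impl; each iteration writes only to its
// own ROI's output slice, so iterations are independent.
#pragma omp parallel for
for (int n = 0; n < n_rois; n++) {
  // ... existing per-ROI pooling code ...
}
```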
Motivation, pitch
I did some experimentation locally in which I parallelized torchvision.ops.roi_align() and measured the runtime of the current implementation vs. a version using 18 threads on a simple CLX machine. In my humble experiments it shows a 10X performance boost!
Alternatives
There may be other libraries/tooling that could optimize this CPU kernel; one could think of oneTBB or something similar. Nevertheless, the current implementation is really naive and could easily be made much more performant.
Additional context
No response