gal-star opened 2 years ago
Hi,
Thanks for the proposal!
This is something we would like to get done in the future (probably using at::parallel_for
from PyTorch), but it might be slightly more difficult to get right.
Indeed, we've tried adding -fopenmp
in torchvision in the past, but we reverted it because linking against the same OpenMP that PyTorch was using required more work, see https://github.com/pytorch/vision/pull/3038 (and discussion in the linked issue).
If you have ideas on how this could be improved so that we don't get the same type of segfaults that torchaudio was facing, we would love to know! Maybe this would involve using CMake to compile the extensions, which would be a lot more work though.
Hi, I am working with Gal. Two questions:
Thanks, Ran.
Hi @RanACohen
1. My current take is that we won't be using TBB, except through PyTorch abstractions.
2. The issue was caused by conflicting dependencies between what PyTorch was using for parallelization and what torchaudio was picking up.
so I take it that if we use at::parallel_for instead of OpenMP, we are good, right?
> so I take it that if we use at::parallel_for instead of OpenMP, we are good, right?
Yes.
But the problem remains that we would somehow need to enable OpenMP (or the like) at compilation time, and the errors that torchaudio was facing would remain.
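For reference, here is a minimal sketch of what the `at::parallel_for` route could look like, assuming the outer per-ROI loop in roi_align_kernel.cpp is the one being parallelized (variable names here are illustrative, not copied from the actual kernel):

```cpp
#include <ATen/Parallel.h>

// Sketch only: run the outer ROI loop through PyTorch's own threading
// abstraction so it uses whatever backend PyTorch was built with
// (OpenMP, native thread pool, ...), avoiding a second OpenMP runtime.
at::parallel_for(0, n_rois, 0, [&](int64_t begin, int64_t end) {
  for (int64_t n = begin; n < end; n++) {
    // ... existing per-ROI pooling work; iterations are independent
    // because each ROI writes only to its own output slice ...
  }
});
```

Passing a grain size of 0 splits the range across all available threads; `at::internal::GRAIN_SIZE` could be used instead to avoid spawning threads for tiny workloads.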
Can you elaborate on how to reproduce the torchaudio issue? We would like to solve this. We need to enable the parallelization in Torch; it is a waste of CPU resources to ignore it.
🚀 The feature
Looking at the implementation of roi_align_kernel, it seems this can be further optimized using OpenMP parallelization:
https://github.com/pytorch/vision/blob/840ad8abd60b76d340ae0bde33e2230fad38e95a/torchvision/csrc/ops/cpu/roi_align_kernel.cpp#L27
Here's what can be done to get a performance boost: add `#pragma omp parallel for` to the kernel loop (line 27), as sketched below.
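A minimal sketch of the change, assuming the outer loop over the ROIs is the one being annotated (names below are illustrative, not copied from the actual kernel):

```cpp
// Sketch only: parallelize the outer per-ROI loop with OpenMP.
// The body stands in for the existing pooling code in
// roi_align_forward_kernel_impl; each iteration writes only to its
// own ROI's output slice, so iterations are independent.
#pragma omp parallel for
for (int n = 0; n < n_rois; n++) {
  // ... existing per-ROI pooling code ...
}
```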
Motivation, pitch
I did some experimentation locally in which I parallelized torchvision.ops.roi_align() and measured the runtime of the current implementation vs. a version using 18 threads on a simple CLX machine. In my humble experiments it shows a 10X performance boost!
Alternatives
There may be other libraries/tooling that could optimize this CPU kernel; one could think of oneTBB or something similar. Nevertheless, the current implementation is really naive and could easily be made much more performant.
Additional context
No response