pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

torch.eye(d) is slow and hogs cpu for d >= 182 #48251

Open · akamaus opened this issue 3 years ago

akamaus commented 3 years ago

🐛 Bug

I stumbled upon excessive CPU usage in my training code running on the GPU. After some investigation I found the culprit. It was basically:

x = torch.eye(256).to('cuda') 

To Reproduce

This is quick and loads a single CPU core.

%%timeit
    torch.eye(181)
6.43 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This is about 3 times slower and hogs around 30 cores.

%%timeit
    torch.eye(182)
21.2 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
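
For reference, a small sweep around the cliff makes the jump easy to see. A rough sketch using timeit (exact numbers will vary by machine and build):

import timeit
import torch

# Rough sketch: time torch.eye(d) for sizes around the reported cliff at d = 182.
for d in (180, 181, 182, 183, 256):
    t = timeit.timeit(lambda: torch.eye(d), number=10000)
    print(f"d={d}: {t / 10000 * 1e6:.2f} us per call")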

Environment

I tested on a server with 72 Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz cores (hyper-threading is on).

% python3 collect_env.py

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 440.100
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] pytorch-wavelets==1.2.2
[pip3] torch==1.6.0
[pip3] torchvision==0.4.2
[conda] Could not collect

cc @VitalyFedyunin @ngimel

ngimel commented 3 years ago

As a workaround, can you try torch.set_num_threads(1)? We should look at the thread settings in eye; something is going wrong there, as multi-threading should not lead to a slowdown.
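
A minimal sketch of applying the suggested workaround (illustrative only; the GPU-side alternative mentioned below avoids the CPU fill path entirely):

import timeit
import torch

torch.set_num_threads(1)  # suggested workaround: restrict intra-op parallelism
print(timeit.timeit(lambda: torch.eye(182), number=10000), "s for 10000 calls")

# Alternative noted later in the thread: allocate directly on the GPU,
# e.g. x = torch.eye(256, device='cuda'), so the CPU kernel never runs.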

akamaus commented 3 years ago

@ngimel As a workaround I'm perfectly happy with torch.eye(256, device='cuda'), and the main problem is not the slowdown but the heavy CPU load, which means I can't run several experiments simultaneously. By the way, even now I see occasional spikes of CPU load (%CPU jumps from 100 to 2000 in top). Is there a way to find the culprits?
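
As an aside, one way to hunt down such CPU-load culprits is the autograd profiler, ranked by self CPU time. A rough sketch, wrapping a suspect region:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Profile a suspect region and rank ops by the CPU time spent in them.
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        x = torch.eye(256).to(device)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))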

imaginary-person commented 3 years ago

> @ngimel As a workaround I'm perfectly happy with torch.eye(256, device='cuda'), and the main problem is not the slowdown but the heavy CPU load, which means I can't run several experiments simultaneously. By the way, even now I see occasional spikes of CPU load (%CPU jumps from 100 to 2000 in top). Is there a way to find the culprits?

The OpenMP implementation uses a threshold, at::internal::GRAIN_SIZE, to determine whether parallelization would be helpful for an array; it is deemed inefficient to parallelize over arrays shorter than GRAIN_SIZE. GRAIN_SIZE has heuristically been chosen as 32768 in the current implementation.

I had initially assumed that d >= 182 triggers more threads because 182 * 182 is 33124, which, being greater than 32768, triggers OpenMP parallelism, whereas d = 181 doesn't. However, my assumption was incorrect. Since an identity matrix can be created by simply writing d ones into a zeroed matrix, only d elements need to be assigned a value of 1. So, for this operation, work has to be done on only d elements, which is what at::parallel_for() does when called from eye_out_cpu().

But neither d = 181 nor d = 182 should result in parallelizing work across multiple threads, as both fail OpenMP's parallelization criterion (#pragma omp parallel if): they're less than GRAIN_SIZE.

So, I'm not sure why d = 182 results in more threads being used than d = 181. I don't know what the additional threads do, but I observed this behavior with strace as well. I'll try to debug with a debug build that I can use to set some breakpoints. It'd be great if someone could give me some advice on how I can debug this issue. Thanks!
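
For concreteness, the arithmetic behind the 181/182 boundary, assuming GRAIN_SIZE = 32768 as stated above:

GRAIN_SIZE = 32768  # at::internal::GRAIN_SIZE in the current implementation

for d in (181, 182):
    print(d, d * d, d * d > GRAIN_SIZE)
# d * d: 181 -> 32761 (just below the threshold), 182 -> 33124 (just above it),
# while d itself is far below GRAIN_SIZE in both cases.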

imaginary-person commented 3 years ago

> The OpenMP implementation uses a threshold, at::internal::GRAIN_SIZE, to determine whether parallelization would be helpful for an array; it is deemed inefficient to parallelize over arrays shorter than GRAIN_SIZE. GRAIN_SIZE has heuristically been chosen as 32768 in the current implementation.
>
> I had initially assumed that d >= 182 triggers more threads because 182 * 182 is 33124, which, being greater than 32768, triggers OpenMP parallelism, whereas d = 181 doesn't. However, my assumption was incorrect.
>
> I'll try to debug with a debug build that I can use to set some breakpoints. It'd be great if someone could give me some advice on how I can debug this issue. Thanks!

I checked with a debug build.

OBSERVATIONS

  1. d >= 182 results in OpenMP creating additional threads for zeroing the 182 * 182 elements of the matrix via the at::native::zero_() method, as 33124 exceeds GRAIN_SIZE, while 181 * 181 does not. The additional threads correspond to OpenMP threads for zeroing the matrix. For d less than 182, elements are zeroed via std::memset(). (A small sketch that isolates this zeroing step is included after this list.)
  2. Although the number of threads created by OpenMP is equal to the number of physical cores, only two of them end up getting any work assigned in the parallel for loop when d = 182.
  3. As a side note, the at::parallel_for() in the eye operation itself doesn't use OpenMP parallelism, as it has to write to just 182 memory locations, which is fewer than GRAIN_SIZE.
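
A quick way to isolate observation 1 is to time just the zero-fill on preallocated tensors of the two sizes. A rough sketch (whether the cliff reproduces depends on the build and machine):

import timeit
import torch

# Time only the zero-fill step, which is the first thing the eye kernel does.
for d in (181, 182):
    x = torch.empty(d, d)
    t = timeit.timeit(lambda: x.zero_(), number=10000)
    print(f"{d}x{d}: {t / 10000 * 1e6:.2f} us per zero_() call")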

POTENTIAL SOLUTION: A higher threshold than at::internal::GRAIN_SIZE could probably be used for the zero_() method, if the overhead of parallelism actually hurts performance in this case. The effect would be more pronounced on machines with a large number of cores and/or multiple NUMA nodes, as OpenMP creates threads equal to the number of cores by default.

The following threshold can be adjusted: https://github.com/pytorch/pytorch/blob/0ea1abe07bfa4377fa92d2f9222a0a0eed095b68/aten/src/ATen/native/Fill.cpp#L110.

However, choosing a threshold would again be based on heuristics, as no single value would suit all machines. Perhaps the decision to use a higher threshold could take into account (see the sketch after this list):

  1. the number of elements, which is already being used, and
  2. dtype().itemsize(), the size of each element.

The two should be multiplied, as their product is a proxy for the number of cache lines that would have to be loaded into the processor cache(s) in order to be zeroed.
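
Purely for illustration, a Python-side sketch of such a combined heuristic; THRESHOLD_BYTES is a made-up value, not a constant from the codebase:

import torch

THRESHOLD_BYTES = 262144  # hypothetical byte threshold, not an actual PyTorch constant

def would_parallelize_zero(t: torch.Tensor) -> bool:
    # Compare the bytes to be zeroed (numel * itemsize) against the threshold,
    # rather than the element count alone.
    return t.numel() * t.element_size() > THRESHOLD_BYTES

print(would_parallelize_zero(torch.empty(182, 182)))                        # 33124 * 4 = 132496 bytes -> False
print(would_parallelize_zero(torch.empty(182, 182, dtype=torch.float64)))   # 33124 * 8 = 264992 bytes -> True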

@VitalyFedyunin : Please advise if you already have a fix in mind for this issue, and if I'm missing something. Thanks!

sparkingdark commented 3 years ago

is this resolved?

Bhavay-2001 commented 2 years ago

@janeyx99, is this issue resolved? Can I work on this issue? Please suggest some resources for resolving it. Thanks

janeyx99 commented 2 years ago

> @janeyx99, is this issue resolved? Can I work on this issue? Please suggest some resources for resolving it. Thanks

Hey @Bhavay192! I don't think this issue has been resolved, as @imaginary-person had closed the linked attempt (not merged). However, I do not have the expertise to propose resources for solving this particular problem. Perhaps @ngimel, @VitalyFedyunin, @imaginary-person, and maybe @peterbell10 could make better suggestions?

aorenste commented 8 months ago

I looked into this and am adding some notes: I think there's a secondary problem, which is that the iterator interface is very slow. Forcing a single thread (so no thread startup/queuing overhead) is 60x slower than the direct call to at::native::zero_().

imaginary-person commented 8 months ago

> I looked into this and am adding some notes: I think there's a secondary problem, which is that the iterator interface is very slow. Forcing a single thread (so no thread startup/queuing overhead) is 60x slower than the direct call to at::native::zero_().

eye_out_cpu calls at::TensorBase::zero_() first, and after that it relies on pointer arithmetic to set some elements to 1. at::TensorBase::zero_() is dispatched to at::native::zero_, which, in turn, calls at::TensorBase::fill_(0) if the number of elements in the tensor exceeds 32768.

@aorenste, could you please elaborate on what exactly is 60x slower than the direct call to at::native::zero_()? Did you mean that at::TensorBase::zero_() with a single thread in the OpenMP thread pool is slower than the direct call to at::native::zero_()? Thanks!

I'm confused because the additional overhead of calling at::TensorBase::zero_ instead of at::native::zero_ seems to be related to the Dispatcher, and not to the TensorIterator implementation:

[screenshot attachment]

aorenste commented 8 months ago

It has nothing to do with thread-pools.

I believe the slowdown is due to the iterator's use of a generic std::function<>, which prevents the compiler from inlining and optimizing the calls. For example, in this case there's an identity function to "convert" the value (from zero to zero). If it were inlined, that function call would disappear - but if you call it through a std::function<>, it can't be inlined and an unnecessary call has to be made.

I wrote a prototype today that replaces the function pointers with template args and the 60x slowdown goes away - so I'm hoping I can make that generic and commit it.

imaginary-person commented 8 months ago

Thanks for clarifying, @aorenste! I'm still confused about what the baseline for the 60x slowdown is and what it's being compared against, because it looks like even at::TensorBase::zero_ eventually calls at::native::zero_. I haven't delved deep into TensorIterator, though, so I'll await your PR.

BTW, would that slowdown still be close to 60x if you preloaded tcmalloc? Thanks!

aorenste commented 8 months ago

I haven't tried preloading tcmalloc - but I can't imagine it would make a difference. This isn't an allocation issue - it's a raw number of CPU cycles issue.

Warning: Lots of conjecture and hand-waving below - but I think it's basically close to correct.

As you point out, this is rooted in at::native::zero_(). That's where the fun begins. If the size of the tensor is less than internal::GRAIN_SIZE, then zero_() has a fast case where it calls zero_cpu_, which calls std::memset(), which is basically a simple loop - effectively:

while (count-- > 0) *ptr++ = value;

However, if the size of the tensor exceeds internal::GRAIN_SIZE then we call self.fill_() which does a bunch of indirection but then eventually turns into a call into at::native::fill_out().

at::native::fill_out() builds an iterator (TensorIteratorConfig) and then calls fill_stub() which (through some more indirection) turns into a call to at::native::fill_kernel().

at::native::fill_kernel() has some special cases for certain ScalarTypes, but the one we're interested in is the generic case, which does:

        cpu_kernel_vec(
            iter,
            [=]() -> scalar_t { return value; },
            [=]() { return Vectorized<scalar_t>(value); });

This is where the problem occurs - cpu_kernel_vec is a generic kernel - you pass it two lambdas to compute the value at each spot in the tensor and it loops over the iterator for you.

Now - if everything works perfectly and the compiler optimizes everything then it all gets inlined and turns into the same fast for loop as above. But we need to be careful - it's a well-known problem that if you put a lambda into a std::function<> then the compiler loses a lot of its ability to optimize it. And when the loop is as small and tight as the one above it doesn't take a lot to make it a LOT slower.

cpu_kernel_vec is defined as:

template <bool check_dynamic_cast=true, typename func_t, typename vec_func_t>
void cpu_kernel_vec(TensorIteratorBase& iter, func_t&& op, vec_func_t&& vop, int64_t grain_size = at::internal::GRAIN_SIZE) {

That's perfect - our lambdas are typed as func_t and vec_func_t so they stay as their original definition.

cpu_kernel_vec calls iter.for_each - and here's where the problem happens. iter.for_each() is defined as:

void TensorIteratorBase::for_each(loop2d_t loop, int64_t grain_size) {

Note: no template arguments.

(Buried under for_each is the parallelization code - but to make sure it wasn't a problem I commented it out so we ONLY did the direct-call non-parallel path and got the same results)

loop2d_t is an alias:

using loop2d_t = c10::function_ref<void(char** data, const int64_t* strides, int64_t size0, int64_t size1)>;

where function_ref is a custom class that basically acts similarly to std::function<> - it's a little lighter weight but still blocks the tightest compiler optimizations.

As a result our loop looks more like this:

while (count--) {
  auto value = call_lambda_to_get_value();
  *ptr++ = value;
}

and that's enough to have a huge slowdown (relative to the tiny loop above).

My prototype changes for_each (and all the calls buried under it) to be templatized and the slowdown goes away.

peterbell10 commented 8 months ago

> As a result our loop looks more like this:

No, it's the function wrapped inside the function_ref that loops over the tensor's data pointer, and for simple cases like this it is only called once. This is why the arguments to the function include the sizes and strides of the memory region being looped over. There is a small overhead here, but it pales in comparison to the actual operation of zeroing out the memory.

Just to double check, are you sure you built with compiler optimizations enabled? That would explain a massive speedup from using memset instead of a custom vectorized loop.
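
For what it's worth, the build configuration can also be double-checked from Python (a quick sketch; look for -O2/-O3 in CXX_FLAGS and at the debug flag):

import torch

print(torch.__config__.show())  # compile-time configuration, including CXX_FLAGS
print(torch.version.debug)      # True only for a debug build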

aorenste commented 8 months ago

You're right - I missed that the loop is buried under the loop call. In that case I'm not sure why changing to template arguments gives such a dramatic speedup. I'll do more digging.

I'm fairly certain I'm building with compiler optimizations enabled - the setup.py summary says that my CXX flags include "-O2".

aorenste commented 8 months ago

I think I found it. It seems to be the switch to parallel_for() that causes the slowdown. I suspect that when I was testing with parallelism turned on and off I wasn't in a release build, and thus it didn't show as much of a difference.

Ok - forget everything I said above.

@peterbell10 is right that the inner loop is buried in a templated class, so the compiler is optimizing away the function calls. The entire issue seems to be the cost of spinning up the parallel threads.

You can see this by setting OMP_NUM_THREADS=1; the cliff goes away.
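
A rough sketch of checking that from Python; the assumption here is that OMP_NUM_THREADS only takes effect if it is set before torch initializes OpenMP, hence the export before the import:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing torch so OpenMP picks it up

import timeit
import torch

for d in (181, 182, 256):
    t = timeit.timeit(lambda: torch.eye(d), number=10000)
    print(f"d={d}: {t / 10000 * 1e6:.2f} us per call")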