ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Preconditioner overhead #164

Open hartwiganzt opened 5 years ago

hartwiganzt commented 5 years ago

All Krylov solvers are designed to take a preconditioner. If no preconditioner is present, an explicit copy of the Krylov vector is performed instead. We have seen that this can introduce significant overhead: for the thermal2 test case, it accounts for 33% of the BiCGSTAB runtime.

We should think about a better solution. Maybe point to the un-preconditioned vector?
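For reference, a minimal sketch (mock types, not the actual Ginkgo classes) of the pattern in question: the solver unconditionally applies whatever preconditioner it was given, and the default Identity operator implements its apply as a full copy of the input vector:

```cpp
#include <vector>

// Mock stand-ins for the Ginkgo types, just to illustrate the call pattern;
// the real code paths are in core/matrix/identity.cpp and the solvers.
struct Vec {
    std::vector<double> values;
};

struct Preconditioner {
    virtual ~Preconditioner() = default;
    virtual void apply(const Vec &b, Vec &x) const = 0;
};

// Default preconditioner when the user sets none: "applying" it amounts to a
// full copy of the Krylov vector, which is the overhead measured here.
struct IdentityPrecond : Preconditioner {
    void apply(const Vec &b, Vec &x) const override
    {
        x.values = b.values;  // semantically x = b, executed as an explicit copy
    }
};

// The solver loop has no special case; the preconditioner is always applied.
void precondition(const Preconditioner &precond, const Vec &r, Vec &z)
{
    precond.apply(r, z);  // with IdentityPrecond this copies the whole vector
}
```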

hartwiganzt commented 5 years ago

Checking the GPU preconditioner application in more detail, it consists of an allocation, a copy, and a free. I can understand the first two parts, but where does the free come from? If it invokes the identity matrix copy, there should be no free...

[LOG] >>> allocation started on Executor[gko::CudaExecutor,0x757ba0] with Bytes[9824360]
[LOG] >>> allocation completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy started from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy completed from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> free started on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
[LOG] >>> free completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
gflegar commented 5 years ago

There should only be a single copy - no allocation and no free - but I can't see where it's coming from right now.

I think I found the problem. Here is what is happening:

  1. Identity matrix's apply calls copy_from: https://github.com/ginkgo-project/ginkgo/blob/develop/core/matrix/identity.cpp#L49
  2. copy_from is generated through the EnableAbstractPolymorphicObject mixin (included through the EnableLinOp => EnablePolymorphicObject => EnableAbstractPolymorphicObject chain), and calls the virtual copy_from_impl method: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L317
    • NOTE: we have a bug with the logger events here - we do not log the polymorphic_object_copy_started and [...]_completed events in this case (we only log it in the base PolymorphicObject version, but not in the mixin ones). Though this is not the cause of the behavior we see here.
  3. copy_from_impl is implemented by EnablePolymorphicObject, and since our result is a Dense vector, it will try to use the convert_to(Dense *) method from the ConvertibleTo<Dense> interface of the source operator: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L461
  4. Since the source is a Dense vector, its ConvertibleTo<Dense> interface is implemented by the EnablePolymorphicAssignment mixin (included through the EnableLinOp => EnablePolymorphicAssignment chain), which calls the Dense assignment operator *result = static_cast<Dense>(*this). And I think the static_cast here is the problem - it will first call the Dense copy constructor to perform the cast, creating a temporary object, and then it will move-assign it to the result. https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L502

I'll try to implement this a bit differently and see if we still encounter the same problem.
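To make step 4 concrete, here is a stand-alone sketch (a hypothetical Dense-like type, not the Ginkgo source) of why the static_cast form materializes an extra temporary, while assigning directly from the object does not:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical Dense-like type that reports which special members run.
struct DenseLike {
    std::vector<double> values;
    DenseLike() = default;
    DenseLike(const DenseLike &other) : values(other.values)
    {
        std::puts("copy constructor: temporary object + new allocation");
    }
    DenseLike(DenseLike &&) = default;
    DenseLike &operator=(const DenseLike &other)
    {
        std::puts("copy assignment: copies values, no temporary");
        values = other.values;
        return *this;
    }
    DenseLike &operator=(DenseLike &&) = default;
};

int main()
{
    DenseLike source, result;
    source.values.resize(1000);

    // What the mixin effectively does: the cast copy-constructs a temporary
    // (allocation + copy of the values), which is move-assigned into result
    // and then destroyed - the allocate/copy/free pattern seen in the log.
    result = static_cast<DenseLike>(source);

    // Assigning from the object directly runs only the copy assignment:
    // one copy, no temporary, no extra allocation/free pair.
    result = source;
}
```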

tcojean commented 5 years ago

To summarize, what is left after the fix in #166 is that we still copy the full residual vector even when the method has no preconditioner (because an "Identity" preconditioner is used in that case). We would like a way to avoid calling the preconditioner at all in such cases, but that implies a rather different algorithm from our current one, for all solvers.

gflegar commented 5 years ago

I would first determine whether that copy is even a bottleneck. What we were measuring before is allocation + copy + free. If you look at the timeline in #166, the copy itself looks negligible, but for some reason there is a gap between the call to the copy and when it actually happens. We should try to figure out why that is, and then fix the root cause.

In general, I wouldn't blindly complicate the implementation just because we think something is the bottleneck without having evidence for it. This also applies to #161.

That's why I would recommend using a profiling tool (either nvvp, or paraver if we need more flexibility), which can give us more details about what is really happening.
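If we go the nvvp route, annotating the regions of interest with NVTX ranges makes the timeline much easier to read; a small illustrative sketch (not part of Ginkgo; link with -lnvToolsExt):

```cpp
#include <nvToolsExt.h>

// Wrapping the region we care about in an NVTX range makes the
// preconditioner application show up as a named interval on the nvvp
// timeline, including any gap between the copy call and its execution.
void timed_precondition_step()
{
    nvtxRangePushA("preconditioner apply");
    // preconditioner->apply(r, z);  // the call we want to see on the timeline
    nvtxRangePop();
}
```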

tcojean commented 5 years ago

I uploaded a trace here, taken on a V100 on Juwels, to figure out the copy time for BiCGSTAB on thermal2.

We extracted the timing by hand with @hartwiganzt for one iteration of BiCGSTAB (technically a "half loop iteration") and compared the preconditioner apply time to all the other kernel calls. Here is what we see:

    + residual check: >100 us (to fix)
    + step1: 47 us
    + preconditioner copy: 30 us
    + spmv: 200 us
    + dot: 25-30 us, two times
    + step2: 70 us

If we remove the residual check, which is still "bugged" in this version, the preconditioner copy is roughly 7% of one half iteration (30 us out of about 47 + 30 + 200 + 2 × 27 + 70 ≈ 400 us). The other half iteration should be similar: for reference, the step3 kernel takes 90 us, and there are again two dots, one spmv, etc.

From this information, we can conclude two things:

  1. This 7% is rather small and might be even smaller for other (denser) matrices. We could consider it acceptable.
  2. If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.

gflegar commented 5 years ago

step1:          3n reads           + 1n writes -> 47 us    (778 GB/s)
precond. copy:  1n reads           + 1n writes -> 30 us    (609 GB/s)
spmv (csr?):    3/2(nnz + n) reads + 1n writes -> 200 us   (593 GB/s)
dot:            2n reads           + some sync -> 25-30 us (609-731 GB/s)
step2:          2n reads           + 1n writes -> 70 us    (392 GB/s)
step3:          5n reads           + 2n writes -> 90 us    (711 GB/s)
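For reference, the effective-bandwidth numbers above can be reproduced from the kernel timings. A small sketch, assuming each Krylov vector is one column of doubles with n = 9,824,360 / 8 = 1,228,045 rows (taken from the allocation size in the log) and interpreting GB as 2^30 bytes; the spmv row is omitted since it also needs the nnz of thermal2:

```cpp
#include <cstdio>

int main()
{
    // Assumed problem size: one double-precision Krylov vector for thermal2,
    // inferred from the 9,824,360-byte allocation logged above.
    const double n = 9824360.0 / 8.0;  // 1,228,045 values per vector
    const double bytes_per_value = 8.0;
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    struct Kernel { const char *name; double values_moved; double time_us; };
    const Kernel kernels[] = {
        {"step1",         4 * n, 47.0},  // 3n reads + 1n writes
        {"precond. copy", 2 * n, 30.0},  // 1n reads + 1n writes
        {"dot",           2 * n, 27.5},  // 2n reads (middle of 25-30 us)
        {"step2",         3 * n, 70.0},  // 2n reads + 1n writes
        {"step3",         7 * n, 90.0},  // 5n reads + 2n writes
    };

    for (const auto &k : kernels) {
        const double bw = k.values_moved * bytes_per_value  // bytes moved
                          / (k.time_us * 1e-6)              // bytes per second
                          / GiB;                            // "GB/s" as above
        std::printf("%-14s %6.0f GB/s\n", k.name, bw);
    }
}
```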

Notes:

Questions remain:

> 2. If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.

At least in the BiCGSTAB case, the preconditioner ties the quantities p and y; and s and z (note that this discussion would be a lot easier if we used meaningful names as per #29).

y is used in the first spmv, finalize, and step3. Between the preconditioner application that generates y and these operations, p is not modified (it is only modified in step1 of the subsequent iteration). Thus, making y a reference to p when the preconditioner is the identity would remove the need for this data copy.

z is used in the second spmv and step3. s is only modified in step2 of the subsequent iteration. Thus, z can reference the same memory as s if the preconditioner is an identity.
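A sketch of how that aliasing could look inside the solver loop (illustrative names and a hypothetical "no real preconditioner" case; this is not the actual Ginkgo BiCGSTAB code):

```cpp
#include <memory>

struct Vec {};  // stand-in for a Dense vector
struct Precond { void apply(const Vec &, Vec &) const {} };

// Hypothetical per-iteration step: only materialize y in its own buffer when a
// real preconditioner is present; otherwise let y alias p and skip the copy.
std::shared_ptr<const Vec> preconditioned(std::shared_ptr<const Vec> p,
                                          std::shared_ptr<Vec> y_buffer,
                                          const Precond *precond)
{
    if (precond != nullptr) {
        precond->apply(*p, *y_buffer);
        return y_buffer;  // y lives in its own storage
    }
    // Identity case: y references the same memory as p. This is safe because p
    // is not written again before step1 of the next iteration, and y is only
    // read (by the first spmv, finalize and step3) in between.
    return p;
}
```

The same pattern would cover s and z, with z aliasing s, since s is only modified in step2 of the subsequent iteration.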

gflegar commented 5 years ago

Just ran 1_Utilities/p2pBandwidthLatencyTest on Juwels. Intra-device copy should achieve around 750 GB/s. Note, though, that we are using cudaMemcpyPeer, while the CUDA sample uses cudaMemcpyPeerAsync to do the copy.
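For reference, these are the two runtime calls being compared; the async variant additionally takes a stream, so the copy is enqueued rather than issued directly. A minimal sketch (error handling omitted):

```cpp
#include <cuda_runtime.h>

// Minimal illustration of the two copy variants mentioned above.
void copy_variants(void *dst, int dst_dev, const void *src, int src_dev,
                   size_t bytes, cudaStream_t stream)
{
    // What Ginkgo currently uses according to the comment above.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);

    // What the CUDA sample uses: the copy is enqueued on a stream.
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, bytes, stream);
}
```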

Detailed results:

```
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 5d, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 5e, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 88, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0      1      2      3
     0     1      1      1      1
     1     1      1      1      1
     2     1      1      1      1
     3     1      1      1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 741.22   9.78  11.01  11.00
     1   9.86 749.76  11.02  10.99
     2  11.03  10.97 751.20   9.74
     3  11.04  10.98   9.78 749.76
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 741.22  48.60  48.57  48.58
     1  48.58 749.76  48.58  48.58
     2  48.60  48.60 751.20  48.60
     3  48.57  48.58  48.57 752.65
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 757.76  10.39  17.80  18.95
     1  10.37 755.56  17.24  17.89
     2  17.82  17.33 759.23  10.40
     3  18.88  17.73  10.42 759.97
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 753.38  97.01  96.94  96.99
     1  97.01 757.03  97.03  97.03
     2  97.04  97.06 756.29  97.04
     3  97.06  97.04  97.04 754.83
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   2.11  17.87  18.66  18.62
     1  17.52   2.16  18.74  18.66
     2  18.53  18.51   2.15  17.57
     3  18.05  17.97  17.52   2.18

   CPU     0      1      2      3
     0   3.73   9.35   9.44   9.59
     1   8.98   3.52   9.25   9.46
     2   9.32   9.18   3.75   9.75
     3   9.38   9.20   9.71   3.81
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   2.14   2.09   2.09   2.10
     1   2.17   2.17   2.17   2.17
     2   2.11   2.10   2.15   2.12
     3   2.11   2.10   2.11   2.20

   CPU     0      1      2      3
     0   4.00   2.85   2.67   2.67
     1   2.65   3.61   2.74   2.64
     2   2.82   2.86   3.72   2.81
     3   2.86   2.87   2.83   3.87

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```