ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Preconditioner overhead #164

Open hartwiganzt opened 5 years ago

hartwiganzt commented 5 years ago

All Krylov solvers are designed to take a preconditioner. If no preconditioner is present, an explicit copy of the Krylov vector is performed instead. We have seen that this can introduce significant overhead: for the thermal2 test case, it accounts for 33% of the BiCGSTAB runtime.

We should think about a better solution. Maybe point to the un-preconditioned vector?
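For reference, a minimal sketch (mock types, not the actual Ginkgo classes) of the pattern in question: the solver unconditionally applies whatever preconditioner it was given, and the default Identity operator implements its apply as a full copy of the input vector:

```cpp
#include <vector>

// Mock stand-ins for the Ginkgo types, just to illustrate the call pattern;
// the real code paths are in core/matrix/identity.cpp and the solvers.
struct Vec {
    std::vector<double> values;
};

struct Preconditioner {
    virtual ~Preconditioner() = default;
    virtual void apply(const Vec &b, Vec &x) const = 0;
};

// Default preconditioner when the user sets none: "applying" it amounts to a
// full copy of the Krylov vector, which is the overhead measured here.
struct IdentityPrecond : Preconditioner {
    void apply(const Vec &b, Vec &x) const override
    {
        x.values = b.values;  // semantically x = b, executed as an explicit copy
    }
};

// The solver loop has no special case; the preconditioner is always applied.
void precondition(const Preconditioner &precond, const Vec &r, Vec &z)
{
    precond.apply(r, z);  // with IdentityPrecond this copies the whole vector
}
```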

hartwiganzt commented 5 years ago

Checking the GPU preconditioner application in more detail, it consists of an allocation, a copy, and a free. I can understand the first two parts, but where does the free come from? If it invokes the identity matrix copy, there should be no free...

[LOG] >>> allocation started on Executor[gko::CudaExecutor,0x757ba0] with Bytes[9824360]
[LOG] >>> allocation completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy started from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy completed from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> free started on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
[LOG] >>> free completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
gflegar commented 5 years ago

There should only be a single copy - no allocation and no free - but I can't see where it's coming from right now.

I think I found the problem. Here is what is happening:

  1. Identity matrix's apply calls copy_from: https://github.com/ginkgo-project/ginkgo/blob/develop/core/matrix/identity.cpp#L49
  2. copy_from is generated through the EnableAbstractPolymorphicObject mixin (included through the EnableLinOp => EnablePolymorphicObject => EnableAbstractPolymorphicObject chain), and calls the virtual copy_from_impl method: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L317
    • NOTE: we have a bug with the logger events here - we do not log the polymorphic_object_copy_started and [...]_completed events in this case (we only log it in the base PolymorphicObject version, but not in the mixin ones). Though this is not the cause of the behavior we see here.
  3. copy_from_impl is implemented by EnablePolymorphicObject, and since our result is a Dense vector, it will try to use the convert_to(Dense *) method from the ConvertibleTo<Dense> interface of the source operator: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L461
  4. Since the source is a Dense vector, its ConvertibleTo<Dense> interface is implemented by the EnablePolymorphicAssignment mixin (included through the EnableLinOp => EnablePolymorphicAssignment chain), which calls the Dense assignment operator *result = static_cast<Dense>(*this). And I think the static_cast here is the problem - it will first call the Dense copy constructor to perform the cast, creating a temporary object, and then it will move-assign it to the result. https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L502

I'll try to implement this a bit differently and see if we still encounter the same problem.
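To make step 4 concrete, here is a stand-alone sketch (a hypothetical Dense-like type, not the Ginkgo source) of why the static_cast form materializes an extra temporary, while assigning directly from the object does not:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical Dense-like type that reports which special members run.
struct DenseLike {
    std::vector<double> values;
    DenseLike() = default;
    DenseLike(const DenseLike &other) : values(other.values)
    {
        std::puts("copy constructor: temporary object + new allocation");
    }
    DenseLike(DenseLike &&) = default;
    DenseLike &operator=(const DenseLike &other)
    {
        std::puts("copy assignment: copies values, no temporary");
        values = other.values;
        return *this;
    }
    DenseLike &operator=(DenseLike &&) = default;
};

int main()
{
    DenseLike source, result;
    source.values.resize(1000);

    // What the mixin effectively does: the cast copy-constructs a temporary
    // (allocation + copy of the values), which is move-assigned into result
    // and then destroyed - the allocate/copy/free pattern seen in the log.
    result = static_cast<DenseLike>(source);

    // Assigning from the object directly runs only the copy assignment:
    // one copy, no temporary, no extra allocation/free pair.
    result = source;
}
```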

tcojean commented 5 years ago

To summarize, what is left after the fix in #166 is that we still copy the full residual vector even when the method has no preconditioner (because an "Identity" preconditioner is used in that case). We would like a way to avoid calling the preconditioner at all in such cases, but that implies a rather different algorithm from our current one, for all solvers.

gflegar commented 5 years ago

I would first determine whether that copy is even a bottleneck. What we were measuring before is allocation + copy + free. If you look at the timeline in #166, the copy itself looks negligible, but for some reason there is a gap between the call to the copy and when it actually happens. We should try to figure out why that is, and then fix the root cause.

In general, I wouldn't blindly complicate the implementation just because we think something is the bottleneck without having evidence for it. This also applies to #161.

That's why I would recommend using a profiling tool (either nvvp, or paraver if we need more flexibility), which can give us more details about what is really happening.
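If we go the nvvp route, annotating the regions of interest with NVTX ranges makes the timeline much easier to read; a small illustrative sketch (not part of Ginkgo; link with -lnvToolsExt):

```cpp
#include <nvToolsExt.h>

// Wrapping the region we care about in an NVTX range makes the
// preconditioner application show up as a named interval on the nvvp
// timeline, including any gap between the copy call and its execution.
void timed_precondition_step()
{
    nvtxRangePushA("preconditioner apply");
    // preconditioner->apply(r, z);  // the call we want to see on the timeline
    nvtxRangePop();
}
```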

tcojean commented 5 years ago

I uploaded a trace here, taken on a V100 on Juwels, to figure out the copy time for BiCGSTAB on thermal2.

We extracted the timing by hand with @hartwiganzt for one iteration of BiCGSTAB (technically a "half loop iteration") and compared the preconditioner apply time to all the other kernel calls. Here is what we see:

    + residual check: >100 us (to fix)
    + step1: 47 us
    + preconditioner copy: 30 us
    + spmv: 200 us
    + dot: 25-30 us, two times
    + step2: 70 us

If we remove the residual check, which is still "bugged" in this version, the preconditioner copy is roughly 7% of one half iteration (30 us out of about 47 + 30 + 200 + 2 × 27 + 70 ≈ 400 us). The other half iteration should be similar: for reference, the step3 kernel takes 90 us, and there are again two dots, one spmv, etc.

From this information, we can conclude two things:

  1. This 7% is rather small and might be even smaller for other (denser) matrices. We could consider it acceptable.
  2. If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.

gflegar commented 5 years ago

step1:          3n reads           + 1n writes -> 47 us    (778 GB/s)
precond. copy:  1n reads           + 1n writes -> 30 us    (609 GB/s)
spmv (csr?):    3/2(nnz + n) reads + 1n writes -> 200 us   (593 GB/s)
dot:            2n reads           + some sync -> 25-30 us (609-731 GB/s)
step2:          2n reads           + 1n writes -> 70 us    (392 GB/s)
step3:          5n reads           + 2n writes -> 90 us    (711 GB/s)
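For reference, the effective-bandwidth numbers above can be reproduced from the kernel timings. A small sketch, assuming each Krylov vector is one column of doubles with n = 9,824,360 / 8 = 1,228,045 rows (taken from the allocation size in the log) and interpreting GB as 2^30 bytes; the spmv row is omitted since it also needs the nnz of thermal2:

```cpp
#include <cstdio>

int main()
{
    // Assumed problem size: one double-precision Krylov vector for thermal2,
    // inferred from the 9,824,360-byte allocation logged above.
    const double n = 9824360.0 / 8.0;  // 1,228,045 values per vector
    const double bytes_per_value = 8.0;
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    struct Kernel { const char *name; double values_moved; double time_us; };
    const Kernel kernels[] = {
        {"step1",         4 * n, 47.0},  // 3n reads + 1n writes
        {"precond. copy", 2 * n, 30.0},  // 1n reads + 1n writes
        {"dot",           2 * n, 27.5},  // 2n reads (middle of 25-30 us)
        {"step2",         3 * n, 70.0},  // 2n reads + 1n writes
        {"step3",         7 * n, 90.0},  // 5n reads + 2n writes
    };

    for (const auto &k : kernels) {
        const double bw = k.values_moved * bytes_per_value  // bytes moved
                          / (k.time_us * 1e-6)              // bytes per second
                          / GiB;                            // "GB/s" as above
        std::printf("%-14s %6.0f GB/s\n", k.name, bw);
    }
}
```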

Notes:

Questions remain:

> 2. If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.

At least in the BiCGSTAB case, the preconditioner ties the quantities p and y; and s and z (note that this discussion would be a lot easier if we used meaningful names as per #29).

y is used in the first spmv, finalize, and step3. Between the preconditioner application that generates y and these operations, p is not modified (it is only modified in step1 of the subsequent iteration). Thus, making y a reference to p when the preconditioner is the identity would remove the need for this data copy.

z is used in the second spmv and step3. s is only modified in step2 of the subsequent iteration. Thus, z can reference the same memory as s if the preconditioner is an identity.
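A sketch of how that aliasing could look inside the solver loop (illustrative names and a hypothetical "no real preconditioner" case; this is not the actual Ginkgo BiCGSTAB code):

```cpp
#include <memory>

struct Vec {};  // stand-in for a Dense vector
struct Precond { void apply(const Vec &, Vec &) const {} };

// Hypothetical per-iteration step: only materialize y in its own buffer when a
// real preconditioner is present; otherwise let y alias p and skip the copy.
std::shared_ptr<const Vec> preconditioned(std::shared_ptr<const Vec> p,
                                          std::shared_ptr<Vec> y_buffer,
                                          const Precond *precond)
{
    if (precond != nullptr) {
        precond->apply(*p, *y_buffer);
        return y_buffer;  // y lives in its own storage
    }
    // Identity case: y references the same memory as p. This is safe because p
    // is not written again before step1 of the next iteration, and y is only
    // read (by the first spmv, finalize and step3) in between.
    return p;
}
```

The same pattern would cover s and z, with z aliasing s, since s is only modified in step2 of the subsequent iteration.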

gflegar commented 5 years ago

Just ran 1_Utilities/p2pBandwidthLatencyTest on Juwels. Intra-device copy should achieve around 750 GB/s. Note, though, that we are using cudaMemcpyPeer, while the CUDA sample uses cudaMemcpyPeerAsync to do the copy.
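For reference, these are the two runtime calls being compared; the async variant additionally takes a stream, so the copy is enqueued rather than issued directly. A minimal sketch (error handling omitted):

```cpp
#include <cuda_runtime.h>

// Minimal illustration of the two copy variants mentioned above.
void copy_variants(void *dst, int dst_dev, const void *src, int src_dev,
                   size_t bytes, cudaStream_t stream)
{
    // What Ginkgo currently uses according to the comment above.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);

    // What the CUDA sample uses: the copy is enqueued on a stream.
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, bytes, stream);
}
```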

Detailed results:

```
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 5d, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 5e, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 88, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0      1      2      3
     0     1      1      1      1
     1     1      1      1      1
     2     1      1      1      1
     3     1      1      1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 741.22   9.78  11.01  11.00
     1   9.86 749.76  11.02  10.99
     2  11.03  10.97 751.20   9.74
     3  11.04  10.98   9.78 749.76
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 741.22  48.60  48.57  48.58
     1  48.58 749.76  48.58  48.58
     2  48.60  48.60 751.20  48.60
     3  48.57  48.58  48.57 752.65
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 757.76  10.39  17.80  18.95
     1  10.37 755.56  17.24  17.89
     2  17.82  17.33 759.23  10.40
     3  18.88  17.73  10.42 759.97
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 753.38  97.01  96.94  96.99
     1  97.01 757.03  97.03  97.03
     2  97.04  97.06 756.29  97.04
     3  97.06  97.04  97.04 754.83
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   2.11  17.87  18.66  18.62
     1  17.52   2.16  18.74  18.66
     2  18.53  18.51   2.15  17.57
     3  18.05  17.97  17.52   2.18

   CPU     0      1      2      3
     0   3.73   9.35   9.44   9.59
     1   8.98   3.52   9.25   9.46
     2   9.32   9.18   3.75   9.75
     3   9.38   9.20   9.71   3.81
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   2.14   2.09   2.09   2.10
     1   2.17   2.17   2.17   2.17
     2   2.11   2.10   2.15   2.12
     3   2.11   2.10   2.11   2.20

   CPU     0      1      2      3
     0   4.00   2.85   2.67   2.67
     1   2.65   3.61   2.74   2.64
     2   2.82   2.86   3.72   2.81
     3   2.86   2.87   2.83   3.87

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```