pytorch / glow

Compiler for Neural Network hardware accelerators
Apache License 2.0

Multi-threaded CPU kernels #1749

Open nhynes opened 5 years ago

nhynes commented 5 years ago

Is there any plan to add support for running data-parallel operations in several CPU threads? Threading gives frameworks like TVM almost linear speedup per core, and it'd be nice to see in Glow!

As for implementation, I imagine something like OpenCL's enqueueKernel would be the right abstraction.

bertmaher commented 5 years ago

This would be cool, I agree. Can you elaborate on your enqueueKernel idea? I'm only somewhat familiar with OpenCL.

nhynes commented 5 years ago

re: enqueueKernel, I was going off of the code already used for SoftMax. For the CPU, I imagine it would look like splitting the outer loops of kernels and pushing each task onto several worker threads' queues. In TVM, at least, this is done at runtime: the compiler generates closures plus calls into the runtime, which adds those closures to a thread pool's work queues.

If people think threading would be nice to have, I have good working knowledge of existing designs and could prepare a proposal for how this could be implemented in Glow.

nadavrot commented 5 years ago

@nhynes Do you have a specific use-case in mind? I agree with your analogy to OpenCL. OpenCL views the CPU, with all of its cores, as a single compute device and splits the work across multiple cores. This comparison makes sense. I agree that having a multithreaded implementation of some operators is the best way to implement low-latency optimized networks. However, I think that this design would clash with a design that maximizes throughput. Most production systems care about batches of inputs. If you care about the total number of inputs processed by the system, then a better design would be to split the graph across multiple cores (a data-flow architecture) in a way that minimizes the transfer of memory in and out of the cache. @bertmaher started working on such partitioning.

This is something that needs to be designed. It's really easy to wrap the outermost loop in Convolution with an OpenMP pragma, but I don't think that this is the right long-term direction. Also, it would be really difficult to manage the complexity of the system as we implement production-level inference that takes multiple batches into account. Let's not do the easy thing; let's plan ahead and solve the big picture.

nhynes commented 5 years ago

Thanks for your reply @nadavrot. The use case I had in mind is actually very specific (so much so that not even OpenMP would suffice). Basically, the idea is to link a training model into a trusted hardware enclave. Enclaves have a restricted programming model, so porting PyTorch or even OpenBLAS isn't possible. Moreover, enclaves can't spawn threads and need to request them from the untrusted OS. Also, context switching between enclaves has very high overhead, so splitting the model across cores might hurt performance. I haven't actually considered cache locality, though, so it might be worth testing out! Is there an easy way to do this in Glow?

I'm currently using TVM because it supports training and threads, but the architecture of Glow is (imho) much cleaner and has better integration with a more popular ML framework. Here's a tech report describing the approach, if you're interested.

nadavrot commented 5 years ago

@nhynes Ah, I understand. Thank you. That's an interesting use-case that I never would have guessed. I am not familiar with the secure enclave but it sounds interesting.

Using OpenMP (or something similar) to parallelize the outermost loop of convolution will ensure that when the convolution weights are brought in, all of the cores can share that memory, or at least not compete and evict each other's cache lines. Of course, this is only relevant for the last-level cache, which is large enough to hold the convolution weights.
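For illustration, a minimal sketch of that kind of loop-level parallelization on a hypothetical naive direct convolution (NHWC, stride 1, no padding). This is not Glow's libjit kernel, just the shape of the idea:

// Hypothetical naive NHWC convolution (valid padding, stride 1), with the
// outermost loops parallelized via OpenMP. Loop bounds and names are
// illustrative only.
void conv2d_nhwc_naive(const float *in, const float *filter, float *out,
                       int N, int H, int W, int Cin, int Cout, int K) {
  const int OH = H - K + 1, OW = W - K + 1;
  #pragma omp parallel for collapse(2)
  for (int n = 0; n < N; ++n) {
    for (int co = 0; co < Cout; ++co) {
      for (int oh = 0; oh < OH; ++oh) {
        for (int ow = 0; ow < OW; ++ow) {
          float sum = 0.0f;
          for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw)
              for (int ci = 0; ci < Cin; ++ci)
                sum += in[((n * H + oh + kh) * W + ow + kw) * Cin + ci] *
                       filter[((co * K + kh) * K + kw) * Cin + ci];
          out[((n * OH + oh) * OW + ow) * Cout + co] = sum;
        }
      }
    }
  }
}

Every thread reads the same filter array, so the weights can stay resident in the shared last-level cache while the output work is divided across cores.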

nhynes commented 5 years ago

And here's the poster for Devcon which might help describe and motivate the use case :)

pytorch-poster.pdf

bertmaher commented 5 years ago

I’d be interested in a design proposal for threading, especially since you have deep knowledge of other frameworks. I can’t promise whether it would mesh with our higher level goals (as @nadavrot mentioned we’re most interested in high throughput server workloads), or where it would fall in our roadmap, but at least it could be a useful point in the design space.

QiJune commented 5 years ago

Actually, we want to map a computation graph onto multi-core hardware. From the previous discussion between @nhynes and @nadavrot, there are two kinds of mapping methods: parallelizing a single operator across several cores (intra-operator parallelism), and partitioning the graph so that different operators run on different cores (data-flow parallelism).

There are some tradeoffs between latency and throughput in these two methods. As @bertmaher mentioned, there exists a design space; we may find an optimal point under some constraints after several passes of exploration.

What I am wondering is how to represent a multi-core program in our low-level IR. A natural way would be to generate low-level IR for each core.

Following is an example which maps a Convolution operator onto a two-core system using the output-channel-stationary division method. After the Convolution, Relu is done only on core0.

The original low-level IR is the following:

declare {
  %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
  %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
}

program {
  %allo = alloc float<8 x 28 x 28 x 16>
  %conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias0
  %allo0 = alloc float<8 x 28 x 28 x 16>
  %relu = relu @out %allo0, @in %allo
}

The transformed low-level IR after output-channel-stationary division is the following (please note that the default layout in Glow is NHWC):

declare {
  %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
  %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
}

program-core0 {
  %allo-core0 = alloc float<8 x 28 x 28 x 8>
  %conv-core0 = convolution [5 1 2 8] @out %allo-core0, @in %input, @in %filter3, @in %bias0
  sync_device_api
  %allo0-core0 = alloc float<8 x 28 x 28 x 16>
  %relu-core0 = relu @out %allo0-core0, @in %allo-core0
  %relu-core0 + 8 x 28 x 28 x 8  = relu @out (%allo0-core0 + 8 x 28 x 28 x 8) , @in %allo-core1
}

program-core1 {
  %allo-core1 = alloc float<8 x 28 x 28 x 8>
  %conv-core1 = convolution [5 1 2 8] @out %allo-core1, @in %input, @in %filter3, @in %bias0
  sync_device_api
}

This is just a trial version and may contain some mistakes. I am not sure if the design breaks the whole roadmap. Any comments are welcome. Thanks!

nhynes commented 5 years ago

@QiJune that's an interesting proposal. Partitioning operators per core would certainly help when the workload is consistent over the entire graph (e.g., ResNet, stacked RNNs) or on a NUMA machine. Reasoning about what should go where (i.e., load balancing the FLOPs) might be a bit challenging, though.

Parallelizing ops is a bit more straightforward. For comparison, TVM will parallelize the data-parallel outer loops of individual operators.

To do this, the TVM compiler will actually pack the computation into a lambda function and send that to the runtime for execution. I'm still piecing together how this should be implemented in the Glow compiler, but the approach of packing up microtasks and shelling out to the runtime is very flexible and, indeed, is what OpenMP does.
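To make that concrete, here is a minimal, hypothetical sketch of the pattern -- not TVM's or Glow's actual runtime API. The compiled kernel would call something like parallel_for below, capturing one slice of its outer loop per closure, while a persistent worker pool executes the slices:

#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal persistent thread pool: the "runtime" that generated code calls into.
class ThreadPool {
public:
  explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
    if (n == 0)
      n = 1;
    for (unsigned i = 0; i < n; ++i)
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty())
              return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();
        }
      });
  }
  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto &w : workers_)
      w.join();
  }
  // The compiler-generated closures are pushed onto the shared work queue.
  void enqueue(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(fn));
    }
    cv_.notify_one();
  }

private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};

// What the compiler would emit for a data-parallel op: chop the outer loop into
// chunks, enqueue one closure per chunk, and block until every chunk finished.
void parallel_for(ThreadPool &pool, std::size_t begin, std::size_t end,
                  const std::function<void(std::size_t, std::size_t)> &body) {
  if (begin >= end)
    return;
  const std::size_t workers =
      std::max<std::size_t>(1, std::thread::hardware_concurrency());
  const std::size_t step = (end - begin + workers - 1) / workers;
  std::mutex mu;
  std::condition_variable done;
  std::size_t remaining = (end - begin + step - 1) / step; // number of chunks
  for (std::size_t lo = begin; lo < end; lo += step) {
    const std::size_t hi = std::min(end, lo + step);
    pool.enqueue([&, lo, hi] {
      body(lo, hi); // one slice of the original outer loop
      std::lock_guard<std::mutex> lock(mu);
      if (--remaining == 0)
        done.notify_one();
    });
  }
  std::unique_lock<std::mutex> lock(mu);
  done.wait(lock, [&] { return remaining == 0; });
}

// Example use, e.g. for a ReLU kernel over numElements floats:
//   ThreadPool pool;
//   parallel_for(pool, 0, numElements, [&](std::size_t lo, std::size_t hi) {
//     for (std::size_t i = lo; i < hi; ++i) out[i] = std::max(0.0f, in[i]);
//   });

The generated code only needs to know the parallel_for entry point; how many workers exist and how the slices are scheduled stays a runtime concern.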

If you'd like, I can start a shared doc where we can collaborate on drafting a multi-threading proposal. I'm quite interested in such a feature and would be glad to help make it happen!

QiJune commented 5 years ago

@nhynes Sure, we can work together on a design doc first. The Glow compiler provides a very clean abstraction at the graph level. For the high-performance operator implementations, we can borrow some ideas from the TVM compiler.

nhynes commented 5 years ago

Cool, here's a link to the doc which I just created. I'll add to it over the course of this week.

From a high level, it looks like the right place to add the parallelism transformation is during transformPostLowering. It could be done during low-level IR generation (which would allow optimizations for things like reduced filter depths), but backends like OpenCL have their own notion of parallelism/microtasks, and the number of CPUs might be variable. It definitely needs to happen after compilation becomes specific to a backend.

QiJune commented 5 years ago

@nhynes Here is also a doc on adding a NewBackendSpecificNode, for reference.

mratsim commented 5 years ago

There are L1, L2, and L3 cache considerations for parallelism, as well as false sharing to solve, especially for operations that involve a reduction, like matrix multiplication and convolution.

You cannot just parallelize the outer loop and hope to get a good result.

If we take the matmul implemented in libjit and apply the analysis from Anatomy of High-Performance Many-Threaded Matrix Multiplication, we have the following parallelisation possibilities:

https://github.com/pytorch/glow/blob/d0fa695ba1eff8e64ab59520fc37247127f5adc1/lib/Backends/CPU/libjit/libjit_matmul.cpp#L225-L244

Other options include parallelising inside the inner GEBP kernel: https://github.com/pytorch/glow/blob/d0fa695ba1eff8e64ab59520fc37247127f5adc1/lib/Backends/CPU/libjit/libjit_matmul.cpp#L150-L158
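For readers without the paper at hand, here is a rough schematic of the GotoBLAS/BLIS five-loop structure that both permalinks point into, annotated with the usual parallelisation candidates (block sizes are placeholders, and the loop names follow the paper, not the libjit source):

#include <algorithm>

// Placeholder cache/register blocking parameters (real kernels derive these
// from the cache hierarchy and the micro-kernel register tile).
constexpr int MC = 256, NC = 4096, KC = 256, MR = 6, NR = 16;

void gemm_loop_schematic(int M, int N, int K) {
  // 5th loop (jc): very coarse-grained; a good fit across sockets/NUMA nodes.
  for (int jc = 0; jc < N; jc += NC)
    // 4th loop (pc): a reduction over K -- rarely parallelised.
    for (int pc = 0; pc < K; pc += KC)
      // 3rd loop (ic): the "mc parallel" option benchmarked below; each thread
      // packs and reuses its own L2-resident block of A.
      for (int ic = 0; ic < M; ic += MC)
        // 2nd loop (jr): the "nr parallel" option; threads share the packed A block.
        for (int jr = 0; jr < std::min(N - jc, NC); jr += NR)
          // 1st loop (ir) + MRxNR register-tile micro-kernel: kept serial.
          for (int ir = 0; ir < std::min(M - ic, MC); ir += MR) {
            /* micro-kernel body elided */
          }
}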

My experience

I tried options 3 and 4 on my dual-core i5-5257U (mobile Broadwell, 2.7 GHz, 3.1 GHz turbo). The theoretical peak of my CPU is 172.8 GFLOP/s (2.7 GHz × 32 FLOPs/cycle × 2 cores):

| Backend | Average time | Performance |
| --- | --- | --- |
| OpenBLAS | 97.464 ms | 145.241 GFLOP/s |
| My own code - mc parallel | 103.659 ms | 136.561 GFLOP/s |
| Halide | - | ~128 GFLOP/s |
| Apple Accelerate | 114.519 ms | 123.611 GFLOP/s |
| My own code - nr parallel | - | ~121 GFLOP/s |

Benchmark - Halide bench

nadavrot commented 5 years ago

@mratsim Thank you for this amazing analysis. Do you have performance numbers for the Glow implementation? Also, what matrix sizes were tested?

At the moment @gcatron and @beicy are working on implementing a different dimension of parallelism. They are partitioning the whole neural network across multiple cores and implementing data-flow parallelism. This has the advantage that the weights of part of the network stay pinned to a specific core. I think that distributing the neural network across cores should be more efficient than distributing a specific operator when considering the problem of pipelining lots of data. What do you think?

nhynes commented 5 years ago

Fantastic benchmarks, for sure! @mratsim would you mind running one additional benchmark using TVM? The code and lowered schedule can be found here: https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#parallel

mratsim commented 5 years ago

Ah, sorry, I didn't mention the size: it was M=N=K=1920, probably a bit big for NN matrices, but I wanted the general case done before tackling small matrices.

I will try to run the bench from Nim, the language I'm using, if I manage to wrap them (Nim can compile and call C++).

Edit

Regarding dataflow parallelism, I think it's needed for distributed learning on clusters. It's a bit like OpenMP + MPI: both are complementary; see this OpenMP+MPI presentation.

Abstracting that would be great, and this is something TensorFlow has been actively pursuing recently with their [DistributionStrategy](https://www.tensorflow.org/api_docs/python/tf/contrib/distribute/DistributionStrategy) API, though I'm confused about its difference from their existing TensorFlow Distributed.

mratsim commented 5 years ago

I've done the benchmark for libjit; it reaches 85% of single-threaded OpenBLAS. Not bad!

Changes for review https://github.com/numforge/laser/commit/2c9c8f10bda552468c6673b1234cb629f9a2bdcc

$ OPENBLAS_NUM_THREADS=1 ./build/bench_gemm
Warmup: 1.1973 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:     86.400 GFLOP/s
Theoretical peak multi:          172.800 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.893 seconds
Average time: 189.041 ms
Stddev  time: 2.558 ms
Min     time: 186.120 ms
Max     time: 193.507 ms
Perf:         74.882 GFLOP/s

Display output[0] to make sure it's not optimized away
470.7781372070312

Laser production implementation
Collected 10 samples in 1.942 seconds
Average time: 193.975 ms
Stddev  time: 4.279 ms
Min     time: 190.571 ms
Max     time: 205.327 ms
Perf:         72.978 GFLOP/s

Display output[0] to make sure it's not optimized away
470.778076171875

PyTorch Glow: libjit matmul implementation
Collected 10 samples in 2.216 seconds
Average time: 221.567 ms
Stddev  time: 2.270 ms
Min     time: 218.499 ms
Max     time: 225.679 ms
Perf:         63.889 GFLOP/s

Display output[0] to make sure it's not optimized away
470.778076171875

@nhynes For the TVM bench (I also would like to properly integrate Halide), I need some more time to see how to call them from Nim.

mratsim commented 5 years ago

@laurae2 ran the benchmark on a dual Xeon Gold 6154, and Glow also reaches 66 GFLOP/s even though the theoretical single-threaded peak is 118 GFLOP/s, at M=N=K=2304 (2304 = 32*72, with 72 being the number of cores) (https://github.com/numforge/laser/issues/9).

I think Glow is memory-bandwidth starved.

mratsim commented 5 years ago

I have a new workstation with an overclocked skylake-X i9-9980XE at

In the process of adding AVX512 (https://github.com/numforge/laser/pull/14), I reran the benchmark (PyTorch Glow compiled with -mavx -mfma; -mavx512f was significantly reducing performance), and unfortunately Glow is really bandwidth-starved even when single-threaded on much faster clocks.

My previous benchmark was done on the i5-5257U 2.7 GHz (3.1 GHz all-core turbo), a mobile Broadwell from a 2015 MacBook Pro, so memory was single- or dual-channel.

The following benchmark is done on the i9 (quad-channel memory):

A matrix shape: (M: 3840, N: 3840)
B matrix shape: (M: 3840, N: 3840)
Output shape: (M: 3840, N: 3840)
Required number of operations: 113246.208 millions
Required bytes:                  117.965 MB
Arithmetic intensity:            960.000 FLOP/byte
Theoretical peak single-core AVX512:    230.400 GFLOP/s
Theoretical peak multi AVX512:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.499 seconds
Average time: 49.279 ms
Stddev  time: 3.924 ms
Min     time: 47.855 ms
Max     time: 60.436 ms
Perf:         2298.061 GFLOP/s

Display output[0] to make sure it's not optimized away
950.1965942382812
Laser production implementation - single threaded - AVX512
Collected 10 samples in 6.828 seconds
Average time: 682.218 ms
Stddev  time: 9.549 ms
Min     time: 667.896 ms
Max     time: 693.479 ms
Perf:         165.997 GFLOP/s

Display output[0] to make sure it's not optimized away
950.1968383789062

PyTorch Glow: libjit matmul implementation - AVX+FMA
Collected 10 samples in 17.060 seconds
Average time: 1705.967 ms
Stddev  time: 0.332 ms
Min     time: 1705.659 ms
Max     time: 1706.847 ms
Perf:         66.382 GFLOP/s

Display output[0] to make sure it's not optimized away
950.1965942382812

This confirms @laurae2's benchmarks on a dual Xeon Gold 6154 (hexa-channel memory):

Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes:                   42.467 MB
Arithmetic intensity:            576.000 FLOP/byte
Theoretical peak single-core AVX+FMA:    118.400 GFLOP/s
Theoretical peak multi AVX+FMA:         4262.400 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev  time: 2.274 ms
Min     time: 5.648 ms
Max     time: 28.398 ms
Perf:         3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

Laser production implementation - AVX2
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev  time: 12.542 ms
Min     time: 35.367 ms
Max     time: 121.945 ms
Perf:         606.927 GFLOP/s  ---> NUMA parallelization issue

Display output[0] to make sure it's not optimized away
566.68505859375

PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev  time: 3.071 ms
Min     time: 362.655 ms
Max     time: 380.193 ms
Perf:         66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938

Side-note on many-cores parallelism

When parallelizing on the i9 (18 cores) I could only reach 1.8 TFLOP/s with huge 3840*3840 matrices, even though single-threaded performance was 166 GFLOP/s.

Parallelizing a single loop is probably not enough with so many cores and BLIS has further advice in their multithreading README.

I have not yet checked how they implement nested parallelism without oversubscribing (thread explosion: say, 18 threads on the first loop, then 18*18 on the second one). Intel suggests using the recent OpenMP task construct in their OpenMP nested parallelism article. I guess Intel TBB would also be a good match when using a task-based model.

mratsim commented 5 years ago

Some updates on parallelization in my own library.

Matrix multiplication and convolutions are probably the hardest to parallelize, and I found a clean way, using OpenMP tasks, to parallelize across the M and N dimensions while staying easy to maintain and understand: https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/primitives/matrix_multiplication/gemm.nim#L74-L78

  # #####################################
  # 4. for jr = 0,...,nc−1 in steps of nr
  for jr in countup(0, tiles.nc-1, NR):
    omp_task("firstprivate(`jr`)"):
      let nr = min(nc - jr, NR)                        # C[ic:ic+mc, jc+jr:jc+jr+nr]

https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/primitives/matrix_multiplication/gemm.nim#L163-L166

    omp_parallel_if(parallelize):
      # ####################################
      # 3. for ic = 0,...,m−1 in steps of mc
      omp_for(ict, tiles.ic_num_tasks, use_simd=false, nowait=true):

In Glow, that boils down to replacing this

https://github.com/pytorch/glow/blob/f4ab7a946922efa501a9db886c71ae5b834d4f6a/lib/Backends/CPU/libjit/libjit_matmul.cpp#L150-L158

and

https://github.com/pytorch/glow/blob/f4ab7a946922efa501a9db886c71ae5b834d4f6a/lib/Backends/CPU/libjit/libjit_matmul.cpp#L226-L244

with something along the lines of:

// As in the existing libjit code, regsA/regsB/pack are template parameters and
// mc/nc/kc/mr/nr, MIN, A/B/C are the blocking constants and indexing macros.
void libjit_matmul_outer(size_t m, size_t n, size_t k, const float *a,
                         size_t lda, const float *b, size_t ldb, float *c,
                         size_t ldc) {
  float packedB[kc * nc] __attribute__((aligned(64)));

  for (size_t p = 0; p < k; p += kc) {
    size_t pb = MIN(k - p, kc);
    for (size_t j = 0; j < n; j += nc) {
      size_t jb = MIN(n - j, nc);
      if (pack) {
        pack_matrix_b<regsB>(jb, pb, &B(p, j), ldb, packedB);
      }
      // Parallelize over the mc blocks of A; packedB is shared (read-only) here.
      #pragma omp parallel for if(<size condition for parallelization>)
      for (size_t i = 0; i < m; i += mc) {
        size_t ib = MIN(m - i, mc);
        libjit_matmul_inner<pack>(ib, jb, pb, &A(i, p), lda, &B(p, j), ldb,
                                  &C(i, j), ldc, packedB);
      }
    }
  }
}

void libjit_matmul_inner_packed(int m, int n, int k, const float *packedA,
                                const float *packedB, float *c, int ldc) {
  for (int j = 0; j < n - nr + 1; j += nr) {
    // Each nr-wide sliver of the packed B panel becomes one OpenMP task.
    #pragma omp task firstprivate(j)
    for (int i = 0; i < m - mr + 1; i += mr) {
      libjit_matmul_zdot<regsA, regsB>(k, &packedA[i * k], mr, &packedB[j * k],
                                       k, &C(i, j), ldc);
    }
  }
}

The bigger parallel picture

Reminder: by convention, C = A * B, with C of shape MxN, A of shape MxK, and B of shape KxN.

  1. Matrix multiplication sizes can vary wildly:

    • You want to classify 10 classes and use a batch size of 64; the result is of shape 64x10 --> partitioning along M and N is not very useful, and the BLIS papers mention that partitioning along K is very hard.
    • You build a language model and work on 100 parallel sequences with a 50,000-word vocabulary; the result shape is 100x50000 --> you have excellent parallelism in the N dimension.

    Also true for convolution

  2. Small matrix multiplications cannot be distributed

    This makes the approach in https://github.com/pytorch/glow/issues/1749#issuecomment-447401410 worthwhile. Matrix multiplications with sizes below 200x200 should not be distributed, as the matrices can fit in the L1/L2 cache of a single core, so parallelism should be found at a higher level.

    Also true for convolution

  3. The very last (classification) layer cannot be distributed

    Often matrix multiplication is only used for the classifier part, the very last layer, so you cannot benefit from layer-level parallelism there. This will be a bottleneck. Note that RNNs also use matrix multiplication internally, though I'm not sure how to do layer-level parallelism on layers that are inherently sequential.

    This does not apply to convolution.

Some benchmarks

I've integrated MKL-DNN into my benchmarks. I also added an ML-sized bench with 224x224 matrices.

Serial 1920x1920

$  OMP_NUM_THREADS=1 ./build/gemm_f32_omp 

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 10 samples in 11.507 seconds
Average time: 1150.122 ms
Stddev  time: 2.026 ms
Min     time: 1146.970 ms
Max     time: 1153.074 ms
Perf:         12.308 GFLOP/s

Simple Tiling
Collected 10 samples in 12.588 seconds
Average time: 1258.245 ms
Stddev  time: 0.501 ms
Min     time: 1257.867 ms
Max     time: 1259.622 ms
Perf:         11.250 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 1.242 seconds
Average time: 123.665 ms
Stddev  time: 0.502 ms
Min     time: 123.330 ms
Max     time: 125.076 ms
Perf:         114.469 GFLOP/s

Laser production implementation
Collected 10 samples in 0.966 seconds
Average time: 96.047 ms
Stddev  time: 0.331 ms
Min     time: 95.445 ms
Max     time: 96.521 ms
Perf:         147.383 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 1.925 seconds
Average time: 192.467 ms
Stddev  time: 8.596 ms
Min     time: 189.652 ms
Max     time: 216.929 ms
Perf:         73.549 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 6.236 seconds
Average time: 623.140 ms
Stddev  time: 0.373 ms
Min     time: 622.809 ms
Max     time: 624.121 ms
Perf:         22.717 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 1.331 seconds
Average time: 132.598 ms
Stddev  time: 4.019 ms
Min     time: 131.271 ms
Max     time: 144.035 ms
Perf:         106.757 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.788 seconds
Average time: 78.311 ms
Stddev  time: 9.011 ms
Min     time: 75.338 ms
Max     time: 103.953 ms
Perf:         180.764 GFLOP/s

Parallel 1920x1920

$  ./build/gemm_f32_omp 

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 10 samples in 11.569 seconds
Average time: 1156.285 ms
Stddev  time: 1.460 ms
Min     time: 1153.774 ms
Max     time: 1158.649 ms
Perf:         12.242 GFLOP/s

Simple Tiling
Collected 10 samples in 19.068 seconds
Average time: 1906.252 ms
Stddev  time: 29.195 ms
Min     time: 1881.130 ms
Max     time: 1979.256 ms
Perf:         7.426 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.082 seconds
Average time: 7.529 ms
Stddev  time: 3.143 ms
Min     time: 6.320 ms
Max     time: 16.402 ms
Perf:         1880.168 GFLOP/s

Laser production implementation
Collected 10 samples in 0.090 seconds
Average time: 8.399 ms
Stddev  time: 3.267 ms
Min     time: 7.239 ms
Max     time: 17.682 ms
Perf:         1685.448 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 2.033 seconds
Average time: 203.252 ms
Stddev  time: 0.732 ms
Min     time: 202.921 ms
Max     time: 205.327 ms
Perf:         69.646 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.367 seconds
Average time: 35.966 ms
Stddev  time: 3.968 ms
Min     time: 34.645 ms
Max     time: 47.257 ms
Perf:         393.585 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.104 seconds
Average time: 9.622 ms
Stddev  time: 5.510 ms
Min     time: 7.771 ms
Max     time: 25.289 ms
Perf:         1471.250 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.088 seconds
Average time: 8.208 ms
Stddev  time: 10.563 ms
Min     time: 4.626 ms
Max     time: 38.223 ms
Perf:         1724.624 GFLOP/s

Serial 224x224


Simple Tiling
Collected 1000 samples in 2.022 seconds
Average time: 2.018 ms
Stddev  time: 0.003 ms
Min     time: 2.015 ms
Max     time: 2.083 ms
Perf:         11.137 GFLOP/s

OpenBLAS benchmark
Collected 1000 samples in 0.158 seconds
Average time: 0.155 ms
Stddev  time: 0.003 ms
Min     time: 0.154 ms
Max     time: 0.262 ms
Perf:         145.093 GFLOP/s

Laser production implementation
Collected 1000 samples in 0.317 seconds
Average time: 0.314 ms
Stddev  time: 0.045 ms
Min     time: 0.241 ms
Max     time: 0.464 ms
Perf:         71.659 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 1000 samples in 2.969 seconds
Average time: 2.969 ms
Stddev  time: 0.019 ms
Min     time: 2.919 ms
Max     time: 3.064 ms
Perf:         7.570 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 1000 samples in 1.028 seconds
Average time: 1.025 ms
Stddev  time: 0.002 ms
Min     time: 1.019 ms
Max     time: 1.046 ms
Perf:         21.926 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 1000 samples in 0.201 seconds
Average time: 0.199 ms
Stddev  time: 0.345 ms
Min     time: 0.187 ms
Max     time: 11.098 ms
Perf:         113.131 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 1000 samples in 0.136 seconds
Average time: 0.133 ms
Stddev  time: 0.839 ms
Min     time: 0.106 ms
Max     time: 26.633 ms
Perf:         168.522 GFLOP/s

 Parallel 224x224

A matrix shape: (M: 224, N: 224)
B matrix shape: (M: 224, N: 224)
Output shape: (M: 224, N: 224)
Required number of operations:    22.479 millions
Required bytes:                    0.401 MB
Arithmetic intensity:             56.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 1000 samples in 1.429 seconds
Average time: 1.426 ms
Stddev  time: 0.161 ms
Min     time: 1.413 ms
Max     time: 4.882 ms
Perf:         15.764 GFLOP/s

Simple Tiling
Collected 1000 samples in 3.020 seconds
Average time: 3.010 ms
Stddev  time: 0.975 ms
Min     time: 2.656 ms
Max     time: 26.373 ms
Perf:         7.469 GFLOP/s

OpenBLAS benchmark
Collected 1000 samples in 0.077 seconds
Average time: 0.064 ms
Stddev  time: 0.241 ms
Min     time: 0.050 ms
Max     time: 5.632 ms
Perf:         350.693 GFLOP/s

Laser production implementation
Collected 1000 samples in 0.241 seconds
Average time: 0.228 ms
Stddev  time: 0.340 ms
Min     time: 0.123 ms
Max     time: 10.756 ms
Perf:         98.712 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 1000 samples in 2.932 seconds
Average time: 2.932 ms
Stddev  time: 0.018 ms
Min     time: 2.921 ms
Max     time: 3.277 ms
Perf:         7.666 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 1000 samples in 0.201 seconds
Average time: 0.187 ms
Stddev  time: 0.050 ms
Min     time: 0.157 ms
Max     time: 0.651 ms
Perf:         120.002 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 1000 samples in 0.058 seconds
Average time: 0.044 ms
Stddev  time: 0.497 ms
Min     time: 0.018 ms
Max     time: 13.616 ms
Perf:         506.395 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 1000 samples in 0.062 seconds
Average time: 0.047 ms
Stddev  time: 0.923 ms
Min     time: 0.012 ms
Max     time: 29.201 ms
Perf:         479.256 GFLOP/s

@nhynes I've been looking into how to integrate Halide and TVM into these benchmarks, but due to the code generation it's quite tricky.

mratsim commented 4 years ago

@nadavrot @gcatron @beicy I've made some advances on my own compiler and research.

Partitioning the graph would indeed provide higher-level parallelism that would be more efficient than multithreading at the loop level, especially on NUMA systems, which often suffer from naive parallel for loops or naive memory allocation.

I did not check your latest research/implementation, but the main issue I foresee is not exposing enough parallelism. Some NN architectures map straightforwardly to multiple cores, like GANs or dual-headed NNs with a CNN head for images and another CNN/RNN head for speech or text. However, for many feed-forward architectures, parallelism at the computation-graph level is only available during backpropagation, when you backpropagate through a function with 2 or more inputs and can distribute 1 gradient per socket/core.

Regarding this, Cpp-Taskflow by @tsung-wei-huang is quite interesting, and according to his paper he is exploring parallelizing the TensorFlow computation graph.

There are feedforward benchmarks (MNIST and parallel-dnn) in the Cpp-Taskflow examples folder, but as shown on my 18-core machine, I can only get a 2x to 3x speedup (https://github.com/cpp-taskflow/cpp-taskflow/issues/97).

So you would need both task-parallelism and data-parallelism to best use CPU cores.

On GPU, though, data parallelism is already implicit/handled by CUDA/cuDNN, and you can use CUDA async to launch kernels that have no dependencies between each other. That might map easily onto the Cpp-Taskflow task model, and unlike on CPU, there is no risk of runtimes interfering with each other or of oversubscription.

For multi-GPU systems, it's probably interesting to explore CUDA Unified Memory. While it used to have a lot of overhead on the Kepler generation, I expect it has improved a lot, and it would avoid having to manage memory migration in the tasking system. Furthermore, Unified Memory is the easiest way to benefit from NVLink, and we wouldn't need NCCL on a single machine to still have efficient multi-GPU computation (and NCCL doesn't work on Windows).

mratsim commented 4 years ago

In the past 6 months I've been doing a deep dive into multithreading runtimes for deep learning compilers. I now have a much deeper understanding of parallelism and how it can best be harnessed to maximize the throughput of deep learning workloads.

I have implemented most of the following in my Weave runtime.

I will also be giving a talk on the subject of parallel runtimes at FOSDEM in Brussels in a month (normally recorded and livestreamed).

Parallel paradigms on CPU

There are, as far as I'm aware, 3 dominant parallel paradigms on CPU. A multithreading runtime for deep learning applications requires all 3.

Data-parallelism

This is the very well-known parallel for loop supported by OpenMP and most runtimes. How it is implemented, however, has tremendous implications for the runtime's flexibility to cope with hardware and workload sizes. This is covered in a later paragraph, with PyTorch-specific examples.

Task parallelism

The spawn/sync from Cilk, or async/await in futures-based control flow. This facilitates implementing recursive divide-and-conquer algorithms for linear algebra kernels, and also parallelizing tree algorithms like Monte Carlo Tree Search for ELF, beam search for NLP, or random forests / gradient-boosted trees.

This is less critical for regular tensor processing (feed-forward NN like ResNet).
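As a minimal illustration of the spawn/sync style (using OpenMP tasks as a stand-in for Cilk spawn; the function and grain size are hypothetical):

#include <cstddef>

// Divide-and-conquer sum: spawn the left half as a task, compute the right
// half on the current thread, then sync before combining.
float parallel_sum(const float *data, std::size_t n) {
  if (n <= 4096) { // grain size: below this, stay serial
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
      acc += data[i];
    return acc;
  }
  float left;
  #pragma omp task shared(left)                        // "spawn"
  left = parallel_sum(data, n / 2);
  float right = parallel_sum(data + n / 2, n - n / 2);
  #pragma omp taskwait                                 // "sync"
  return left + right;
}

// Called once from inside a parallel region, e.g.:
//   #pragma omp parallel
//   #pragma omp single
//   total = parallel_sum(data, n);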

Dataflow parallelism

Also called graph parallelism, pipeline parallelism, stream parallelism, data-driven task parallelism.

The idea is that you schedule delayed tasks that depend on other inputs (OpenMP depend clause, Cpp-Taskflow after/precede, TBB FlowGraph) and may also produce outputs required further down the pipeline/stream/graph, i.e., what was mentioned in https://github.com/pytorch/glow/issues/1749#issuecomment-447401410 and what is being worked on for Glow, AFAIK.

Scheduler challenges

Load-balancing

Dataflow parallelism

As mentioned in my previous post, using a pure dataflow approach misses a lot of parallelism opportunities at the intra-tensor level.

Data parallelism

On another dimension, using a plain OpenMP parallel for will create load-balancing/grain-size issues on generic algorithms. This is heavily detailed in https://github.com/zy97140/omp-benchmark-for-pytorch: PyTorch uses a parallel for loop for both tensor addition and tensor exponentiation, but with OpenMP one is worth parallelizing on most machines while the other is only worth parallelizing on slow machines.
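A common mitigation is to gate the parallelism on the amount of work per element, e.g. with OpenMP's if() clause; the thresholds below are purely illustrative, not PyTorch's actual heuristics:

#include <cmath>
#include <cstddef>

// Cheap, memory-bound op: only fan out for large tensors.
void tensor_add(float *out, const float *a, const float *b, std::size_t n) {
  #pragma omp parallel for if(n > 256 * 1024)
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}

// Expensive, compute-bound op: parallelism pays off much sooner.
void tensor_exp(float *out, const float *a, std::size_t n) {
  #pragma omp parallel for if(n > 8 * 1024)
  for (std::size_t i = 0; i < n; ++i)
    out[i] = std::exp(a[i]);
}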

Task parallelism

Some algorithms may create lots of tasks on one branch of the tree and none on the other; with a scheduler that has no load-balancing scheme for tasks, one CPU will be pegged at 100% while the others are left unused.

The usual approach is work stealing or another greedy scheduler like the parallel depth-first scheduler. A greedy scheduler respects the busy-leaves property (as long as there is pending work, all cores are busy). The Cilk paper proves that a greedy scheduler is at most 2x slower than the optimal schedule.

Also, in particular for recursive tree algorithms, the leaf tasks may be very small compared to the scheduling overhead; to reduce that overhead, the runtime should be able to dynamically package tasks together when they are too small (steal-half strategy).

Note that Intel TBB is depth-restricted to limit unbounded memory growth, and there are proven cases where it doesn't scale because it doesn't maintain the busy-leaves property.

Data dependencies and barriers

A compiler approach can be of tremendous help to guarantee producer-consumer relationships (see Halide/TVM) and can make do with a simpler thread-pool implementation; however, I think a runtime with first-class dataflow would provide lots of benefits as well.

Assuming you solve the grain-size issue of parallel for with something like lazy task splitting, so that you split a task only when there are thieves, you cannot use barriers at all anymore, because there is no guarantee that more than 1 thread will actually enter the barrier, and you will deadlock.

So expressing dependencies based on control flow (barriers, locks, condition variables) wouldn't work. This is problematic for many algorithms, for example optimized matrix multiplication, which requires packing to be done before computing the tile-sized matrix multiplications. Another impacted workload would be recurrent neural networks, which require specific ordering. Dataflow parallelism also enables the wave-front pattern, triggering tasks as soon as their dependencies are ready, which was key to getting high-performance RNNs in cuDNN.
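As a small sketch of what first-class dependencies buy you over a barrier, here is the pack-then-compute ordering expressed with OpenMP depend clauses (pack_panel_b and compute_tile are hypothetical stand-ins for the real packing routine and micro-kernel):

void pack_panel_b(float *packedB, const float *B);           // hypothetical helpers
void compute_tile(float *C, const float *packedB, int tile);

// Assumes we are already inside a `#pragma omp parallel` / `single` region.
void gemm_panel(float *C, const float *B, float *packedB, int nTiles) {
  #pragma omp task depend(out: packedB[0])   // producer: pack the B panel
  pack_panel_b(packedB, B);

  for (int t = 0; t < nTiles; ++t) {
    // Consumers: each tile fires as soon as the packing task completes,
    // without a barrier holding back unrelated work.
    #pragma omp task depend(in: packedB[0]) firstprivate(t)
    compute_tile(C, packedB, t);
  }
}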

Composition

OpenMP parallel for loops are not nestable, which is a huge limiting factor. A batched matrix multiplication would require parallelizing at the outer (batch) level if we have 256 matrices of size 32x32, but at the inner level if we have 3 matrices of size 224x224. This is very hard to allow with an OpenMP-based GEMM.
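Illustrating the dilemma (single_gemm and parallel_gemm are hypothetical serial and internally-parallel kernels, and the batch threshold is made up):

#include <cstddef>
#include <omp.h>

void single_gemm(float *C, const float *A, const float *B, int M, int N, int K);
void parallel_gemm(float *C, const float *A, const float *B, int M, int N, int K);

void batched_gemm(float *C, const float *A, const float *B, int batch, int M,
                  int N, int K) {
  if (batch >= 4 * omp_get_max_threads()) {
    // e.g. 256 matrices of 32x32: plenty of independent GEMMs, so
    // parallelize across the batch and keep each GEMM serial.
    #pragma omp parallel for
    for (int b = 0; b < batch; ++b)
      single_gemm(C + (std::size_t)b * M * N, A + (std::size_t)b * M * K,
                  B + (std::size_t)b * K * N, M, N, K);
  } else {
    // e.g. 3 matrices of 224x224: too few GEMMs, so parallelize inside each one.
    for (int b = 0; b < batch; ++b)
      parallel_gemm(C + (std::size_t)b * M * N, A + (std::size_t)b * M * K,
                    B + (std::size_t)b * K * N, M, N, K);
  }
}

With a composable, task-based runtime, both levels can simply submit tasks to the same scheduler and the dispatch decision above becomes unnecessary.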

NUMA & Distributed computing

I'm leaving those aside as I haven't implemented them yet. However, my runtime, unlike most, was designed assuming a message-passing model instead of a shared-memory model, so synchronization is done through channels instead of atomics.

nickgg commented 4 years ago

Hi @mratsim. Thanks for your data above and for your updates.

At this point the performance of Glow's CPU backend on multithreaded systems is not our highest priority, since PyTorch is already good at that. We're concentrating on making Glow a good framework for implementing backends for AI accelerator devices, such as the Intel NNPI and Habana backends already integrated into Glow.

What's your goal here? Are you interested in improving the multithreaded performance of the Glow CPU backend?

mratsim commented 4 years ago

I'm also building a deep learning compiler but I'm focusing on the runtime first to ensure that it can meet the demands of complex deep learning kernels that require data, task and graph parallelism to maximize throughput.

So I'm sharing what I learned while building this runtime, and the various challenges and use/corner cases that I encountered. I can review PRs or contribute to design; however, I unfortunately don't have the time to contribute actual code.

shrutiramesh1988 commented 4 years ago

Is there any way to integrate Glow with PyTorch so that we can get an additional performance gain in inference throughput? Also, if Glow's image-classifier were implemented as multi-threaded, it would be really helpful to compare the performance of PyTorch vs Glow and evaluate the % performance gain we can obtain with Glow.

jfix71 commented 4 years ago

@shrutiramesh1988 We have torch_glow, which is our intended long-term path from PyTorch to Glow. We can also execute PyTorch models if they are converted to ONNX or Caffe2 -- see this link.