sonots closed this issue 6 years ago
#1 0x00007ffff37eca2d in ndloop_copy_to_buffer (lp=0x555556aa86c0) at narray/ndloop.c:1138
#2 0x00007ffff37ed391 in loop_narray (nf=0x7fffffffad20, lp=0x7fffffffaa70) at narray/ndloop.c:1361
#3 0x00007ffff37ed28a in ndloop_run (vlp=140737488333424) at narray/ndloop.c:1325
#4 0x000055555557aea8 in rb_ensure ()
#5 0x00007ffff37ed964 in na_ndloop_main (nf=0x7fffffffad20, args=93824999531120, opt_ptr=0x7fffffffad50) at narray/ndloop.c:1437
#6 0x00007ffff37edcfb in na_ndloop3 (nf=0x7fffffffad20, ptr=0x7fffffffad50, argc=3) at narray/ndloop.c:1504
#7 0x00007ffff38aae26 in dfloat_gemm (argc=1, argv=0x7ffff7ed83b8, self=93824999518880) at narray/gen/tmpl/gemm.c:151
The copy happens during the gemm computation.
To avoid this copy, we should bypass ndloop and call cuBLAS's strided batched GEMM (cublas<t>gemmStridedBatched) directly: https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/
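For reference, here is a rough pure-Ruby model of what a strided batched GEMM computes. It is only a sketch of the arithmetic: the method name `strided_batched_gemm` and its argument order are made up for illustration, the layout is assumed row-major (cuBLAS itself is column-major and runs on the GPU), and alpha/beta scaling and transpose flags are left out.

```ruby
# Reference model of a strided batched GEMM: C[p] = A[p] * B[p] for each batch p.
# All matrices live in flat, row-major arrays; batch p starts at offset p * stride.
# (Hypothetical helper; not the cuBLAS API and not Cumo's implementation.)
def strided_batched_gemm(m, n, k, a, stride_a, b, stride_b, batch_count)
  stride_c = m * n
  c = Array.new(batch_count * stride_c, 0.0)
  batch_count.times do |p|
    a_off = p * stride_a
    b_off = p * stride_b
    c_off = p * stride_c
    m.times do |i|
      n.times do |j|
        sum = 0.0
        # Dot product of row i of A[p] with column j of B[p].
        k.times { |l| sum += a[a_off + i * k + l] * b[b_off + l * n + j] }
        c[c_off + i * n + j] = sum
      end
    end
  end
  c
end

# Two batches of 2x2 matrices packed back to back in one flat array each.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 2.0]
c = strided_batched_gemm(2, 2, 2, a, 4, b, 4, 2)
```

The point is that every batch is addressed by a base offset plus a fixed stride, so the whole batch can go to the library in a single call with no intermediate buffer per matrix.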
require "cumo/narray"
a = Cumo::SFloat.new(3, 4).seq(0)
b = Cumo::SFloat.new(3, 4).seq(0)
# b.transpose is a non-contiguous 4x3 view of b
a.gemm(b.transpose)
This performs 12 memcpy calls. If the argument is not transposed, the copies do not happen.
Also reported to numo: https://github.com/ruby-numo/numo-narray/issues/95
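A small pure-Ruby model of why a transposed view could cost 12 copies while a plain array does not. This is an assumption about the mechanism, not the actual ndloop code: `View` and `copy_to_buffer` are hypothetical names, and the sketch simply assumes that contiguous data is copied in one shot while a strided view is gathered one element at a time.

```ruby
# Hypothetical model of a 2-D view: flat storage plus shape and element strides.
View = Struct.new(:data, :shape, :strides) do
  # Transposing swaps shape and strides; the underlying data is not moved.
  def transpose
    View.new(data, shape.reverse, strides.reverse)
  end

  # A row-major view is contiguous when its strides equal [ncols, 1].
  def contiguous?
    strides == [shape[1], 1]
  end
end

# Copy a view into a contiguous buffer and count the copy operations.
# Contiguous data needs one bulk copy; a strided view is walked element by element.
def copy_to_buffer(view)
  return [view.data.dup, 1] if view.contiguous?

  rows, cols = view.shape
  buffer = []
  copies = 0
  rows.times do |i|
    cols.times do |j|
      buffer << view.data[i * view.strides[0] + j * view.strides[1]]
      copies += 1
    end
  end
  [buffer, copies]
end

b = View.new((0...12).to_a, [3, 4], [4, 1])          # 3x4 array filled with 0..11
_plain_buffer, plain_copies = copy_to_buffer(b)       # contiguous: one bulk copy
bt_buffer, bt_copies = copy_to_buffer(b.transpose)    # strided: one copy per element
```

In this model the transposed 3x4 array needs 3 * 4 = 12 single-element copies. BLAS-style gemm routines take transpose flags, so in principle the transposed operand could be passed as-is with the flag set instead of being copied first.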
are very slow (amazingly, cudaGetLastError is also slow).
They all come from the buffering mechanism of the numo framework in ndloop.c: https://github.com/sonots/cumo/blob/2cfa97c9d8fe32c2f9c71a3e722b28950af170bf/ext/cumo/narray/ndloop.c#L1137