naitoh commented 6 years ago

I compared the broadcasting speed of Numo::NArray with that of numpy, and I have two questions.

Environment

CentOS Linux release 7.3.1611 (Core) x86_64
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
numo-linalg (0.1.2)
numo-narray (0.9.1.2)
Python 2.7.5
numpy 1.14.2

Benchmarked code

$ cat broadcast.rb 
require 'benchmark'
require 'numo/narray'

num_iteration = 10000

Benchmark.bm 20 do |r|
  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace + y" do
    num_iteration.times do
      x.inplace + y
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace + 1.0" do
    num_iteration.times do
      x.inplace + 1.0
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace - y" do
    num_iteration.times do
      x.inplace - y
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace - 1.0" do
    num_iteration.times do
      x.inplace - 1.0
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace * y" do
    num_iteration.times do
      x.inplace * y
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace * 1.0" do
    num_iteration.times do
      x.inplace * 1.0
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace / y" do
    num_iteration.times do
      x.inplace / y
    end
  end

  x = Numo::SFloat.ones([1000,784])
  y = Numo::SFloat.ones([1000,784])
  r.report "x.inplace / 1.0" do
    num_iteration.times do
      x.inplace / 1.0
    end
  end
end

$ cat broadcast.py

from benchmarker import Benchmarker
import numpy as np

## specify the number of loop iterations
with Benchmarker(10000, width=20) as bench:

    @bench(None)                ## empty loop
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            pass

    @bench("x += y")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x += y

    @bench("x += 1.0")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x += 1.0

    @bench("x -= y")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x -= y

    @bench("x -= 1.0")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x -= 1.0

    @bench("x *= y")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x *= y

    @bench("x *= 1.0")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x *= 1.0

    @bench("x /= y")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x /= y

    @bench("x /= 1.0")
    def _(bm):
        x = np.ones([1000,784], dtype=np.float32)
        y = np.ones([1000,784], dtype=np.float32)
        for i in bm:
            x /= 1.0

Result

$ ruby broadcast.rb 
                           user     system      total        real
x.inplace + y          7.010000   0.020000   7.030000 (  7.623628)
x.inplace + 1.0        6.230000   0.020000   6.250000 (  6.787102)
x.inplace - y          6.870000   0.020000   6.890000 (  7.464583)
x.inplace - 1.0        6.250000   0.020000   6.270000 (  6.805960)
x.inplace * y          6.540000   0.020000   6.560000 (  7.113648)
x.inplace * 1.0        6.400000   0.010000   6.410000 (  6.953627)
x.inplace / y         20.450000   0.040000  20.490000 ( 22.241986)
x.inplace / 1.0       19.850000   0.060000  19.910000 ( 21.577150)

$ ruby  -r numo/linalg/use/atlas broadcast.rb 
                           user     system      total        real
x.inplace + y          7.040000   0.010000   7.050000 (  7.691095)
x.inplace + 1.0        6.290000   0.010000   6.300000 (  6.832667)
x.inplace - y          6.910000   0.020000   6.930000 (  7.516793)
x.inplace - 1.0        6.500000   0.010000   6.510000 (  7.057587)
x.inplace * y          6.800000   0.010000   6.810000 (  7.399254)
x.inplace * 1.0        6.640000   0.020000   6.660000 (  7.222405)
x.inplace / y         20.470000   0.050000  20.520000 ( 22.274015)
x.inplace / 1.0       19.930000   0.040000  19.970000 ( 21.674572)

First question: division in Numo::NArray is considerably slower than the other operations (roughly three times slower in the runs above). Why? (This is not limited to broadcasting; the array-by-array case is just as slow.)
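To put a number on that gap, here is a minimal sketch that only re-uses the "real" column of the first `ruby broadcast.rb` run above. The hash name `numo_real` and the choice to compare mean timings are illustrative assumptions, not part of the original benchmark.

```ruby
# Wall-clock ("real") seconds copied from the first `ruby broadcast.rb` run.
# The variable names below are hypothetical and exist only for this sketch.
numo_real = {
  "x.inplace + y"   => 7.623628,
  "x.inplace + 1.0" => 6.787102,
  "x.inplace - y"   => 7.464583,
  "x.inplace - 1.0" => 6.805960,
  "x.inplace * y"   => 7.113648,
  "x.inplace * 1.0" => 6.953627,
  "x.inplace / y"   => 22.241986,
  "x.inplace / 1.0" => 21.577150,
}

# Split the timings into the two division cases and everything else.
division, others = numo_real.partition { |label, _| label.include?("/") }

# Compare the mean time of each group.
division_mean = division.sum { |_, seconds| seconds } / division.size
others_mean   = others.sum { |_, seconds| seconds } / others.size
slowdown      = division_mean / others_mean

puts format("division mean: %.2fs, other ops mean: %.2fs, ratio: %.2f",
            division_mean, others_mean, slowdown)
```

The ratio comes out at about three, for both the array and the scalar operand, which is why the slowdown does not look specific to broadcasting.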

$ python broadcast.py 
## benchmarker:         release 4.0.1 (for python)
## python version:      2.7.5
## python compiler:     GCC 4.8.5 20150623 (Red Hat 4.8.5-16)
## python platform:     Linux-3.10.0-514.6.1.el7.x86_64-x86_64-with-centos-7.3.1611-Core
## python executable:   /usr/bin/python
## cpu model:           Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz  # 2299.943 MHz
## parameters:          loop=10000, cycle=1, extra=0

##                        real    (total    = user    + sys)
(Empty)                 0.0023    0.0100    0.0100    0.0000
x += y                  5.9541    5.4800    5.4500    0.0300
x += 1.0                2.3246    2.1300    2.1300    0.0000
x -= y                  6.0288    5.5500    5.5400    0.0100
x -= 1.0                2.2778    2.0900    2.0900    0.0000
x *= y                  5.9875    5.5100    5.5000    0.0100
x *= 1.0                2.3914    2.1900    2.1800    0.0100
x /= y                  6.4979    5.9800    5.9600    0.0200
x /= 1.0                5.7342    5.2700    5.2600    0.0100

In numpy, division runs at a speed comparable to the other operations.
Second question: in numpy, the operations with a scalar operand (broadcasting) are more than twice as fast as the array-by-array operations, except for division. Can Numo::NArray also speed up broadcasting in the same way?
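The "more than twice as fast" figure can be checked directly from the "real" column of the numpy output above. This is a small sketch of that arithmetic; the hash layout and the `:array` / `:scalar` keys are illustrative assumptions.

```ruby
# "real" seconds from the `python broadcast.py` output, grouped by operator.
# :array is the `x op= y` case, :scalar is the `x op= 1.0` case.
numpy_real = {
  "+" => { array: 5.9541, scalar: 2.3246 },
  "-" => { array: 6.0288, scalar: 2.2778 },
  "*" => { array: 5.9875, scalar: 2.3914 },
  "/" => { array: 6.4979, scalar: 5.7342 },
}

# How many times faster the scalar (broadcast) case is than the array case.
speedups = numpy_real.transform_values { |t| t[:array] / t[:scalar] }

speedups.each do |op, ratio|
  puts format("%s: scalar operand is %.2fx faster", op, ratio)
end
```

Addition, subtraction and multiplication all show a ratio above 2, while division stays close to 1.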

Best regards.

kojix2 commented 6 years ago

This is a very interesting result.

Try2Code commented 6 years ago

Without knowing the exact compile flags used for both numpy and narray, it doesn't make sense to change anything in the narray source code. Differences in optimisation or vectorisation can easily lead to runtime differences like this.
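On the Ruby side, one starting point for finding those flags is the build configuration recorded by the interpreter, which mkmf-based C extensions pick up as defaults. The sketch below only uses the standard `rbconfig` library. Whether numo-narray's own build script adds or overrides flags is not covered here and would have to be checked separately.

```ruby
require 'rbconfig'

# Compiler and flags recorded when this Ruby interpreter was built.
# C extensions built through mkmf start from these values; an extension's
# own extconf.rb may append to them (assumption: not verified for numo-narray).
build_info = %w[CC CFLAGS optflags].map { |key| [key, RbConfig::CONFIG[key]] }.to_h

build_info.each do |key, value|
  puts "#{key}: #{value}"
end
```

Comparing the optimisation level reported here with the one used to build numpy would show whether the two libraries were compiled under comparable settings.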

masa16 commented 6 years ago

I do not know the reason. My guess is that the C compiler failed to generate optimized code, and that Numpy has special treatment for this case. I have not found a way to make it faster.

naitoh commented 6 years ago

Thank you for your reply.

I have created a patch that improves broadcasting performance and submitted it as pull request #94.

Best regards.

naitoh commented 6 years ago

The problems behind both questions have been fixed. Thank you very much.

ruby-numo / numo-narray

Comparison of broadcasting processing speed and numpy. #93
