ruby-numo / numo-narray

Ruby/Numo::NArray - New NArray class library
http://ruby-numo.github.io/narray/
BSD 3-Clause "New" or "Revised" License

Ractor support #199

Closed mrkn closed 2 years ago

mrkn commented 2 years ago

This pull request adds Ractor support to numo-narray.

The following changes are made:

I keep Numo::RObject non-shareable because its instances can hold compound objects such as Array and Hash.
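
For readers unfamiliar with Ractor shareability, here is a minimal sketch of why compound contents matter. It uses only core Ruby objects, not Numo::RObject, and does not reflect how this pull request marks NArray classes internally. An object is shareable only when its whole object graph is immutable, so a container of arbitrary Ruby objects cannot be declared shareable up front.

```ruby
Warning[:experimental] = false

# A compound object: an Array holding a Hash and a String, all mutable.
compound = [1, { key: 'value' }, 'text']

# Mutable objects are not shareable between Ractors.
before = Ractor.shareable?(compound)

# make_shareable deep-freezes the whole object graph in place.
Ractor.make_shareable(compound)

after = Ractor.shareable?(compound)
inner_frozen = compound[1].frozen? && compound[1][:key].frozen?

puts "before=#{before} after=#{after} inner_frozen=#{inner_frozen}"
```

The benchmarks later in this thread rely on the same call: `Ractor.make_shareable` is applied to the outer Array so that every Ractor can read the same data without copying it.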

@masa16 Could you please take a look?

orlando-labs commented 2 years ago

Hi. I tried the mrkn:ractor_support branch with my current project and ran into performance issues. I isolated the problem with this simple benchmark:

require 'benchmark'
require 'numo/narray'

Warning[:experimental] = false

puts 'Testing Numo'

data = Ractor.make_shareable Array.new(1_000_000) { Numo::SFloat.new(10).rand(100) }

Benchmark.bm do |bm|
  bm.report('no ractor') do
    4.times { data.each &:mean }
  end

  bm.report('1 ractor') do
    Ractor.new(data) do |arr|
      4.times { arr.each &:mean }
      nil
    end.take
  end

  bm.report('2 ractors') do
    2.times.map do
      Ractor.new(data) do |arr|
        2.times { arr.each &:mean }
        nil
      end
    end.each &:take
  end

  bm.report('4 ractors') do
    4.times.map do
      Ractor.new(data) do |arr|
        arr.each &:mean
        nil
      end
    end.each &:take
  end
end

puts 'Testing core Array'

data = Ractor.make_shareable Array.new(2_000_000) { Array.new(10) { Random.rand } }

Benchmark.bm do |bm| 
  bm.report('no ractor') do
    4.times { data.each { |v| v.sum / v.size.to_f } }
  end

  bm.report('1 ractor') do
    Ractor.new(data) do |arr|
      4.times { arr.each { |v| v.sum / v.size.to_f } }
      nil
    end.take
  end

  bm.report('2 ractors') do
    2.times.map do
      Ractor.new(data) do |arr|
        2.times { arr.each { |v| v.sum / v.size.to_f } }
        nil
      end
    end.each &:take
  end

  bm.report('4 ractors') do
    4.times.map do
      Ractor.new(data) do |arr|
        arr.each { |v| v.sum / v.size.to_f }
        nil
      end
    end.each &:take
  end
end

Running on Ruby 3.1 and CentOS 8 on an otherwise idle 14-core Xeon E5-2680 v4, it produces the following output.

Testing Numo
               user     system      total        real
no ractor  5.259529   0.021958   5.281487 (  5.292886)
1 ractor   6.039938   0.114778   6.154716 (  6.116115)
2 ractors 17.098116   2.135474  19.233590 ( 10.368513)
4 ractors 27.108385   7.667887  34.776272 ( 10.787219)
Testing core Array
               user     system      total        real
no ractor  1.408945   0.000000   1.408945 (  1.411900)
1 ractor   1.742667   0.028465   1.771132 (  1.774470)
2 ractors  1.458583   0.000000   1.458583 (  0.735995)
4 ractors  1.495018   0.000000   1.495018 (  0.385232)

For some reason, performance degrades significantly when multiple Ractors process Numo arrays.
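
To put a number on the degradation, the following sketch computes the wall-clock speedup of each run relative to the run without Ractor, using only the real times reported above. The `speedups` helper is a hypothetical name introduced for illustration. The two data sets differ in size (1_000_000 versus 2_000_000 entries), so only the ratios within each group are comparable.

```ruby
# Real (wall-clock) seconds copied from the benchmark output above.
NUMO_REAL = {
  'no ractor' => 5.292886, '1 ractor' => 6.116115,
  '2 ractors' => 10.368513, '4 ractors' => 10.787219
}.freeze

ARRAY_REAL = {
  'no ractor' => 1.411900, '1 ractor' => 1.774470,
  '2 ractors' => 0.735995, '4 ractors' => 0.385232
}.freeze

# Speedup of each run relative to the run without Ractor.
# Values above 1.0 are faster than the baseline, values below 1.0 slower.
def speedups(real_times)
  baseline = real_times.fetch('no ractor')
  real_times.transform_values { |seconds| (baseline / seconds).round(2) }
end

numo_speedups = speedups(NUMO_REAL)
array_speedups = speedups(ARRAY_REAL)

numo_speedups.each { |label, s| puts format('numo  %-9s %.2fx', label, s) }
array_speedups.each { |label, s| puts format('array %-9s %.2fx', label, s) }
```

With these numbers the core Array case scales with the number of Ractors, while the Numo case falls below 1.0 as soon as a second Ractor is added.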

mrkn commented 2 years ago

@orlando-labs First, note that numo-narray is not always faster than Array. It is designed for operating on large numeric arrays, so testing with 10-element arrays puts numo-narray at a significant disadvantage.

The following benchmark shows that running time scales differently for numo-narray and a normal Array. On my machine, numo-narray is slower than a normal Array when array_len < 1000 and faster when array_len >= 1000.

require 'benchmark'
require 'numo/narray'

array_count = 10000
[10, 100, 1000, 10000].each do |array_len|
  data_numo = Array.new(array_count) { Numo::SFloat.new(array_len).rand(100) }
  data_ary = Array.new(array_count) { Array.new(array_len) { Random.rand } }

  puts
  puts "# array_len = #{array_len}"
  puts

  Benchmark.bm do |bm|
    bm.report('numo') do
      4.times { data_numo.each &:mean }
    end

    bm.report('array') do
      4.times { data_ary.each { |v| v.sum / v.size.to_f } }
    end
  end
end

# array_len = 10

           user     system      total        real
numo   0.041033   0.000000   0.041033 (  0.041040)
array  0.004684   0.000000   0.004684 (  0.004685)

# array_len = 100

           user     system      total        real
numo   0.048274   0.000000   0.048274 (  0.048298)
array  0.014097   0.000000   0.014097 (  0.014101)

# array_len = 1000

           user     system      total        real
numo   0.086663   0.000000   0.086663 (  0.086707)
array  0.108927   0.000000   0.108927 (  0.108994)

# array_len = 10000

           user     system      total        real
numo   0.399808   0.000000   0.399808 (  0.400040)
array  1.062646   0.000000   1.062646 (  1.063160)
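
One way to read these results is as time per mean computation. The sketch below divides each reported real time by the number of calls (array_count * 4 repetitions). The `microseconds_per_call` helper is a hypothetical name, and the interpretation that follows is an inference from these numbers only, not a profile of numo-narray.

```ruby
ARRAY_COUNT = 10_000 # arrays per data set, as in the benchmark above
REPEATS = 4          # each data set is traversed 4 times

# Real (wall-clock) seconds copied from the output above, keyed by array_len.
REAL_SECONDS = {
  10     => { numo: 0.041040, array: 0.004685 },
  100    => { numo: 0.048298, array: 0.014101 },
  1000   => { numo: 0.086707, array: 0.108994 },
  10_000 => { numo: 0.400040, array: 1.063160 }
}.freeze

# Average microseconds spent on one mean computation.
def microseconds_per_call(real_seconds)
  real_seconds / (ARRAY_COUNT * REPEATS) * 1_000_000
end

per_call = REAL_SECONDS.transform_values do |times|
  times.transform_values { |seconds| microseconds_per_call(seconds) }
end

per_call.each do |len, t|
  puts format('array_len=%-6d numo=%.3f us  array=%.3f us', len, t[:numo], t[:array])
end
```

The Numo per-call time barely changes between 10 and 100 elements (roughly 1.0 to 1.2 microseconds), which is consistent with a fixed per-call cost dominating for small arrays, while the core Array time grows roughly in proportion to the length.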

With the following benchmark, which is similar to yours but uses 10,000-element arrays, numo-narray is faster than a normal Array.

require 'benchmark'
require 'numo/narray'

Warning[:experimental] = false

array_len = 10000
array_count = 10000

puts 'Testing Numo'

data = Ractor.make_shareable Array.new(array_count) { Numo::SFloat.new(array_len).rand(100) }

Benchmark.bm do |bm|
  bm.report('no ractor') do
    4.times { data.each &:mean }
  end

  bm.report('1 ractor') do
    Ractor.new(data) do |arr|
      4.times { arr.each &:mean }
      nil
    end.take
  end

  bm.report('2 ractors') do
    2.times.map do
      Ractor.new(data) do |arr|
        2.times { arr.each &:mean }
        nil
      end
    end.each &:take
  end

  bm.report('4 ractors') do
    4.times.map do
      Ractor.new(data) do |arr|
        arr.each &:mean
        nil
      end
    end.each &:take
  end
end

puts 'Testing core Array'

data = Ractor.make_shareable Array.new(2*array_count) { Array.new(array_len) { Random.rand } }

Benchmark.bm do |bm|
  bm.report('no ractor') do
    4.times { data.each { |v| v.sum / v.size.to_f } }
  end

  bm.report('1 ractor') do
    Ractor.new(data) do |arr|
      4.times { arr.each { |v| v.sum / v.size.to_f } }
      nil
    end.take
  end

  bm.report('2 ractors') do
    2.times.map do
      Ractor.new(data) do |arr|
        2.times { arr.each { |v| v.sum / v.size.to_f } }
        nil
      end
    end.each &:take
  end

  bm.report('4 ractors') do
    4.times.map do
      Ractor.new(data) do |arr|
        arr.each { |v| v.sum / v.size.to_f }
        nil
      end
    end.each &:take
  end
end

ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-linux]
Testing Numo
               user     system      total        real
no ractor  0.326137   0.000217   0.326354 (  0.326562)
1 ractor   0.355712   0.000000   0.355712 (  0.355770)
2 ractors  0.383952   0.000106   0.384058 (  0.201436)
4 ractors  0.396320   0.000000   0.396320 (  0.106362)
Testing core Array
               user     system      total        real
no ractor  2.058136   0.000000   2.058136 (  2.059276)
1 ractor   2.062877   0.000000   2.062877 (  2.063872)
2 ractors  2.098526   0.000000   2.098526 (  1.052108)
4 ractors  2.203544   0.000006   2.203550 (  0.560517)

orlando-labs commented 2 years ago

Hi, @mrkn. Thanks for the response, I appreciate it. It is still unclear why my example leads to growing processing times: 4 Ractors, each with a quarter of the load, took about 1.8 times longer than 1 Ractor with the full load (10.79 s versus 6.12 s real time). With your 10k-element arrays, I see the expected speedup.

kojix2 commented 2 years ago

This is an article that ko1, a developer of Ractor, posted on his company Cookpad's blog about a year ago. https://techlife.cookpad.com/entry/2020/12/26/131858 [Japanese]

Here, he says that using Ractor can be slower than not using it.

In the previous example, we were able to achieve a speedup of almost 4 times. However, this is a best case, or champion data, example that works well.

He writes that slow referencing of constants is one of the reasons why Ractor is slow.

  • The inline cache used for constant lookups was not thread-safe, so the cache was disabled except in the main Ractor.
  • The constant table is shared among Ractors, so it is locked, and lookups become very slow when the lock is contended.
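
The two points above can be probed with a small experiment. This is a sketch, not a reproduction of the article's measurements: it times a loop that reads a constant on every iteration inside a non-main Ractor against the same loop reading a block-local variable. `LIMIT`, `ractor_result`, `sum_with_constant`, and `sum_with_local` are hypothetical names. Whether any gap appears depends on the Ruby version, since the article says the problem was going to be fixed.

```ruby
require 'benchmark'

Warning[:experimental] = false

LIMIT = 100 # an Integer constant, shareable and readable from any Ractor

# Newer Rubies replace Ractor#take with Ractor#value, so support both.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Reads the constant on every iteration inside a non-main Ractor.
def sum_with_constant(iterations)
  ractor = Ractor.new(iterations) do |n|
    total = 0
    n.times { total += LIMIT }
    total
  end
  ractor_result(ractor)
end

# Passes the value in once and reads a block-local variable instead.
def sum_with_local(iterations)
  ractor = Ractor.new(iterations, LIMIT) do |n, limit|
    total = 0
    n.times { total += limit }
    total
  end
  ractor_result(ractor)
end

iterations = 1_000_000
constant_time = Benchmark.realtime { sum_with_constant(iterations) }
local_time = Benchmark.realtime { sum_with_local(iterations) }

puts format('constant lookup: %.4f s', constant_time)
puts format('local variable:  %.4f s', local_time)
```

Running several such Ractors at once would additionally exercise the lock contention described in the second point.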

He has written that he will fix this problem, so constant referencing may not be slow now.

As multi-core CPUs become commonplace, the need to describe parallel computation is increasing. This phrase has been a standard preamble since I was doing research at university more than 10 years ago. In fact, I don't think anyone would disagree that parallel computing is essential for writing high-performance software.

In order to perform parallel computation, the program must support parallel computation. In order to do so, parallel programming is required. Many programming languages already have a mechanism for parallel computing.

numo-narray is probably one of the areas in Ruby where Ractor will be used the most in the future. I think it is very important for the future of Ruby that Ractor is available in numo-narray.

ping: If you don't mind, @ko1, could you take a look at this for us?

orlando-labs commented 2 years ago

Hi, @mrkn, @kojix2. I have been using the Ractor-compatible branch for 2 months and have seen no issues except the performance ones, which are not related to numo-narray.

seoanezonjic commented 2 years ago

Hi all, I'm very interested in this feature. Has it been merged into the main branch, or is further review needed before this code can be used in production? Thank you very much. Pedro Seoane

mrkn commented 2 years ago

I'll contact the owner.

darnellbrawner commented 1 year ago

I'm seeing similar slowdowns when using more than 2 Ractors with Numo. In https://github.com/PlummersSoftwareLLC/Primes, solution_2 has three variants: single-threaded with Numo, multithreaded with Numo using Ractor, and multithreaded without Numo.