Serious performance regression for method pow(a, m)

jzakiya commented 2 months ago

The code below shows this issue. On Truffleruby, using pow(a, m) is orders of magnitude slower than on [C|J]Ruby, where it's the fastest.

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if RUBY_ENGINE == 'ruby' and RUBY_VERSION.to_f >= 3.3

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) end
  inv #% m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm; t = Time.now; yield; Time.now - t end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

puts tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

puts tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }

andrykonchin commented 2 months ago

Thank you for reporting the issue!

Could you please share your benchmarking results?

eregon commented 2 months ago

Which version of TruffleRuby are you using?

Related: https://github.com/oracle/truffleruby/issues/1999 We do modular exponentiation now: https://github.com/oracle/truffleruby/blob/51b497f921dacf71441f568ed653a6c5799d2ff0/src/main/ruby/truffleruby/core/integer.rb#L140 But always convert to BigInteger, maybe that's the overhead? If that's the case we could use similar logic to CRuby/JRuby for the small integers cases, but would be good to make sure that's the issue and how much slower it is before importing that code.

jzakiya commented 2 months ago

First code correction.

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv #% m
end

Here are times on my system for Ruby 3.3 w/wo YJIT, JRuby 9.4.6.0 and TruffleRuby 24.0.0, via rvm. Total times in secs, smaller the faster.

Ruby 3.3 w/o YJIT
1.544888113
1.054920322
0.954105064
3.531902812
0.369263064

Ruby 3.3 w/YJIT
0.375549548
0.383313453
0.36227265
3.213001315
0.235034272

JRuby 9.4.6.0
0.9543389999999999
0.99756
0.870005
1.6784890000000001
0.266971

TruffleRuby 24.0.0
0.222029
0.33000100000000004
0.19171500000000002
2.120432
4.006869

eregon commented 2 months ago

With a slightly modified script which:

uses Benchmark.realtime to use a monotonic clock
reports 3 timings for each method (to get an idea if things are warmed up or not yet, and get an idea of the error in the measurements)
uncomments the #% m for modinv2, otherwise the result differs from other methods
cleans up the check for RubyVM::YJIT.enable

require 'benchmark'

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if defined?(RubyVM::YJIT.enable)
puts RUBY_DESCRIPTION

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv % m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm(&b)
  puts
  3.times do
    puts Benchmark.realtime(&b)
  end
end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }

I get:

On Intel:
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
1.347028863034211
1.3474754459457472
1.3452114629908465

1.2710674100089818
1.2754993910202757
1.2733817050466314

1.3521089899586514
1.351813138986472
1.3528226349735633

7.390963806945365
7.495342062029522
7.4823689730255865

0.8994149370118976
0.8945350690046325
0.9004962909966707

truffleruby 24.0.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.3218597859959118
0.24441085202852264
0.21928530297009274

0.20125181798357517
0.18273464200319722
0.1594632159685716

0.1859166320064105
0.16585971100721508
0.13765042601153255

4.247611464001238
4.127755031979177
4.079420825000852

8.350773187004961
8.356869303970598
8.30608620395651

On AMD Ryzen 7 3700X 8-Core Processor
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
0.6688195530005032
0.6663745859996197
0.6647927419999178

0.7047924210000929
0.7027385810006308
0.70747054200001

0.6540912820000813
0.6548124430000826
0.6545112009998775

5.6871033949992125
5.716956408999977
5.703890113999478

0.46341663200018957
0.46297978599977796
0.46605407200058835

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.2904605989997435
0.207193976999406
0.18607605099987268

0.1610029299999951
0.13540141500016034
0.11250887000005605

0.13582420100010495
0.10616527599995607
0.08965392699974473

2.6358709440000894
2.63863470600063
2.630714566000279

5.454730677999578
5.439872395000748
5.455230954999934

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.30870855699959066
0.19738048799990793
0.18506349099970976

0.17732479499954934
0.12824568500036548
0.12031344499973784

0.13241845700031263
0.09785182599989639
0.09070625899948936

2.0937121340002705
1.7953665820004971
1.7865803080003388

2.732251799000551
2.6282954050002445
2.6161076440002944

I noticed YJIT is much slower in my measurements, maybe you are on a different architecture, I guess darwin-aarch64?

But overall it's similar results, TruffleRuby is fastest everywhere except for the last method, modinv4. You mentioned regression, do you mean a regression compared to an older version of TruffleRuby, or you mean regression as "slower than CRuby/JRuby"?

Either way modinv4 amounts to (r between 11 and 211).pow(47, 210) so that's all small integers as inputs, so we should optimize that case too like CRuby/JRuby, it seems the java.math.BigInteger#modPow code doesn't optimize that case well unfortunately.

eregon commented 2 months ago

@andrykonchin Could you port https://github.com/jruby/jruby/blob/9.4.4.0/core/src/main/java/org/jruby/RubyInteger.java#L935-L944 to TruffleRuby? (or from CRuby, but probably more work)

jzakiya commented 2 months ago

My system is a Lenovo Legion Slimjet laptop (2022), AMD Ryzen 9 5900HX, with Linux.

The issue is that pow, which is used in modinv4, is significantly slower in Truffleruby than in [C|J]Ruby. In [C|J]Ruby using pow is the fastest among different implementations, but is slowest in Truffleruby.

eregon commented 2 months ago

Fixed in https://github.com/oracle/truffleruby/pull/3552

Now modinv4 is as fast as modinv1 and modinv2:

On AMD Ryzen 7 3700X 8-Core Processor

truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.35175757099932525
0.21934590799719444
0.19075998200059985

0.20475720999820624
0.13995916100247996
0.11859457800164819

0.16480916699947556
0.10902587799864705
0.09228703799817595

2.012781329001882
1.7254886529990472
1.7228568719983741

0.21183235100033926
0.13329798599806963
0.11889045199859538

Same but running 5 times to ensure enough warmup since this is on jargraal and not libgraal:
truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.36994781300018076
0.221259508998628
0.18763062200014247
0.19057154599795467
0.18439673299872084

0.19098085299992817
0.1328596249986731
0.11979981799959205
0.11720118900120724
0.11687311399873579

0.1548915470011707
0.10623463299998548
0.10084365199872991
0.09816986200166866
0.09051773600003798

2.040468414998031
1.7358531079989916
1.755651747000229
1.6976114799981588
1.7026588670014462

0.22171137699842802
0.13052473000061582
0.11817796299874317
0.11785270799737191
0.11395291900043958

As a bonus the new code is written in Ruby (which in this case is a lot faster than Java's BigInteger#modPow!) and IMO looks a lot cleaner than the equivalents in CRuby / JRuby.

oracle / truffleruby

Serious performance regression for method pow(a, m) #3544