larcenists / larceny

Larceny Scheme implementation

is bignum performance holding up pi benchmark? #414

Open larceny-trac-import opened 11 years ago

larceny-trac-import commented 11 years ago

Reported by: pnkfelix on Mon Mar 5 13:41:41 2007

An email from Aziz. (We should check whether these results are consistent across platforms, or if it's only a problem on IasnLarceny...)

On Mar 5, 2007, at 12:07 PM, Felix S Klock II wrote:

>
> On Mar 5, 2007, at 3:19 AM, Abdulaziz Ghuloum wrote:
>
>> Greetings Felix,
>>
>> I was going through some of the benchmark files which are included in
>> the larceny source distribution.  I noticed that the pi benchmark  
>> does
>> not terminate on my machine (a MacBook).  I waited a few minutes only
>> before giving up since it terminates in less than 2 seconds under  
>> some
>> implementations.  The command I used is:
>>
>> $ ./bench -s r6rs larceny pi
>>
>> Any ideas?
>
> On the development version of Larceny, on a Linux Intel 2.5 GHz P4,  
> running the pi benchmark as given above takes ~5.4 minutes.
>
> Which implementations take less than 2 seconds?  Chez?  Or . . .?

Actually, I don't have Chez on my machine.  But here are the times for
petite-chez, ikarus, mzscheme, and gambit (on a 2 GHz Core 2 Duo MacBook):

$ ./bench -s r6rs petite-chez pi
Testing pi under Petite-Chez-Scheme-r6rs
     943 ms elapsed cpu time, including 11 ms collecting
     954 ms elapsed real time, including 12 ms collecting

$ ./bench -s r6rs ikarus pi
Testing pi under Ikarus-r6rs
     1659 ms elapsed cpu time
     1661 ms elapsed real time

$ ./bench -s r6rs mzscheme pi
Testing pi under MzScheme-r6rs
cpu time: 2077 real time: 2076 gc time: 1421

$ ./bench -s r6rs gambit pi
Testing pi under Gambit-C-r6rs
     2445 ms real time
     2443 ms cpu time (2423 user, 20 system)

> Presumably something is wrong with our bignum procedures if there  
> is such a huge performance difference between the systems.  (Or  
> perhaps some compiler analysis is allowing more fixnum operations  
> to be used directly in the fast implementation.)

I think the former, given that petite-chez (an interpreter) is the
fastest of the surveyed systems!

Aziz,,,
larceny-trac-import commented 11 years ago

Author: pnkfelix Will provides the following response (via email):


Interpreted systems are often faster than compiled systems
on benchmarks that measure nothing but the speed of library
procedures.  Larceny's bignums are among the slowest of any
system that supports bignums.  On artichoke, for example,
here is how long it takes to compute (fact 10000):

Petite:     .41 seconds
Scheme48:   .47 seconds
Gambit:     .53 seconds
MzScheme:   .98 seconds
Kawa:      1.0  seconds
Larceny:  18.81 seconds
SCM:      blows up
Bigloo:   returns 0
Chicken:  returns +inf.0

If pi uses even larger bignums than (fact 10000), then it
doesn't surprise me a bit that Larceny is 200 times as slow
as some of those other systems.

Will

So I guess this was a known issue in Larceny...
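
For context, the factorial benchmark referenced here is presumably the classic recursive definition, which exercises bignum-by-fixnum multiplication almost exclusively:

```scheme
;; Presumed shape of the (fact n) microbenchmark: each recursive step
;; multiplies a growing bignum by a fixnum, so total time is dominated
;; by bignum-by-fixnum multiplication.
(define (fact n)
  (if (zero? n)
      1
      (* n (fact (- n 1)))))
```

So (fact 10000) performs 10000 multiplications in which one operand stays a fixnum, exactly the operation the timings above isolate.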

larceny-trac-import commented 11 years ago

Author: pnkfelix This is not a priority for us, at least at the moment.

But hey, maybe we could get a mathematically inclined undergrad to spend some time implementing [FFT-based multiplication](http://numbers.computation.free.fr/Constants/Algorithms/fft.html).

larceny-trac-import commented 11 years ago

Author: will I'm reopening this ticket, because this really is a performance bug we should fix.

The (fact 10000) benchmark shows we're 20-40 times as slow as we should be when multiplying a bignum by a fixnum.

The pi benchmark paints a more complex picture. With the usual inputs, the pi benchmark performs the following multiplication and division operations, which take the indicated number of milliseconds (excluding gc time) on a 1.5 GHz SPARC in Chez Scheme 6.1 and in Larceny v0.95:

                                                       Chez    Larceny  ratio
  16048 multiplications of a fixnum by a fixnum          10       2400
1133612 multiplications of a bignum by a fixnum        4710      92000     20
      2 multiplications of bignums with one >= 2^32       0          0
      8 multiplications of bignums with one >= 2^64       0          5
    468 multiplications of bignums with one >= 2^128      0        301
   6200 multiplications of bignums with one >= 2^256    750     176000    235
   6662 divisions                                       730     172000    235

Multiplication of a bignum by a fixnum is slow because Larceny converts the fixnum to a bignum and then performs a bignum multiplication. Multiplication of large bignums is slow (in part) because Larceny uses classical algorithms, which take time proportional to the product of the lengths of the two bignums. Division of bignums is slow because multiplication is slow and because Larceny uses the classical algorithm.
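
To illustrate the missed optimization: a bignum-by-fixnum product needs only a single pass over the bignum's digits, with the fixnum kept unboxed. A minimal sketch, assuming a little-endian vector of radix-2^16 digits (an illustrative representation, not Larceny's actual bignum layout):

```scheme
;; Sketch: multiply a bignum (little-endian vector of radix-2^16
;; "bigits") by a small nonnegative multiplier m, 0 <= m < radix,
;; without first boxing m as a bignum.
(define radix 65536)
(define (bignum*fixnum digits m)
  (let* ((n (vector-length digits))
         (result (make-vector (+ n 1) 0)))
    (let loop ((i 0) (carry 0))
      (if (= i n)
          (begin (vector-set! result n carry) result)
          (let ((t (+ (* (vector-ref digits i) m) carry)))
            (vector-set! result i (remainder t radix))
            (loop (+ i 1) (quotient t radix)))))))
```

Boxing the fixnum first forces allocation of a temporary bignum and a trip through the general multiplication loop; the single-pass version avoids both.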

A quick-and-dirty implementation of Karatsuba's algorithm reduced the multiplication time by about 50 seconds.
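
For reference, a minimal sketch of Karatsuba's algorithm over Scheme's exact integers, assuming the R6RS (rnrs arithmetic bitwise) procedures; a real bignum implementation would split digit arrays in place rather than shifting:

```scheme
;; Karatsuba sketch for nonnegative exact integers.  Splitting with
;; shifts keeps the sketch short; production code splits digit arrays.
(define (karatsuba x y)
  (if (or (< x 65536) (< y 65536))
      (* x y)   ; small case: classical multiplication
      (let* ((n (quotient (max (bitwise-length x) (bitwise-length y)) 2))
             (x1 (bitwise-arithmetic-shift x (- n)))    ; high half of x
             (x0 (- x (bitwise-arithmetic-shift x1 n))) ; low half of x
             (y1 (bitwise-arithmetic-shift y (- n)))
             (y0 (- y (bitwise-arithmetic-shift y1 n)))
             (z2 (karatsuba x1 y1))
             (z0 (karatsuba x0 y0))
             ;; one recursive multiply replaces two cross products
             (z1 (- (karatsuba (+ x1 x0) (+ y1 y0)) z2 z0)))
        (+ (bitwise-arithmetic-shift z2 (* 2 n))
           (bitwise-arithmetic-shift z1 n)
           z0))))
```

Three recursive multiplications of half-size operands give roughly O(n^1.585) running time, versus O(n^2) for the classical algorithm.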

All of these things can be improved during the arithmetic conversion.

larceny-trac-import commented 11 years ago

Author: will Larceny's bignums had been limited to less than 2^24 bits. This limit was increased to 2^32 bits by changeset:5759, changeset:5760, and changeset:5761.

The changeover to R6RS arithmetic semantics is nearly complete, so it is time to look into WhyBignumsAreSlow.

WillClinger commented 9 years ago

Changeset fb7a363c5e21860887d5544d51ac900bc45e8b61 made fixnum multiplication about six times as fast as before, yielding a factor of two improvement on the pi benchmark and a factor of about 2.5 on the (fact 10000) benchmark.

That's from changing the millicode. I'm about to see what happens when the faster code is generated inline.
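
The fast path being inlined is, in spirit, a cheap test that the product cannot overflow the fixnum range, with a fallback to generic arithmetic otherwise. A sketch of that test (the 29-bit magnitude bound and helper names are illustrative, not Larceny's actual code, which emits this check as machine code):

```scheme
;; Conservative inline test: if the operands' bit lengths sum to
;; strictly less than the fixnum magnitude width, the product cannot
;; overflow, so the multiply can stay on the fast path.
(define fixnum-magnitude-bits 29)   ; illustrative; platform-dependent
(define (fx*/generic a b generic-*)
  (if (< (+ (bitwise-length a) (bitwise-length b))
         fixnum-magnitude-bits)
      (* a b)           ; fast path: product provably fits in a fixnum
      (generic-* a b))) ; slow path: may allocate a bignum
```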

WillClinger commented 9 years ago

Changeset c16e8985029b0442e69a26fe687a26eae1f204ce undid the millicode changes while adding inline multiplication of fixnums to the IAssassin code generator.

On the machine I've been using for benchmarks, here is how long it takes to compute (fact 10000):

| System | Time |
| --- | --- |
| Vicare | .01 seconds |
| Gauche | .06 seconds |
| Chicken (R7RS) | .12 seconds |
| Racket | .45 seconds |
| Chibi | .46 seconds |
| Petite Chez | .65 seconds |
| Larceny (now) | .75 seconds |
| Larceny v0.98 | 2.6 seconds |

WillClinger commented 9 years ago

Recent changes through 281e64e93d10baeb0a998d76cef2a316a9abfc8b have improved IAssassin bignum performance by enough to make the pi benchmark run three times as fast as in v0.98.

Larceny's bignum arithmetic remains slow. On the usual benchmark machine, here's how long it takes to compute (fact 100000):

| System | Std | Time in Seconds |
| --- | --- | --- |
| Vicare | R6RS | 1.1 |
| Racket | R6RS | 5.7 |
| Petite Chez | R6RS | 7.6 |
| Gauche | R7RS | 7.8 |
| Chicken | R7RS | 8.1 |
| Chibi | R7RS | 9.9 |
| Larceny (now) | R7RS | 20.1 |
| Larceny v0.98 | R7RS | 335.5 |

Here's how long it takes to compute (fib 100000) with a linear number of additions:

| System | Std | Time in Seconds |
| --- | --- | --- |
| Vicare | R6RS | .04 |
| Racket | R6RS | .2 |
| Petite Chez | R6RS | .2 |
| Gauche | R7RS | .25 |
| Chicken | R7RS | .8 |
| Larceny (now) | R7RS | 2.2 |
| Chibi | R7RS | 4.3 |
| Larceny v0.98 | R7RS | 5.1 |
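
The phrase "with a linear number of additions" presumably refers to the accumulator-style definition, which makes the benchmark a pure test of bignum addition:

```scheme
;; Iterative fib: exactly n additions, on operands whose bit length
;; grows linearly with n, so this measures bignum addition cost.
(define (fib n)
  (let loop ((i n) (a 0) (b 1))
    (if (zero? i)
        a
        (loop (- i 1) b (+ a b)))))
```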

So there's still room for improvement. The Great Primop Cleanup will help some. More radical steps should be deferred until after that cleanup.