larcenists / larceny

Larceny Scheme implementation
Other
202 stars 32 forks source link

arithmetic slow with (integrate-usual-procedures #f) #288

Open larceny-trac-import opened 11 years ago

larceny-trac-import commented 11 years ago

Reported by: lth on Sat Dec 19 06:00:00 1998 088 v0.40 (981219 / lth) Priority: medium. Category: LIB / performance

When running with (integrate-usual-procedures #f), arithmetic is quite slow, because the definitions of '+' etc are coded in Scheme using rest args. Much better would be if they were written in MAL in the same way as read-char / write-char, special-casing 0, 1, 2, 3, and 4 arguments, or, even better, using case-lambda.

On vega:

   > (default-code)
   #t
   > (define (fib n) (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))
   fib
   > (run-with-stats (lambda () (fib 30)))
   Words allocated: 10770618
   Words reclaimed: 10738790
   Elapsed time...: 5863 ms (User: 5840 ms; System: 0 ms)
   Elapsed GC time: 174 ms (in 41 collections.)
   832040

In fact, the interpreter is only 50% slower:

   vega(197) % larceny -small
   Larceny v0.40 (precise:SunOS5:split) (lth 18-Dec-98 17:59:12)

   > (define (fib n) (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))
   fib
   > (run-with-stats (lambda () (fib 30)))
   Words allocated: 10770624
   Words reclaimed: 10733898
   Elapsed time...: 8376 ms (User: 8200 ms; System: 20 ms)
   Elapsed GC time: 85 ms (in 41 collections.)
   832040

This would be better:

   (define +
     (case-lambda
      (() 0)
      ((x) x)
      ((x y) (+ x y))
      ((x y z) (+ (+ x y) z))
      ((w x y z) (+ (+ (+ w x) y) z))
      ((x . rest) (let loop ((x x) (rest rest))
                    (if (null? rest)
                        x
                        (loop (+ x (car rest)) (cdr rest)))))))

By hacking together MAL code that does the above for + and -, we find that:

   > (fib-benchmark)   ; That's fib(30)

   --------------------------------------------------------
   Standard fib
   Words allocated: 474
   Words reclaimed: 0
   Elapsed time...: 2533 ms (User: 2520 ms; System: 0 ms)
   Elapsed GC time: 0 ms (in 0 collections.)

(`<' of 2 arguments does not cons anything, nor does it go out-of-line, because it takes 2 or more arguments and the no-extra-arguments case is in-lined and only stores () in a register.)

That isn't as good as Chez Scheme -- 1660 ms -- but it's a lot better than it was!

The time improves to 2300 ms with slightly smarter code, and to 2150 ms if we can avoid the stack frame creation, which may be possible with fundamental MAL support (so we don't have to free up a register temporarily).

We can do better by improving the speed of global procedure calls: using a function cell, caching global cells. These in themselves do not appear to be enough, though -- experiments with caching and unsafe-code does not reduce the time much from the above (only by about 100ms).

Fact: the following (illegal) code for + and - results in a run time of 2300 ms:

   (define + (lambda (a b . rest) (if (null? rest) (+ a b) ...)))
   (define - (lambda (a b . rest) (if (null? rest) (- a b) ...)))

Possible MAL support:

 (args-switch ((2 1002) (0 1000) (1 1001) (3 1003) (* 1004)))

which says go to 1002 if 2 args, 1000 if 0, ..., and to 1004 if none of the above. Does not destroy RESULT. Ordering can be used as an optimization hint but has no semantic meaning. It would be reasonable to limit the argument count in the non-* case to some fairly small number.

[Numbers to be taken with a grain of salt.]

A caching strategy for 'known' globals, like that used by the interpreter, helps. Of course, it bloats the code somewhat, but it appears that the code size increase is linear if one is careful about lifting common expressions. Chez Scheme gets good performance apparently w/o doing this, because code that uses 'plus' for '+' runs as fast as code that uses '+', at optimize-level 1.

(See file Lib/Common/arith.mal for sample implementation when fixing this.)

WillClinger commented 9 years ago

The R6RS and R7RS standards have made this even less urgent. By now, I doubt whether anyone cares about Larceny's performance in strict R5RS mode.