guicho271828 / inlined-generic-function

Bringing the speed of Static Dispatch to CLOS. Succeeded by https://github.com/marcoheisig/fast-generic-functions

Benchmark behaviour on defined class is slow. #3

Open bon opened 8 years ago

bon commented 8 years ago

The benchmarks provided are for methods on the built-in Lisp types number, fixnum and double-float. To test the behaviour on defined classes we added a simple boxing class and found that performance degraded when the inlined-generic-functions were inlined. We measured the following numbers of processor cycles for the four methods in playground.lisp, respectively:

     333,033
     331,839
   2,144,814
     585,272

Experiment run on SBCL 1.3.5.24.

See https://github.com/bon/inlined-generic-function/commit/8b6e4d5b10cace47de4343e6dde8455f21dfd579

So my question is: does this indicate that inlined-generic-functions only give a speedup on built-in types and not on defined classes?

guicho271828 commented 8 years ago

it seems normal-plus is running w/o boxing, right?

bon commented 8 years ago

Correct! Fixed in https://github.com/bon/inlined-generic-function/commit/76d1eb6e77ebc5433465b9afb2cdb84b6c4c3e4d

Processor cycles are now

    588,650
    586,253
  1,889,394
    550,351
guicho271828 commented 8 years ago

phew.

guicho271828 commented 8 years ago

I just tested your version. On my machine, the result is still in favor of the inlined version.

Evaluation took:
  0.001 seconds of real time
  0.004000 seconds of total run time (0.004000 user, 0.000000 system)
  400.00% CPU
  638,640 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  608,634 processor cycles
  163,808 bytes consed

Evaluation took:
  0.003 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  4,543,020 processor cycles
  655,184 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  389,169 processor cycles
  163,808 bytes consed

What explains this difference? In your result the inlined generic function performs better, but not by much. I use SBCL 1.3.8 via Roswell on:

$ uname -a
Linux guicho-x61 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
...
model name  : Intel(R) Core(TM)2 Duo CPU     T7100  @ 1.80GHz
...
bon commented 8 years ago

For me the numbers of cycles vary wildly from run to run. Sometimes the igf gets a little quicker, sometimes slower. One example is shown below.

But the more interesting question is why the igf showed a 10x speedup on numbers but hardly any difference on defined classes. Of course I would be very happy to see a 10x speedup on defined classes too!

$ cat /proc/cpuinfo  | ag 'model name' | head -1
model name  : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
$ uname -a
Linux tie 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 x86_64 GNU/Linux
$ ros use sbcl
$ ~/.roswell/impls/x86-64/linux/sbcl/1.3.9/bin/sbcl --version
SBCL 1.3.9
$ ros run
$ rlwrap ros run
* (ql:quickload :inlined-generic-function)

...

* (load "benchmark.lisp")

...

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  424,334 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  362,358 processor cycles
  163,792 bytes consed

Evaluation took:
  0.001 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  2,060,160 processor cycles
  655,200 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.003333 seconds of total run time (0.003333 user, 0.000000 system)
  100.00% CPU
  493,287 processor cycles
  163,792 bytes consed
guicho271828 commented 8 years ago

The 10x speedup is not achieved because of missing type information and the cost of slot access.

  1. The contents slot of box is not typed, so the (+ (contents a) b) part always calls the generic + routine rather than the optimized machine arithmetic. You should check the disassembly.
  2. The accessor contents is a normal generic function, so the slot access itself is slow.
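
A minimal sketch of the two points above. The names untyped-box, typed-box, add-untyped and add-typed are hypothetical and are not taken from the linked commit. The struct is only one way of giving the compiler a typed slot and a non-generic accessor, not necessarily what the benchmark should use.

```lisp
;; Hypothetical stand-in for the boxing class: untyped slot, generic accessor.
(defclass untyped-box ()
  ((contents :initarg :contents :accessor contents)))

;; Hypothetical alternative: a struct with a typed slot.
;; Its accessor TYPED-BOX-CONTENTS is an ordinary function, not a generic one.
(defstruct (typed-box (:constructor make-typed-box (contents)))
  (contents 0 :type fixnum))

(defun add-untyped (a b)
  (declare (fixnum b))
  ;; The compiler knows nothing about the value of (contents a), so it must
  ;; go through the generic accessor and then a generic addition.
  (+ (contents a) b))

(defun add-typed (a b)
  (declare (type typed-box a) (fixnum b))
  ;; Both operands are known to be fixnums here, so the compiler is free to
  ;; emit fixnum arithmetic instead of a generic addition.
  (+ (typed-box-contents a) b))

;; Compare the generated code, for example:
;; (disassemble 'add-untyped)
;; (disassemble 'add-typed)
```

Both functions return the same sum for the same inputs; only the code the compiler can generate for them differs.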

Imagine the dispatch cost is 10X for a normal GF and X for an IGF. The two factors above add overheads A and B to both, giving 10X+A+B vs X+A+B. A 10x speedup is then clearly not achievable, since A+B can be very large compared to X.
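
To make the cost model above concrete, here is a small sketch that computes the resulting ratio. The function name speedup and the sample overhead values are made up for illustration; they are not measurements.

```lisp
(defun speedup (x overhead)
  "Ratio (10X + overhead) / (X + overhead), where OVERHEAD stands for A+B."
  (/ (+ (* 10 x) overhead)
     (+ x overhead)))

(speedup 1 0)   ; => 10       no shared overhead: the full 10x
(speedup 1 9)   ; => 19/10    overhead of 9X: under 2x
(speedup 1 99)  ; => 109/100  overhead of 99X: about 1.09x
```

As the shared overhead grows, the ratio approaches 1 no matter how cheap the dispatch itself becomes.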

guicho271828 commented 8 years ago

I updated my environment and noticed that the examples in playground.lisp have become slow. It looks like the function is being prevented from inlining.

guicho271828 commented 8 years ago

(push :inline-generic-function *features*) still successfully forces the functions to be inlined, but I don't like this solution...
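
A sketch of how that workaround could be placed in a source file. It assumes the feature has to be present before the code that calls the generic functions is compiled. The eval-when wrapper and the use of pushnew instead of push (to avoid duplicate entries on reload) are choices made here, not something stated in this thread.

```lisp
;; Assumption: the library checks *features* when call sites are compiled,
;; so the keyword is added at compile time as well as at load time.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (pushnew :inline-generic-function *features*))
```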