Performance regression #29 - rsdoiel / strongtalk

GoogleCodeExporter commented 9 years ago

Several operations appear to take far too long now that the system is built
from the open source.  Two things appear to have around a 10x slowdown:

- reading the bst file
- running Mandelbrot with optimized floats

However, many other things appear to be about the same speed, so something
is mysterious.  Alignment problems?

Original issue reported on code.google.com by David.Gr...@gmail.com on 25 Nov 2006 at 6:33

GoogleCodeExporter commented 9 years ago

i made a build of strongtalk using visual C++ 6 (see build details below).

 loading the image takes around 100 msec with the original binary ("Strongtalk-2.0")
 running the mandelbrot test with optimized (unboxed) floats takes:
  10 sec without recompilation
   1 sec with recompilation

 loading the image takes around 35 msec with my binary
 running the mandelbrot test with optimized (unboxed) floats takes:
  10 sec without recompilation
   1 sec with recompilation

 so i couldn't reproduce the difference you measured with the Mandelbrot, and my
 build is actually faster to load the image. however, using vs2005, i could reproduce
 the slow image loading: it was taking around 10x more time.
 so i suppose the issue is related to (and probably specific to) vs2005.

Build details for visual C++ 6

 starting with a fresh source:
  made the .dsp project for all files using a script
  compiled makeDeps
  used makeDeps like this:

   copy includeDB + includeDB2 includeDB.all
   makeDeps.exe makedeps-platform.txt includeDB.all

  (output of makeDeps in /deps/incls)
  in the .dsp project: 
   added all source directories as include directories
   added PRODUCT,DELTA_COMPILER,MICROSOFT in defines

  visual C++ 6 will complain about new operators for two classes:
   ResourceObj (allocation.hpp)
   PIC (compiledPIC.hpp)

  to avoid warnings, i added dummy operators to make the compiler happy:
    void  operator delete(void* p, int) {} 

  to the two classes.
  excluded process_asm.cpp from the build
  these changes were enough to compile everything
  for linking, all i had to add is the st_asm.lib files
  and it works. 

  strongtalk.exe 948k

  i'm impressed by how easy it was to build on visual C++ 6 ... i usually encounter
  more resistance. thanks ^^
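
A minimal sketch of the dummy-operator workaround described above, assuming the two classes declare a placement `operator new` that takes an extra `int` (which the signature of the dummy `operator delete` suggests). `ArenaObj` and `make_and_read` are hypothetical stand-ins, not the real ResourceObj or PIC declarations, and this sketch frees the memory where the dummy in the comment has an empty body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a class such as ResourceObj or PIC; the real
// Strongtalk declarations may differ.
struct ArenaObj {
  int value;
  explicit ArenaObj(int v) : value(v) {}

  // Placement form of operator new taking an extra int argument.
  void* operator new(std::size_t size, int /*tag*/) {
    void* p = std::malloc(size);
    assert(p != 0);
    return p;
  }

  // Matching placement delete. The compiler calls it only if the
  // constructor throws after the placement new above has succeeded.
  // Declaring it is what keeps the compiler from complaining.
  void operator delete(void* p, int /*tag*/) { std::free(p); }

  // Ordinary delete, used by a plain `delete obj`.
  void operator delete(void* p) { std::free(p); }
};

// Allocates through the placement new, reads the value back, and frees.
int make_and_read(int v) {
  ArenaObj* obj = new (0) ArenaObj(v);
  int result = obj->value;
  delete obj;
  return result;
}
```

An empty body, as in the comment above, is reasonable when the memory is reclaimed by some other mechanism; that is an assumption about the two classes, not something the thread states.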

Original comment by prunedt...@gmail.com on 24 Mar 2007 at 9:30

GoogleCodeExporter commented 9 years ago

Thanks prunedtree,  that helps isolate the problem.  It is a good thing that it
appears to be due to VC++ rather than some unknown source difference between the
original binary from Sun and the open source.

- It would be very helpful if you could also report your machine's CPU specs, and
also numbers from a run of the original binary from Sun, which is what I was
comparing with.  The original is *not* 2.0; it is the binary from any version pre-2.0.

- As for the apparent unboxed floating point regression, I took another look at that.
It may be that C++ floating point got faster, rather than that Strongtalk got
slower.  At the moment, I am getting ~726ms on both Strongtalk versions for unboxed
floats, whereas "Call C" takes 801ms under VS 6, but only 303ms under VS 2005.  So
originally Strongtalk was actually a bit faster at (this) floating point than C++,
which we figured was due to improperly aligned doubles in C++; it looks like they
have fixed this.  (One other possibility is that they expect the stack to be
double-word aligned on entry to the C++ routine.  I am not sure whether Strongtalk
ensures double-word alignment for callouts; if it does not, the slowdown might appear
intermittent, depending on the stack alignment at the time of the callout.)

- So, to summarize: at the moment it appears that there is a real regression in image
loading caused by some VC++ difference, but that the floating point regression may
not be real.
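
One way to test the alignment hypothesis above would be to check the address of a `double` local on entry to the C++ routine. The following is only a sketch: `is_aligned` is a hypothetical helper, not part of the Strongtalk sources, and 8 bytes is assumed to be what "double-word" means here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `p` sits on a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Calling `is_aligned(&some_double_local, 8)` at the top of the callout target and logging the result over several runs would show whether misalignment is intermittent.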

Original comment by David.Gr...@gmail.com on 24 Mar 2007 at 10:08

GoogleCodeExporter commented 9 years ago

My CPU is an AMD Athlon XP 2800+ (2079 MHz)

- With r36 compiled with vc6

image loading : 36 ms

mandelbrot: 
unboxed floats, C : 453 ms
unboxed floats, interpreted: 10966 ms
unboxed floats, compiler on: 1109 ms
boxed floats, interpreted: 19694 ms
boxed floats, compiler on: 6071 ms

- With binary from Strongtalk-1.1.2

image loading: 36 ms

mandelbrot: 
unboxed floats, C : 752 ms
unboxed floats, interpreted: 10754 ms
unboxed floats, compiler on: 1096 ms
boxed floats, interpreted: 18628 ms
boxed floats, compiler on: 5753 ms

Mainly a difference in the C performance, which is not surprising given how widely
the performance of x87 code varies with compilers and compiler settings.

Regarding alignment, IIRC the x86 C ABI enforces (32 bit) word alignment only.

Original comment by prunedt...@gmail.com on 31 Mar 2007 at 1:15

GoogleCodeExporter commented 9 years ago

Btw, the differences in bootstrapping speed are closely linked to the LIBC in use.
The multithreaded LIBC is roughly 4x slower, for instance, from 35 ms to 150 ms...

I think this issue can be considered solved.

Original comment by prunedt...@gmail.com on 30 Apr 2007 at 7:31

GoogleCodeExporter commented 9 years ago

prunedtree:

Why do you think this can be considered solved?  I had thought about this too when I
noticed this bug, and tried all the other library options.  Although there were speed
differences, I didn't find any that got anywhere close to the speed of the original
executable, and I don't see any options in VC++ 2005 for non-multithreaded libraries.

How do you get it to use non-multithreaded libraries?

Original comment by David.Gr...@gmail.com on 1 May 2007 at 12:31

GoogleCodeExporter commented 9 years ago

The single-threaded versions of the libraries have been discontinued, because the
performance of the multi-threaded versions is "close" to that of the single-threaded
versions. One thing that helps speed up the bootstrap is to use the non-locking
functions to read the stream.

When I replaced references to getc with _getc_nolock, bootstrap time improved from
~200ms to 75ms on my machine (Intel Quad Core Q6700 @ 2.66GHz). For reference, the
1.0 release bootstraps in 39ms.

Using the non-debug versions of the CRT reduced bootstrap further to around 65ms. I
think that is probably as close as we are going to get right now.

Unfortunately, _getc_nolock is a Windows-specific function, so we will either have to
put it in os:: or put conditional compilation in bootstrap.cpp.
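
The conditional-compilation option could look roughly like the sketch below. `read_char_unlocked` is a hypothetical name, and the `_MSC_VER >= 1400` check (VC++ 2005) is an assumption about where `_getc_nolock` becomes available; this is not the code that went into bootstrap.cpp.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: reads one character, skipping the per-call stream
// lock where the Microsoft CRT offers a non-locking variant.
inline int read_char_unlocked(std::FILE* stream) {
  assert(stream != NULL);
#if defined(_MSC_VER) && _MSC_VER >= 1400
  return _getc_nolock(stream);  // Microsoft CRT only
#else
  return std::getc(stream);     // portable fallback
#endif
}
```

Keeping the `#if` inside one small function means the rest of the reader stays free of platform checks, whichever of the two locations ends up hosting it.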

Original comment by StephenL...@gmail.com on 12 Aug 2008 at 9:41

GoogleCodeExporter commented 9 years ago

Well, it seems pretty obvious that I/O is the bottleneck, so I think simple buffering
(using fread) will solve the issue, and it's more portable.
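
A minimal sketch of the fread-based buffering suggested here, assuming the reader can then parse the image from memory. `read_all` and the 64 KB chunk size are illustrative choices, not taken from the Strongtalk sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads the rest of `stream` into memory with a few large fread calls
// instead of one library call per character.
std::vector<unsigned char> read_all(std::FILE* stream) {
  assert(stream != NULL);
  std::vector<unsigned char> data;
  unsigned char chunk[64 * 1024];
  std::size_t n;
  while ((n = std::fread(chunk, 1, sizeof chunk, stream)) > 0) {
    data.insert(data.end(), chunk, chunk + n);
  }
  return data;
}
```

This uses only standard C library calls, so it needs no platform-specific code.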

Original comment by prunedt...@gmail.com on 13 Aug 2008 at 5:54

GoogleCodeExporter commented 9 years ago

It's not so much the I/O itself as the locking in the multi-threaded libraries. On
further reading, another, better alternative would be to compile with the locks turned
off by defining _CRT_DISABLE_PERFCRIT_LOCKS. With this defined, and having reverted
_getc_nolock to the portable getc, the system bootstraps in 23ms, which is actually
faster on my system than the original 1.0 release. Clearly there were other locks in
the CRT that were inhibiting bootstrap performance.

Subject to no other problems emerging from the lack of locks in the CRT, which should
be minimal since the system is effectively single-threaded at the moment (with a few
notable exceptions), I think this pretty much resolves this issue.
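
As a sketch of the approach described above: the macro has to be visible before any CRT header is included, so in a real build it would more likely be passed as a compiler define than written in a source file. The `!defined(_DLL)` guard assumes the macro is intended for the statically linked CRT, which should be verified against the CRT documentation. `count_bytes` is a hypothetical example function.

```cpp
// Define before the first CRT include (or pass it as a compiler define).
#if defined(_MSC_VER) && !defined(_DLL) && !defined(_CRT_DISABLE_PERFCRIT_LOCKS)
#define _CRT_DISABLE_PERFCRIT_LOCKS
#endif

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Counts the remaining bytes in `stream` using plain, portable getc.
// The calling code is identical with or without the define above.
size_t count_bytes(FILE* stream) {
  assert(stream != NULL);
  size_t n = 0;
  while (getc(stream) != EOF) {
    ++n;
  }
  return n;
}
```

The sketch calls `getc` unqualified through `<stdio.h>` on the assumption that the CRT may implement the switch by redefining such names as macros.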

As far as Mandelbrot goes, my 10-run average figures are as follows. All are with the
compiler turned on and for 500 iterations.

For the 1.0 release

Optimised floats   220.6ms
Boxed floats       898.7ms
C                  328.9ms

For the current release with the above locking fix

Optimised floats   231.0ms
Boxed floats      1093.9ms
C                   95.9ms

This shows an 18% degradation in the boxed float performance, a 5% degradation in
optimised float performance and a massive 243% improvement in C performance.
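
The percentages above are consistent with comparing speeds (work per unit time) rather than elapsed times, that is, old time divided by new time, minus one. A small sketch of that arithmetic, with `speed_change_percent` as a hypothetical helper name:

```cpp
#include <cassert>

// Percentage change in speed when a run that took `before_ms` now takes
// `after_ms`. Positive means faster, negative means slower.
inline double speed_change_percent(double before_ms, double after_ms) {
  assert(before_ms > 0.0 && after_ms > 0.0);
  return (before_ms / after_ms - 1.0) * 100.0;
}
```

With the figures above, `speed_change_percent(898.7, 1093.9)` is about -17.8 (boxed floats), `speed_change_percent(220.6, 231.0)` is about -4.5 (optimised floats), and `speed_change_percent(328.9, 95.9)` is about 243 (C). Measured as elapsed time instead, the boxed float run takes about 21.7% longer.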

I also performed the same test with the lock fix from above turned off. This didn't
make much difference. For reference, the figures are:

Optimised floats   226.6ms
Boxed floats      1088.9ms
C                   93.0ms

On this evidence the C performance has improved substantially, while Strongtalk has
been pretty much static, or slightly regressed.

Original comment by StephenL...@gmail.com on 13 Aug 2008 at 10:45

GoogleCodeExporter commented 9 years ago

for mandelbrot, it's related to how much your compiler optimizes x87 code, that's all.
For slower boxed floats I'd bet that it's the way the GC is compiled that creates the
difference. (indeed, the strongtalk compiler doesn't care about your C++ compiler at
all. well, as long as it's bug-free that is...)

Original comment by prunedt...@gmail.com on 27 Aug 2008 at 8:53

GoogleCodeExporter commented 9 years ago

Marking this as fixed. There is no degradation in the Mandelbrot performance, just an
improvement in the performance of the equivalent C code when compiled on a modern
compiler.

The image bootstrap issue has been fixed by disabling locking in the C runtime when
compiling the VM.

Original comment by StephenL...@gmail.com on 19 Dec 2009 at 2:19

Changed state: Fixed
