Ping :)
Original comment by pli...@google.com
on 17 Apr 2012 at 3:16
I sent this response on March 13th. I think you missed
it because for some reason email responses go just to
the leveldb@ mailing list and you might not be subscribed
to that:
---
I am not a huge fan of adding more porting cruft to leveldb.
How many nanoseconds is this particular global initialization adding to
Chrome startup? (You should be able to estimate it by a quick look
at the disassembly and taking a microbenchmark of malloc cost.)
And how does that compare to the cost of just paging the leveldb code
into memory?
Original comment by san...@google.com
on 17 Apr 2012 at 6:09
You know, if you really want to get disk seeks down, a pretty good strategy
would be to precompute all the parts of files (shared libraries or
otherwise) that you *know* need to be touched during startup. Then, before doing
anything else on startup, spawn a thread/process that sends the whole lot to the
kernel in one fell swoop, either as a giant iovec or an mlockall (the latter
assumes a separate address space from your app). Sure it overwhelms the
kernel/storage controllers/devices with a huge set of IO requests at once but
they'll amortize the seek overhead as best as possible and fill the buffer
cache with everything you'll need for the rest of startup, avoiding the
unfortunate sequential access that is all too commonly a part of process
startup. It sure would make attempting to remove individual static initializers
even more of a pointless exercise than it is now.
Original comment by cbsm...@gmail.com
on 18 Apr 2012 at 6:30
I have worked with Philippe on the removal of static initializers for Chrome on
Android. Please allow me to share some insight.
The problem with static initializers is two-fold:
1/ First, they make loading the Chrome native library slower. This is not so
much due to the code that needs to run, but to the page faults it requires.
For each one of them, you would typically have:
- A page fault required to load the corresponding .text section page from disk to memory. Believe it or not, this is still very slow on embedded systems despite the use of flash memory.
- Another page fault required when touching the .bss section's page corresponding to the data that is initialized. This one is really minor in terms of performance, but it still exists (and affects overall RAM usage in a small way).
We started removing all static initializers from Chrome on Android last year.
At the time, there were more than 350 entries in the .init_array section (each
entry corresponding to all the initializers in a single source file), and
profiling showed that running the corresponding code took 800ms on a cold
start, and 80ms on a warm one.
The difference (720ms) was entirely due to the i/o page faults. On average, that's
about 2ms of startup latency per initializer, or 2 million nanoseconds. This is on a device that only uses flash memory, so there are no disk-seek latencies involved. However, this is probably a lower bound, given that we reorder the .text segment (see below).
We have now removed most static initializers from our code base, using various
techniques. It is often possible to replace the statically-initialized data
with real constants; otherwise we just delay the initialization until first use.
We can't use static local C++ initialization, because it is not thread-safe on
Windows, and hence explicitly disabled when building Chrome with GCC's
-fno-threadsafe-statics.
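For illustration, here is a rough sketch of both techniques (the identifiers are invented for this example, not taken from the actual Chrome patches):

    #include <vector>

    // Technique 1: replace a statically-constructed object with true
    // constant data that the compiler can place in .rodata.
    //   Before: static const std::string kGreeting = "hello";  // runs a ctor at load
    static const char kGreeting[] = "hello";                    // no ctor, no .init_array entry

    // Technique 2: delay construction until first use. As noted above,
    // this plain pattern is not thread-safe under -fno-threadsafe-statics,
    // so anything reachable from multiple threads needs explicit
    // synchronization on top of it.
    const std::vector<int>& DefaultSizes() {
      static std::vector<int>* sizes = NULL;
      if (sizes == NULL) {           // assumes a single-threaded first call
        sizes = new std::vector<int>;
        sizes->push_back(4096);
      }
      return *sizes;
    }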
Note also that we re-order the .text segment to pack all functions that
are run at startup onto contiguous pages. This helps save RAM footprint and
reduces the number of i/o page faults that occur during library load and
initialization, but the impact is still significant.
We also experimented with a scheme similar to what cbsmith described, but this
didn't help. It actually made things worse on some devices.
Our work has been developed internally at Google, and we're now open-sourcing
all of it by sending patches to the respective upstream projects.
We can provide a patch that performs the initialization of the object lazily
inside BytewiseComparator().
It looks like port/port.h contains AtomicPointer and Mutex classes that allow
the implementation of thread-safe and fast lazy-initialization, so this would
not require additional dependencies as initially suggested by Philippe.
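Roughly, the lazy version would look like the sketch below (hedged: it assumes the Acquire_Load/Release_Store interface of port::AtomicPointer and a port::Mutex guard, and it glosses over how that guard mutex itself gets initialized):

    // Sketch of a lazily-initialized BytewiseComparator(), inside
    // util/comparator.cc where BytewiseComparatorImpl is defined.
    #include "port/port.h"

    namespace leveldb {

    static port::Mutex init_mu;             // note: itself needs an initializer
    static port::AtomicPointer cached_cmp;  // zero-initialized, starts out NULL

    const Comparator* BytewiseComparator() {
      void* ptr = cached_cmp.Acquire_Load();
      if (ptr == NULL) {
        init_mu.Lock();
        ptr = cached_cmp.NoBarrier_Load();
        if (ptr == NULL) {
          ptr = new BytewiseComparatorImpl;
          cached_cmp.Release_Store(ptr);
        }
        init_mu.Unlock();
      }
      return reinterpret_cast<const Comparator*>(ptr);
    }

    }  // namespace leveldb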
Original comment by di...@google.com
on 20 Apr 2012 at 5:19
> It looks like port/port.h contains AtomicPointer and Mutex classes that allow
> the implementation of thread-safe and fast lazy-initialization, so this would
> not require additional dependencies as initially suggested by Philippe.
That would be fine, but I don't see how that would work without
adding a static initializer for a global Mutex object to control
the lazy initialization.
Original comment by san...@google.com
on 20 Apr 2012 at 5:23
I sent this to the mailing list, but it seems the commenting system is the right
place to put it. I'll add that I've done this kind of thing before myself,
and even if you eschew iovectors and mlockall and go with a series of 1-byte
reads to trigger the paging, you should be able to get those 800ms quite close
to 80ms:
You know, if you really want to get disk seeks down, a pretty good strategy
would be to precompute all the parts of files (shared libraries or
otherwise) that you *know* need to be touched during startup. Then, before doing
anything else on startup, spawn a thread/process that sends the whole lot to the
kernel in one fell swoop, either as a giant iovec or mlockall (the latter
assumes a separate address space from your app). Sure it overwhelms the
kernel/storage controllers/devices with a huge set of IO requests at once but
they'll amortize the seek overhead as best as possible and fill the buffer
cache with everything you'll need for the rest of startup, avoiding the
unfortunate sequential access that is all too commonly a part of process
startup. It sure would make attempting to remove individual static initializers
even more of a pointless exercise than it is now.
Original comment by cbsm...@gmail.com
on 21 Apr 2012 at 2:51
Damn, that's true, the mutex needs to be initialized too.
I can think of two ways to solve this though:
1/ Define a "StaticMutex" class that can be POD-initialized. This is trivial on
Posix systems with PTHREAD_MUTEX_INITIALIZER. For other systems, keep the
original initialization code.
2/ Do not use a mutex at all, but a custom spinlock. That's what Chromium's
LazyInstance<> template does, but this requires an atomic compare-and-swap
function + a thread yield function.
Solution 1/ would definitely be simpler because it's slightly more portable.
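A sketch of option 1/ for a POSIX port (StaticMutex is the proposed name, nothing like it exists in the tree yet):

    #include <pthread.h>

    namespace leveldb {
    namespace port {

    // POD type: no constructor, so a file-scope instance needs no
    // static initializer code, only link-time initialization.
    struct StaticMutex {
      pthread_mutex_t mu_;
      void Lock()   { pthread_mutex_lock(&mu_); }
      void Unlock() { pthread_mutex_unlock(&mu_); }
    };

    #define LEVELDB_STATIC_MUTEX_INIT { PTHREAD_MUTEX_INITIALIZER }

    }  // namespace port
    }  // namespace leveldb

    // Usage: no entry is added to .init_array for this definition.
    static leveldb::port::StaticMutex g_init_mu = LEVELDB_STATIC_MUTEX_INIT;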
What do you think about this?
Original comment by di...@google.com
on 22 Apr 2012 at 9:47
> 1/ Define a "StaticMutex" class that can be POD-initialized. This is
> trivial on Posix systems with PTHREAD_MUTEX_INITIALIZER. For other
> systems, keep the original initialization code.
>
> 2/ Do not use a mutex at all, but a custom spinlock. That's what
> Chromium's LazyInstance<> template does, but this requires an atomic
> compare-and-swap function + a thread yield function.
I am not wild about either of these solutions. They add to
porting overhead for every platform. They increase the dependency
footprint of leveldb and we have spent a lot of effort keeping that
down. Please come up with a solution that does not require widening
the porting interface.
Original comment by san...@google.com
on 23 Apr 2012 at 4:17
(Just came back from vacation, sorry for the late reply).
> I am not wild about either of these solutions. They add to
> porting overhead for every platform. They increase the dependency
> footprint of leveldb and we have spent a lot of effort keeping that
> down. Please come up with a solution that does not require widening
> the porting interface.
I understand, I am not really wild about these myself. We will take a deeper
look at the source code to better understand it, and hopefully find a better
way to solve this. Just curious though, is there any specific reason why the
singleton is heap-allocated? At first sight, it looks like a static const
declaration should be equivalent and would avoid the trivial memory leak
associated with the current code.
Thanks.
Original comment by di...@google.com
on 3 May 2012 at 8:26
For the record, I have uploaded a first attempt at solving the problem here:
http://codereview.appspot.com/6192056/
Note that this doesn't introduce extra platform-specific porting overhead, but
allows for the definition of a StaticAtomicPointer variable that can be
initialized at link time.
Original comment by di...@google.com
on 9 May 2012 at 9:15
> Note that this doesn't introduce extra platform-specific porting
> overhead, but allows for the definition of a StaticAtomicPointer variable
> that can be initialized at link time.
I am not quite sure why you say that this does not introduce extra
platform-specific porting overhead. StaticAtomicPointer has to be
implemented for every port. For example, chromium has its own
port_chromium.h (not shipped as part of the leveldb source code). That will
have to be fixed. Somebody else has implemented port_win.h. That will also
need an implementation of StaticAtomicPointer.
Original comment by san...@google.com
on 10 May 2012 at 11:11
Humm... This patch only addresses the current official leveldb source tree,
which provides neither port_chromium.h nor port_win.h.
If other projects have their own version of the library, I'd be glad to hunt
these down and provide patches (actually, my goal is to have these changes
back-ported to Chromium too). However, I believe that if someone forks a
project, it is their responsibility to keep their code up-to-date with upstream
changes. If this is not the case, it would be useful to have some sort of
comment in the header that warns about this special condition.
Please let me know which other projects are concerned with this change, and
I'll take a look at their sources and try to provide patches.
While we're on the topic of Windows, the current implementation of
AtomicPointer for this platform in port/atomic_pointer.h is sub-optimal for
Windows x86 / MSVC. It uses a MemoryBarrier() macro which implements a full
memory barrier operation (around 15ns on recent hardware). Such a thing is not
needed on x86 and x86_64 where a simple compiler barrier is sufficient. One can
use _ReadWriteBarrier() to achieve this with MSVC.
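For x86/x86_64 with MSVC, the specialization could be as simple as the sketch below (hedged: this is not the existing port/atomic_pointer.h code, just an illustration of the suggestion):

    #if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    #include <intrin.h>
    #pragma intrinsic(_ReadWriteBarrier)

    class AtomicPointer {
     private:
      void* rep_;
     public:
      AtomicPointer() { }
      explicit AtomicPointer(void* p) : rep_(p) { }
      // On x86/x86_64 ordinary loads already have acquire semantics and
      // ordinary stores have release semantics, so only the compiler must
      // be prevented from reordering: a compiler barrier, not an mfence.
      void* Acquire_Load() const {
        void* result = rep_;
        _ReadWriteBarrier();
        return result;
      }
      void Release_Store(void* v) {
        _ReadWriteBarrier();
        rep_ = v;
      }
    };
    #endif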
Regarding the ARM implementation in port/port_android.h, it is also incorrect
and sub-optimal. It is incorrect because the ARMv6 architecture doesn't support
the "dmb" instruction to implement a full memory barrier (instead you have to
use a different co-processor instruction). Also, always using a "dmb"
instruction on single-core ARMv7 devices is both unnecessary and very costly
(e.g. around 210ns on an HTC Nexus One; this is slower than a full
pthread_mutex_lock/pthread_mutex_unlock pair, which takes about 180ns on the
same device). The solution is simply to always call the kernel helper function,
because it provides an implementation that is always optimized for the device it
runs on (e.g. a no-op on single-core devices). Benchmarking shows that the cost
of the call is minimal (around 1ns for no-ops).
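Concretely, the ARM Linux path could call the kernel's "kuser" memory-barrier helper, which lives at a fixed address in the vector page and is patched by the kernel for the running CPU (sketch; 0xffff0fa0 is the documented __kuser_memory_barrier entry point):

    // Sketch: ARM Linux memory barrier via the kernel user helper. The
    // kernel provides the right implementation for the device, e.g. a
    // near no-op on single-core hardware and a real barrier on SMP.
    static inline void ArmLinuxMemoryBarrier() {
      ((void (*)(void)) 0xffff0fa0)();
    }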
I can provide patches for these issues if you think they're useful. Let me know
if there are other "unknowns" that I need to be aware of though.
Original comment by di...@google.com
on 11 May 2012 at 9:41
To be more exact, when I say it doesn't introduce extra platform-specific
porting, I mean that it doesn't try to introduce an atomic compare-and-swap
method to AtomicPointer.
This would be useful to avoid the racy allocations (since we don't have static
mutexes), but implementing these correctly is a lot more complex than
Acquire_Load / Release_Store, since you can't really do that without some
inline assembly (which is typically compiler-specific) and good knowledge of
the target CPU. An alternative is to use compiler- or platform-specific extensions
like __sync_fetch_and_add (for GCC and Clang), or
InterlockedCompareExchangePointer on Windows. But that leaves out alternative
compilers on alternative platforms (e.g. suncc on Solaris, and whatnot).
The StaticAtomicPointer definition is a tiny rewrite of AtomicPointer that is
trivial to adapt to an existing implementation. The only differences between the
two are the absence of constructors/destructors in StaticAtomicPointer, the fact
that its member must be public, and the definition of
LEVELDB_STATIC_ATOMIC_POINTER_INIT (where the default of { NULL } probably
works everywhere).
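As a rough approximation of the idea (hedged: this is not the patch itself, and the real change would go through the port layer's barriers rather than a GCC builtin), the shape is:

    #include <stddef.h>

    // Illustration only. An aggregate with a public member and no
    // constructor can be initialized by the linker, so a file-scope
    // definition adds no entry to .init_array.
    struct StaticAtomicPointer {
      void* rep_;  // public so it can be aggregate-initialized

      void* Acquire_Load() const {
        void* result = rep_;
        __sync_synchronize();  // GCC/Clang full barrier, for the sketch
        return result;
      }
      void Release_Store(void* v) {
        __sync_synchronize();
        rep_ = v;
      }
    };

    #define LEVELDB_STATIC_ATOMIC_POINTER_INIT { NULL }

    static StaticAtomicPointer g_comparator = LEVELDB_STATIC_ATOMIC_POINTER_INIT;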
Original comment by di...@google.com
on 11 May 2012 at 9:51
A side-effect of your change is that everybody who has written a custom port
file will have to do some work. If we are willing to live with that, I would
rather not extend port.h with StaticAtomicPointer, but with functionality
like that provided by pthread_once(). That will provide direct support
for what we need to do (lazy initialization) and does not suffer from the
race that the StaticAtomicPointer-based approach has. Plus, it is trivial to
port, I think. POSIX platforms can just use pthread_once_t. Windows has
similar functionality (InitOnceExecuteOnce).
I will work on something like that.
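For concreteness, on POSIX the once primitive could be as small as this sketch (names are illustrative, not a committed interface):

    // port/port_posix.h (sketch)
    #include <pthread.h>

    namespace leveldb {
    namespace port {

    typedef pthread_once_t OnceType;
    #define LEVELDB_ONCE_INIT PTHREAD_ONCE_INIT

    inline void InitOnce(OnceType* once, void (*initializer)()) {
      pthread_once(once, initializer);
    }

    }  // namespace port
    }  // namespace leveldb

    // util/comparator.cc could then do (sketch):
    //   static port::OnceType once = LEVELDB_ONCE_INIT;
    //   static const Comparator* bytewise = NULL;
    //   static void InitModule() { bytewise = new BytewiseComparatorImpl; }
    //   const Comparator* BytewiseComparator() {
    //     port::InitOnce(&once, InitModule);
    //     return bytewise;
    //   }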
Original comment by san...@google.com
on 11 May 2012 at 4:46
I don't see how using a pthread_once_t here would spare custom ports from
doing any work.
Apart from that, it would *definitely* be a better solution :-).
Note however that InitOnceExecuteOnce is only available since Windows Vista.
Using this technique excludes the code from running on Windows XP, which is
unlikely to be acceptable for some projects (including Chromium).
I don't think there is any Win32 primitive that can be used to perform
thread-safe lazy initialization before Vista. That's why Chromium implements
its own LazyInstance<> template that heavily relies on atomic operations and a
spin-lock (since even CRITICAL_SECTION objects can't be statically initialized).
In a sense this means that port_chromium.h could be adapted to use it instead
of InitOnceExecuteOnce, but there may be other Windows-based projects that don't
have this mechanism available.
Original comment by di...@google.com
on 12 May 2012 at 3:12
> I don't see how using a pthread_once_t here would prevent custom ports
> from doing any work.
It won't. My point was that if we are going to make authors of port header
files do some work, might as well have them implement exactly the
functionality we need instead of adding something that can only be used
in a somewhat hacky/racy way.
> Apart from that, it would *definitely* be a better solution :-).
> Note however that InitOnceExecuteOnce is only available since Windows
> Vista. Using this technique excludes the code from running on Windows XP,
> which isn't likely to be acceptable for some projects (including
> Chromium).
>
> I don't think there is any Win32 primitive that can be used to perform
> thread-safe lazy initialization before Vista. That's why Chromium
> implements its own LazyInstance<> template that heavily relies on atomic
> operations and a spin-lock (since even CRITICAL_SECTION objects can't be
> statically initialized).
>
> In a sense this means that port_chromium.h could be adapted to use it
> instead of InitOnceExecuteOnce, but there may be other Windows-based
> projects that don't have this mechanism available.
They (or the author of port_win32.h) would have to provide an implementation
that internally uses interlocked operations. This is similar to what your
change was doing, but at least hides the ugliness inside a clean interface
and limits it to one port.
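For example, a pre-Vista port could sketch InitOnce with interlocked operations roughly like this (not an actual port_win.h; on weakly-ordered CPUs the fast path would also want an explicit memory barrier):

    #include <windows.h>

    namespace leveldb {
    namespace port {

    typedef volatile LONG OnceType;   // 0 = not run, 1 = running, 2 = done
    #define LEVELDB_ONCE_INIT 0

    inline void InitOnce(OnceType* once, void (*initializer)()) {
      if (*once == 2) return;                              // already done
      if (InterlockedCompareExchange(once, 1, 0) == 0) {   // we won the race
        initializer();
        InterlockedExchange(once, 2);
      } else {
        while (*once != 2) Sleep(0);                       // yield until done
      }
    }

    }  // namespace port
    }  // namespace leveldb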
Original comment by san...@google.com
on 14 May 2012 at 4:36
Original comment by san...@google.com
on 30 May 2012 at 5:11
Original issue reported on code.google.com by
pli...@google.com
on 13 Mar 2012 at 10:14