Ping :)
Original comment by pli...@google.com
on 17 Apr 2012 at 3:16
I sent this response on March 13th. I think you missed
it because for some reason email responses go just to
the leveldb@ mailing list and you might not be subscribed
to that:
---
I am not a huge fan of adding more porting cruft to leveldb.
How many nanoseconds is this particular global initialization adding to
Chrome startup? (You should be able to estimate it by a quick look
at the disassembly and taking a microbenchmark of malloc cost.)
And how does that compare to the cost of just paging the leveldb code
into memory?
Original comment by san...@google.com
on 17 Apr 2012 at 6:09
You know, if you really want to get disk seeks down, a pretty good strategy
would be to precompute all the parts of files (shared libraries or
otherwise) that you *know* need to be touched during startup. Then, before doing
anything else on startup, spawn a thread/process that sends the whole lot to the
kernel in one fell swoop, either as a giant iovec or an mlockall (the latter
assumes a separate address space from your app). Sure it overwhelms the
kernel/storage controllers/devices with a huge set of IO requests at once but
they'll amortize the seek overhead as best as possible and fill the buffer
cache with everything you'll need for the rest of startup, avoiding the
unfortunate sequential access that is all too commonly a part of process
startup. It sure would make attempting to remove individual static initializers
even more of a pointless exercise than it is now.
Original comment by cbsm...@gmail.com
on 18 Apr 2012 at 6:30
I have worked with Philippe on the removal of static initializers for Chrome on
Android. Please allow me to share some insight.
The problem with static initializers is two-fold:
1/ First, they make loading the Chrome native library slower. This is not so
much due to the code that needs to run, but to the page faults it requires.
For each one of them, you would typically have:
- A page fault required to load the corresponding .text section page from disk to memory. Believe it or not, this is still very slow on embedded systems despite the use of flash memory.
- Another page fault required when touching the .bss section's page corresponding to the data that is initialized. This one is really minor in terms of performance, but it still exists (and affects overall RAM usage in a small way).
We started removing all static initializers from Chrome on Android last year.
At the time, there were more than 350 entries in the .init_array section (each
entry corresponding to all the initializers in a single source file), and
profiling showed that running the corresponding code took 800ms on a cold
start, and 80ms on a warm one.
The difference (720ms) was entirely due to the i/o page faults. On average, that's
about 2ms of startup latency per initializer, or 2 million nanoseconds. This is on a device that only uses flash memory, so there are no disk-seek latencies involved. However, this is probably a lower bound, given that we reorder the .text segment (see below).
We have now removed most static initializers from our code base, using various
techniques. It is often possible to replace the statically-initialized data
with real constants; otherwise we just delay the initialization until first use.
We can't use static local C++ initialization, because it is not thread-safe on
Windows, and hence explicitly disabled when building Chrome with GCC's
-fno-threadsafe-statics.
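For illustration, here is a rough sketch of both techniques (the identifiers are invented for this example, not taken from the actual Chrome patches):

    #include <vector>

    // Technique 1: replace a statically-constructed object with true
    // constant data that the compiler can place in .rodata.
    //   Before: static const std::string kGreeting = "hello";  // runs a ctor at load
    static const char kGreeting[] = "hello";                    // no ctor, no .init_array entry

    // Technique 2: delay construction until first use. As noted above,
    // this plain pattern is not thread-safe under -fno-threadsafe-statics,
    // so anything reachable from multiple threads needs explicit
    // synchronization on top of it.
    const std::vector<int>& DefaultSizes() {
      static std::vector<int>* sizes = NULL;
      if (sizes == NULL) {           // assumes a single-threaded first call
        sizes = new std::vector<int>;
        sizes->push_back(4096);
      }
      return *sizes;
    }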
Note also that we re-order the .text segment to pack all functions that
are run at startup onto contiguous pages. This helps save RAM footprint and
reduces the number of i/o page faults that occur during library load and
initialization, but the impact is still significant.
We also experimented with a scheme similar to what cbsmith described, but this
didn't help. It actually made things worse on some devices.
Our work has been developed internally at Google, and we're now open-sourcing
all of it by sending patches to the respective upstream projects.
We can provide a patch that performs the initialization of the object lazily
inside BytewiseComparator().
It looks like port/port.h contains AtomicPointer and Mutex classes that allow
the implementation of thread-safe and fast lazy-initialization, so this would
not require additional dependencies as initially suggested by Philippe.
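Roughly, the lazy version would look like the sketch below (hedged: it assumes the Acquire_Load/Release_Store interface of port::AtomicPointer and a port::Mutex guard, and it glosses over how that guard mutex itself gets initialized):

    // Sketch of a lazily-initialized BytewiseComparator(), inside
    // util/comparator.cc where BytewiseComparatorImpl is defined.
    #include "port/port.h"

    namespace leveldb {

    static port::Mutex init_mu;             // note: itself needs an initializer
    static port::AtomicPointer cached_cmp;  // zero-initialized, starts out NULL

    const Comparator* BytewiseComparator() {
      void* ptr = cached_cmp.Acquire_Load();
      if (ptr == NULL) {
        init_mu.Lock();
        ptr = cached_cmp.NoBarrier_Load();
        if (ptr == NULL) {
          ptr = new BytewiseComparatorImpl;
          cached_cmp.Release_Store(ptr);
        }
        init_mu.Unlock();
      }
      return reinterpret_cast<const Comparator*>(ptr);
    }

    }  // namespace leveldb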
Original comment by di...@google.com
on 20 Apr 2012 at 5:19
> It looks like port/port.h contains AtomicPointer and Mutex classes that allow
> the implementation of thread-safe and fast lazy-initialization, so this would
> not require additional dependencies as initially suggested by Philippe.
That would be fine, but I don't see how that would work without
adding a static initializer for a global Mutex object to control
the lazy initialization.
Original comment by san...@google.com
on 20 Apr 2012 at 5:23
I sent this to the mailing list, but it seems the commenting system is the right
place to put it. I'll add that I've done this kind of thing before myself,
and even if you eschew iovectors and mlockall and go with a series of 1-byte
reads to trigger the paging, you should be able to get those 800ms quite close
to 80ms:
You know, if you really want to get disk seeks down, a pretty good strategy
would be to precompute all the parts of files (shared libraries or
otherwise) that you *know* need to be touched during startup. Then, before doing
anything else on startup, spawn a thread/process that sends the whole lot to the
kernel in one fell swoop, either as a giant iovec or mlockall (the latter
assumes a separate address space from your app). Sure it overwhelms the
kernel/storage controllers/devices with a huge set of IO requests at once but
they'll amortize the seek overhead as best as possible and fill the buffer
cache with everything you'll need for the rest of startup, avoiding the
unfortunate sequential access that is all too commonly a part of process
startup. It sure would make attempting to remove individual static initializers
even more of a pointless exercise than it is now.
Original comment by cbsm...@gmail.com
on 21 Apr 2012 at 2:51
Damn, that's true, the mutex needs to be initialized too.
I can think of two ways to solve this though:
1/ Define a "StaticMutex" class that can be POD-initialized. This is trivial on
Posix systems with PTHREAD_MUTEX_INITIALIZER. For other systems, keep the
original initialization code.
2/ Do not use a mutex at all, but a custom spinlock. That's what Chromium's
LazyInstance<> template does, but this requires an atomic compare-and-swap
function + a thread yield function.
Solution 1/ would definitely be simpler because it's slightly more portable.
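A sketch of option 1/ for a POSIX port (StaticMutex is the proposed name, nothing like it exists in the tree yet):

    #include <pthread.h>

    namespace leveldb {
    namespace port {

    // POD type: no constructor, so a file-scope instance needs no
    // static initializer code, only link-time initialization.
    struct StaticMutex {
      pthread_mutex_t mu_;
      void Lock()   { pthread_mutex_lock(&mu_); }
      void Unlock() { pthread_mutex_unlock(&mu_); }
    };

    #define LEVELDB_STATIC_MUTEX_INIT { PTHREAD_MUTEX_INITIALIZER }

    }  // namespace port
    }  // namespace leveldb

    // Usage: no entry is added to .init_array for this definition.
    static leveldb::port::StaticMutex g_init_mu = LEVELDB_STATIC_MUTEX_INIT;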
What do you think about this?
Original comment by di...@google.com
on 22 Apr 2012 at 9:47
> 1/ Define a "StaticMutex" class that can be POD-initialized. This is
> trivial on Posix systems with PTHREAD_MUTEX_INITIALIZER. For other
> systems, keep the original initialization code.
>
> 2/ Do not use a mutex at all, but a custom spinlock. That's what
> Chromium's LazyInstance<> template does, but this requires an atomic
> compare-and-swap function + a thread yield function.
I am not wild about either of these solutions. They add to
porting overhead for every platform. They increase the dependency
footprint of leveldb and we have spent a lot of effort keeping that
down. Please come up with a solution that does not require widening
the porting interface.
Original comment by san...@google.com
on 23 Apr 2012 at 4:17
(Just came back from vacation, sorry for the late reply).
> I am not wild about either of these solutions. They add to
> porting overhead for every platform. They increase the dependency
> footprint of leveldb and we have spent a lot of effort keeping that
> down. Please come up with a solution that does not require widening
> the porting interface.
I understand, I am not really wild about these myself. We will take a deeper
look at the source code to better understand it, and hopefully find a better
way to solve this. Just curious though, is there any specific reason why the
singleton is heap-allocated? At first sight, it looks like a static const
declaration should be equivalent and would avoid the trivial memory leak
associated with the current code.
Thanks.
Original comment by di...@google.com
on 3 May 2012 at 8:26
For the record, I have uploaded a first attempt at solving the problem here:
http://codereview.appspot.com/6192056/
Note that this doesn't introduce extra platform-specific porting overhead, but
allows for the definition of a StaticAtomicPointer variable that can be
initialized at link time.
Original comment by di...@google.com
on 9 May 2012 at 9:15
> Note that this doesn't introduce extra platform-specific porting
> overhead, but allows for the definition of a StaticAtomicPointer variable
> that can be initialized at link time.
I am not quite sure why you say that this does not introduce extra
platform-specific porting overhead. StaticAtomicPointer has to be
implemented for every port. For example, chromium has its own
port_chromium.h (not shipped as part of the leveldb source code). That will
have to be fixed. Somebody else has implemented port_win.h. That will also
need an implementation of StaticAtomicPointer.
Original comment by san...@google.com
on 10 May 2012 at 11:11
Humm... This patch only addresses the current official leveldb source tree,
which provides neither port_chromium.h nor port_win.h.
If other projects have their own version of the library, I'd be glad to hunt
these down and provide patches (actually, my goal is to have these changes
back-ported to Chromium too). However, I believe that if someone forks a
project, it is their responsibility to keep their code up-to-date with upstream
changes. If this is not the case, it would be useful to have some sort of
comment in the header that warns about this special condition.
Please let me know which other projects are concerned with this change, and
I'll take a look at their sources and try to provide patches.
While we're on the topic of Windows, the current implementation of
AtomicPointer for this platform in port/atomic_pointer.h is sub-optimal for
Windows x86 / MSVC. It uses a MemoryBarrier() macro which implements a full
memory barrier operation (around 15ns on recent hardware). Such a thing is not
needed on x86 and x86_64 where a simple compiler barrier is sufficient. One can
use _ReadWriteBarrier() to achieve this with MSVC.
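For x86/x86_64 with MSVC, the specialization could be as simple as the sketch below (hedged: this is not the existing port/atomic_pointer.h code, just an illustration of the suggestion):

    #if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    #include <intrin.h>
    #pragma intrinsic(_ReadWriteBarrier)

    class AtomicPointer {
     private:
      void* rep_;
     public:
      AtomicPointer() { }
      explicit AtomicPointer(void* p) : rep_(p) { }
      // On x86/x86_64 ordinary loads already have acquire semantics and
      // ordinary stores have release semantics, so only the compiler must
      // be prevented from reordering: a compiler barrier, not an mfence.
      void* Acquire_Load() const {
        void* result = rep_;
        _ReadWriteBarrier();
        return result;
      }
      void Release_Store(void* v) {
        _ReadWriteBarrier();
        rep_ = v;
      }
    };
    #endif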
Regarding the ARM implementation in port/port_android.h, it is also incorrect
and sub-optimal. It is incorrect because the ARMv6 architecture doesn't support
the "dmb" instruction to implement a full memory barrier (instead you have to
use a different co-processor instruction). Also, always using a "dmb"
instruction on single-core ARMv7 devices is both unnecessary and very costly
(e.g. around 210ns on an HTC Nexus One; this is slower than a full
pthread_mutex_lock/pthread_mutex_unlock pair, which takes about 180ns on the
same device). The solution is simply to always call the kernel helper function,
because it provides an implementation that is always optimized for the device it
runs on (e.g. a no-op on single-core devices). Benchmarking shows that the cost
of the call is minimal (around 1ns for no-ops).
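Concretely, the ARM Linux path could call the kernel's "kuser" memory-barrier helper, which lives at a fixed address in the vector page and is patched by the kernel for the running CPU (sketch; 0xffff0fa0 is the documented __kuser_memory_barrier entry point):

    // Sketch: ARM Linux memory barrier via the kernel user helper. The
    // kernel provides the right implementation for the device, e.g. a
    // near no-op on single-core hardware and a real barrier on SMP.
    static inline void ArmLinuxMemoryBarrier() {
      ((void (*)(void)) 0xffff0fa0)();
    }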
I can provide patches for these issues if you think they're useful. Let me know
if there are other "unknowns" that I need to be aware of though.
Original comment by di...@google.com
on 11 May 2012 at 9:41
To be more exact, when I say it doesn't introduce extra platform-specific
porting, I mean that it doesn't try to introduce an atomic compare-and-swap
method to AtomicPointer.
This would be useful to avoid the racy allocations (since we don't have static
mutexes), but implementing these correctly is a lot more complex than
Acquire_Load / Release_Store, since you can't really do that without some
inline assembly (which is typically compiler-specific) and good knowledge of
the target CPU. An alternative is to use compiler- or platform-specific extensions
like __sync_fetch_and_add (for GCC and Clang), or
InterlockedCompareExchangePointer on Windows. But that leaves out alternative
compilers on alternative platforms (e.g. suncc on Solaris, and whatnot).
The StaticAtomicPointer definition is a tiny rewrite of AtomicPointer that is
trivial to adapt to an existing implementation. The only differences between the
two are the absence of constructors/destructors in StaticAtomicPointer, the fact
that its member must be public, and the definition of
LEVELDB_STATIC_ATOMIC_POINTER_INIT (where the default of { NULL } probably
works everywhere).
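As a rough approximation of the idea (hedged: this is not the patch itself, and the real change would go through the port layer's barriers rather than a GCC builtin), the shape is:

    #include <stddef.h>

    // Illustration only. An aggregate with a public member and no
    // constructor can be initialized by the linker, so a file-scope
    // definition adds no entry to .init_array.
    struct StaticAtomicPointer {
      void* rep_;  // public so it can be aggregate-initialized

      void* Acquire_Load() const {
        void* result = rep_;
        __sync_synchronize();  // GCC/Clang full barrier, for the sketch
        return result;
      }
      void Release_Store(void* v) {
        __sync_synchronize();
        rep_ = v;
      }
    };

    #define LEVELDB_STATIC_ATOMIC_POINTER_INIT { NULL }

    static StaticAtomicPointer g_comparator = LEVELDB_STATIC_ATOMIC_POINTER_INIT;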
Original comment by di...@google.com
on 11 May 2012 at 9:51
A side-effect of your change is that everybody who has written a custom port
file will have to do some work. If we are willing to live with that, I would
rather not extend port.h with StaticAtomicPointer, but with functionality
like that provided by pthread_once(). That will provide direct support
for what we need to do (lazy initialization) and does not suffer from the
race that the StaticAtomicPointer-based approach has. Plus, it is trivial to
port, I think. POSIX platforms can just use pthread_once_t. Windows has
similar functionality (InitOnceExecuteOnce).
I will work on something like that.
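For concreteness, on POSIX the once primitive could be as small as this sketch (names are illustrative, not a committed interface):

    // port/port_posix.h (sketch)
    #include <pthread.h>

    namespace leveldb {
    namespace port {

    typedef pthread_once_t OnceType;
    #define LEVELDB_ONCE_INIT PTHREAD_ONCE_INIT

    inline void InitOnce(OnceType* once, void (*initializer)()) {
      pthread_once(once, initializer);
    }

    }  // namespace port
    }  // namespace leveldb

    // util/comparator.cc could then do (sketch):
    //   static port::OnceType once = LEVELDB_ONCE_INIT;
    //   static const Comparator* bytewise = NULL;
    //   static void InitModule() { bytewise = new BytewiseComparatorImpl; }
    //   const Comparator* BytewiseComparator() {
    //     port::InitOnce(&once, InitModule);
    //     return bytewise;
    //   }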
Original comment by san...@google.com
on 11 May 2012 at 4:46
I don't see how using a pthread_once_t here would spare custom ports from
doing any work.
Apart from that, it would *definitely* be a better solution :-).
Note however that InitOnceExecuteOnce is only available since Windows Vista.
Using this technique excludes the code from running on Windows XP, which is
unlikely to be acceptable for some projects (including Chromium).
I don't think there is any Win32 primitive that can be used to perform
thread-safe lazy initialization before Vista. That's why Chromium implements
its own LazyInstance<> template that heavily relies on atomic operations and a
spin-lock (since even CRITICAL_SECTION objects can't be statically initialized).
In a sense this means that port_chromium.h could be adapted to use it instead
of InitOnceExecuteOnce, but there may be other Windows-based projects that don't
have this mechanism available.
Original comment by di...@google.com
on 12 May 2012 at 3:12
> I don't see how using a pthread_once_t here would prevent custom ports
> from doing any work.
It won't. My point was that if we are going to make authors of port header
files do some work, might as well have them implement exactly the
functionality we need instead of adding something that can only be used
in a somewhat hacky/racy way.
> Apart from that, it would *definitely* be a better solution :-).
> Note however that InitOnceExecuteOnce is only available since Windows
> Vista. Using this technique excludes the code from running on Windows XP,
> which isn't likely to be acceptable for some projects (including
> Chromium).
>
> I don't think there is any Win32 primitive that can be used to perform
> thread-safe lazy initialization before Vista. That's why Chromium
> implements its own LazyInstance<> template that heavily relies on atomic
> operations and a spin-lock (since even CRITICAL_SECTION objects can't be
> statically initialized).
>
> In a sense this means that port_chromium.h could be adapted to use it
> instead of InitOnceExecuteOnce, but there may be other Windows-based
> projects that don't have this mechanism available.
They (or the author of port_win32.h) would have to provide an implementation
that internally uses interlocked operations. This is similar to what your
change was doing, but at least hides the ugliness inside a clean interface
and limits it to one port.
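For example, a pre-Vista port could sketch InitOnce with interlocked operations roughly like this (not an actual port_win.h; on weakly-ordered CPUs the fast path would also want an explicit memory barrier):

    #include <windows.h>

    namespace leveldb {
    namespace port {

    typedef volatile LONG OnceType;   // 0 = not run, 1 = running, 2 = done
    #define LEVELDB_ONCE_INIT 0

    inline void InitOnce(OnceType* once, void (*initializer)()) {
      if (*once == 2) return;                              // already done
      if (InterlockedCompareExchange(once, 1, 0) == 0) {   // we won the race
        initializer();
        InterlockedExchange(once, 2);
      } else {
        while (*once != 2) Sleep(0);                       // yield until done
      }
    }

    }  // namespace port
    }  // namespace leveldb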
Original comment by san...@google.com
on 14 May 2012 at 4:36
Original comment by san...@google.com
on 30 May 2012 at 5:11
Original issue reported on code.google.com by
pli...@google.com
on 13 Mar 2012 at 10:14