vibe-d / vibe.d

Official vibe.d development
MIT License

Implementing TCMalloc in Vibe.d #631

Open etcimon opened 10 years ago

etcimon commented 10 years ago

I think vibe.d's performance could be improved by moving many allocations, such as TCPContext instances, to a thread-local allocator like TCMalloc from gperftools. Its performance seems to be 3-4x better than malloc because it's nearly lockless. Has this ever been considered?

s-ludwig commented 10 years ago

Using a lockless or even thread-local allocator would indeed be a good idea. I'm not sure how much it would actually buy in terms of overall performance, though. I think there was a time when the manualAllocator didn't use locks. I didn't do an exact comparison, but I don't think it made a real difference for the whole system (HTTP server).

etcimon commented 10 years ago

I looked into it a little more, and it seems logical that using __gshared AutoFreeListAllocator[threads] instead of shared AutoFreeListAllocator could improve speed. Most of the benchmarks out there are very optimistic about it.

etcimon commented 10 years ago

I agree that it wouldn't be noticeable, it's something like 50ns vs 400ns on alloc/free, so I think it would mostly be useful when implementing custom algorithms above it unrelated to I/O.

s-ludwig commented 10 years ago

Or even just a thread-local FreeList instead of __gshared FreeList[threads] - at least for allocating non-shared objects, shared ones would still have to be allocated using a shared allocator. The only thing is that it becomes absolutely critical that non-shared objects are never accidentally passed between threads.
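
To make the thread-local idea concrete, here is a minimal sketch of a lock-free-by-construction free list in D. It is not vibe.d's actual allocator: the `FreeList` template, its `alloc`/`release` methods, and the `tlsList` variable are hypothetical names chosen for illustration. The sketch relies on D's rule that module-level variables are thread-local unless marked `shared` or `__gshared`.

```d
import core.stdc.stdlib : malloc;

// Hypothetical fixed-size free list (illustration only, not vibe.d code).
struct FreeList(size_t blockSize)
{
    static assert(blockSize >= (void*).sizeof,
        "a free block must be able to hold the 'next' pointer");

    private static struct Node { Node* next; }
    private Node* head;

    // Pop a cached block if one is available, otherwise fall back to malloc.
    // May return null if malloc fails.
    void* alloc()
    {
        if (head !is null)
        {
            auto n = head;
            head = n.next;
            return cast(void*) n;
        }
        return malloc(blockSize);
    }

    // Push the block onto the list; it must come from alloc() on this list.
    void release(void* p)
    {
        auto n = cast(Node*) p;
        n.next = head;
        head = n;
    }
}

// Thread-local by default: every thread gets its own instance, so neither
// alloc() nor release() needs a lock.
FreeList!64 tlsList;

unittest
{
    auto a = tlsList.alloc();
    tlsList.release(a);
    assert(tlsList.alloc() is a); // the released block is reused
}
```

Compiling with `dmd -unittest -main` runs the check. The sketch never returns memory to the system and assumes a block is always released on the thread that allocated it, which is exactly the constraint discussed here.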

etcimon commented 10 years ago

it becomes absolutely critical that non-shared objects are never accidentally passed between threads.

The pointers are still valid, but freeing them from a foreign thread is what causes the problem, because that thread's TLS FreeList won't contain them. So having this __gshared FreeList[threads] is really useful: the other FreeLists can be scanned when a pointer is freed from another thread, and only that "exception" path would need to be synchronized (unless I'm forgetting something?)

etcimon commented 10 years ago

so that "exception" part could be synchronized

I think I've done this before, using a shared bool that indicates whether a foreign thread is using the FreeList and, if true, activates a synchronized statement for thread-local accesses.

s-ludwig commented 10 years ago

Unfortunately, if you need to synchronize the foreign free list to scan it for a pointer, then all threads must synchronize access to that list, too, rendering the thread-local part useless. But language-wise, it's illegal anyway to access a non-shared pointer from a foreign thread (Isolated!T would be an interesting exception here), so I wouldn't try to make that work, but rather try to eliminate all possibilities to pass such pointers between threads and make a loud noise when it happens anyway (assertion). The worst place here is probably the GC, which can invoke finalizers from any thread, regardless of where an object has been created.
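
One way the "loud noise" could look is sketched below, under stated assumptions: each block carries a small header recording its owning thread, and freeing asserts that the caller is that owner. The names `Header`, `threadTag`, `currentTag`, `ownerOf`, `tlAlloc` and `tlFree` are all hypothetical and not part of vibe.d. The owner tag is simply the address of a thread-local variable, which differs between live threads.

```d
import core.stdc.stdlib : free, malloc;
import core.thread : Thread;

// Hypothetical header stored in front of every block (illustration only).
struct Header { const(void)* owner; }

// Module-level variables are thread-local in D, so the address of this
// variable differs between threads and can serve as an owner tag.
private int threadTag;

const(void)* currentTag() { return &threadTag; }

const(void)* ownerOf(void* p) { return (cast(Header*) p - 1).owner; }

void* tlAlloc(size_t size)
{
    auto h = cast(Header*) malloc(Header.sizeof + size);
    assert(h !is null, "out of memory");
    h.owner = currentTag();
    return h + 1; // payload starts right after the header
}

void tlFree(void* p)
{
    // Fail loudly instead of silently corrupting another thread's state.
    assert(ownerOf(p) is currentTag(), "block freed from a foreign thread");
    free(cast(Header*) p - 1);
}

unittest
{
    auto p = tlAlloc(32);
    bool foreign;
    // A second thread sees a different tag, so tlFree(p) would assert there.
    auto t = new Thread({ foreign = ownerOf(p) !is currentTag(); });
    t.start();
    t.join();
    assert(foreign);
    tlFree(p); // freeing on the owning thread passes the check
}
```

Note that the header shifts the payload by one pointer size, so the payload alignment is weaker than what malloc guarantees, and that asserts are compiled out with `-release`. The sketch also does nothing about GC finalizers, which remain the hard case mentioned above.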

s-ludwig commented 10 years ago

I think I've done this before, using a shared bool to indicate if a foreign thread is using the FreeList and which activates synchronized statement for thread-local accesses if true.

But that wouldn't be any more efficient than the typical mutex in the first place (which does more or less the same already).

etcimon commented 10 years ago

But that wouldn't be any more efficient than the typical mutex in the first place (which does more or less the same already).

The duration of the locking is the difference here. Being able to avoid locking the whole alloc/free operation by checking a shared bool can compensate.

s-ludwig commented 10 years ago

Duration of locking is not interesting for performance. Duration of contention is. And this won't change here.

etcimon commented 10 years ago

Duration of locking is not interesting for performance. Duration of contention is. And this won't change here.

I'll have to look that up, but I've had this chain of thought before and I think it was refuted because the processor caches would not be in sync if I used a shared bool as a wall. Using a thread-local freelist is probably the only obvious solution.

etcimon commented 10 years ago

Just another question while you're here: I'm writing a compiler for ASN.1 code right now (because I'm thinking of putting all my types as overhead for faster serialization AND rewriting a new SSL library in D).

I started using Pegged to search for .asn1 files in the project directories to compile them to .d files (everything converts directly to D structures - e.g. Information Object Class in ASN.1 is just a templated class object where SYNTAX is the template parameters). This looks really similar to how .dt files could be handled. Do you think it would be faster to write a .dt to .d compiler as a separate program to save heavy compile-time lags every time they're processed?

s-ludwig commented 10 years ago

Do you think it would be faster to write a .dt to .d compiler as a separate program to save heavy compile-time lags every time they're processed?

It would definitely be faster, both for development time and for the actual build process. But the reason I wouldn't want to do that anyway is that I really dislike complicating the build process. I think that with the recent compiler improvements and with DUB's separate compilation mode, the compile time is now pretty acceptable. Unfortunately, Don Clugston seemingly never got to finish his work on reducing the number of allocations required for CTFE (although there has been important progress). I'm pretty sure that this would finally mitigate the issue.

As for Pegged, I never considered using it for the Diet templates because the specialized parser is already so resource hungry. I can only imagine what happens if an additional compiler-compiler step is added at compile time.

etcimon commented 10 years ago

But the reason I wouldn't want to do that anyway is that I really dislike complicating the build process.

That's probably arguable b/c you have complete control over DUB and could easily integrate it right there ;)

etcimon commented 10 years ago

Though it would be a good reason to add plugin support to dub

s-ludwig commented 10 years ago

Well, there are already "preGenerateCommands" and "preBuildCommands", which can be used to run external tools for such purposes (especially once the path to a dependency is available, so that dependencies can be run as part of the build process). But even if having DUB indeed makes the issue less bad, I would still much rather invest time in a more efficient compiler than in a more complex build system. It's a really nice asset of D that it's able to do such things, and it would be a pity if it wasn't used just because the implementation lags behind (AFAIK it's for historical reasons: originally the whole CTFE functionality wasn't planned, and thus everything grew on a base which wasn't designed for performance).

etcimon commented 10 years ago

It's a really nice asset of D that it's able to do such things and it would be a pity if it wasn't used just because the implementation lags behind

It does lag behind. I've taken for granted that it'll never have proper CTFE filesystem access or memoization. That means recompiling a mixin string even when the underlying code is unchanged, and being unable to load/save a specific path, let alone use networking. It would be amazing to have networking during CTFE, if only to load e.g. language strings from a database to repopulate the default values of a static array at compile time.

Of course, having a separate toD compiler for these, with the entire D toolset available, is a great option that I only started to explore a week ago. I'm seeing much better opportunities this way, and "pegged" being SO simple to use (compare it to .. Boost Spirit! Phoenix for lambdas!) makes it a really great option. It's a pity though that shell commands through preBuildCommands are the only way to hack it into the build process, but I guess there's plenty of time to think it over. The best idea is to have some plugin ability in dub.. it would become fairly simple (full access to dub.json configurations and a great compile-time callback interface is the key here).

s-ludwig commented 10 years ago

While I don't have any serious use case in mind where I'd want to use networking, database access or similar high-level things, doing partial compilation seems to be more an issue (or feature) of the build system than of the compiler. It would, for example, easily be possible to compile each Diet template to a separate library and recompile each one only when required.

But regarding high-level code at compile time, while it is not really a hard argument (using "preGenerateCommands" to do malicious things is a simple alternative), such things feel kind of dangerous to me. Sites like dpaste would have a much harder time keeping things sandboxed. Also, malicious code would usually be much more visible in build scripts than it would be in the depths of the application sources. But, well, if there are strong use cases, this argument is indeed not very strong.

Regarding the plugins, I'll have to think more about it and have to read your proposal in detail, but so far I think that the best way would be to extend what can be done with the shell command interface (e.g. by providing custom fields of dub.json as environment variables to the command).

etcimon commented 10 years ago

It would for example easily be possible to compile each Diet template to a separate library and only recompile each one only when required.

Yes, I've considered using agglomerations of .lib files and it's a fair alternative.

such things feel kind of dangerous to me. Sites like dpaste would have a much harder time to keep things sandboxed.

It would be quite dangerous, much like eval()ing a JavaScript string sent from a web client on Node.js. More powerful always means more dangerous, but that doesn't make it any less exciting if it is from every standpoint.

Regarding the plugins, I'll have to think more about it and have to read your proposal in detail, but so far I think that the best way would be to extend what can be done with the shell command interface (e.g. by providing custom fields of dub.json as environment variables to the command).

You could think I'm a little crazy, but the idea was that plugins could add some fields to the dub.json structure to forward compile-time variables, e.g. for better bounds on static arrays like static HTTPRouter[DOMAIN_COUNT] g_routers - or for compiling some strings and configs directly into structures - or even for xor'ing them at compile-time to protect from memory snooping (heh.)