Various tweaks to semaphore (and makefile)

cameron314 commented 9 years ago

This is a bit of a mixed bag of a commit, so I totally understand if you don't want to merge it as-is. I figured I'd let you choose what you like, if anything, instead of just forking silently.

First, thank you for writing the cross-platform + lightweight semaphore! I had had in mind to do something remarkably similar for absolute ages, but just never had the time to get around to it. Now I don't have to :-) Your blog post, too, was high-quality as usual. I've taken the lightweight semaphore and used it to implement a blocking wrapper around both of my lock-free queues (spsc and mpmc).

The changes I made are:

I replaced the #include of windows.h with manually-declared prototypes for the specific Win32 semaphore functions called. This is definitely a controversial change, as Microsoft tends to frown on this sort of thing, but I absolutely hate including windows.h in a header. Matter of preference, I suppose.
The platform-specific semaphores are implemented using the int type, but there's no reason the lightweight semaphore can't go beyond that and use the signed equivalent of size_t -- after all, the only values passed on to the platform-specific semaphores are the negative counts corresponding to the number of waiting threads, and those should fit in an int, even if the maximum count can go beyond that limit. So I changed this, introducing a ssize_t type.
I added an availableApprox() method to the lightweight semaphore that returns the available count. This is useful for debugging, unit tests, probabilistic algorithms, etc.
I added tryWaitMany() and waitMany() methods that "acquire" many items at once, instead of just one; this is required in order to implement e.g. a blocking bulk dequeue like the one I have in my MPMC queue. They could potentially be useful to others looking to do bulk operations with a semaphore.
I changed the tryWait() method to use a CAS-loop instead of just a single CAS, since under contention the CAS will often fail even if the count is way above 0. (Obviously, the wait() method doesn't strictly need this since it proceeds to spin-wait after anyway, but it matters for those who call the tryWait() method directly without a follow-up call to wait().)
I also fixed cmake 'Unix Makefiles' generation to work when on Windows/MinGW64.

Your unit tests still pass. I can't tell if there's any effect on performance since the tests seem to take a wildly different amount of time to execute each time on my PC (which admittedly is a rather wimpy AMD C-50 dual-core netbook processor -- I had to lower the iteration count temporarily on some of them otherwise it would have taken over half an hour to get the test results, heh).

Thank you for your hard work!

hlide commented 9 years ago

FYI, sizeof(int) != sizeof(ssize_t) in 64-bit environment: in AMD64, it will issue more lengthy instructions (64-bit operation on register needs an instruction prefix while 32-bit operation doesn't) for no big deal (I won't expect for more than 2 billion threads to wait !).

cameron314 commented 9 years ago

@hlide: Yes, that was my intention. I didn't know the instruction length was different, though -- does it make any difference to performance? EDIT: It does not. Both std::atomic<int> and std::atomic<long long> take identical amounts of time to do a FAA and a CAS, respectively, at least on my 64-bit Intel Core 2 processor. See my benchmark here.

In any case, it's not for the number of waiting threads (when the count is negative), but rather for the number of available items (when the count is positive).

hlide commented 9 years ago

@cameron314: as long as the instruction cache can absorb those extra bytes, it won't. Suppose that all the code of a big program was using 'int' and was fine in 64-bit, then you switch 'int' into 'ssize_t', you may end up with a bigger binary size which is not that negligible. The same for data size too (which is as twice as it was so data cache will certainly suffer because a cache line will only handle the half of what it could before).

preshing commented 9 years ago

Thanks for the MinGW fix. I'm half tempted by the <windows.h> trick, but I'll leave it alone for now.

hlide commented 9 years ago

@cameron314: off-topic I'm quite interested with your free-lock SPSC and MPMC queues implementations, but I have a question. I need a free-lock MPSC queue and I was wondering how to achieve it (blocking version or not).

cameron314 commented 9 years ago

Ah, yes, I hadn't considered the cacheline implications. Interesting.

@hlide: Hmm, MPSC. I don't know of any off the top of my head -- you're probably best off picking a general MPMC queue and just using that. Mine might not offer the best characteristics for that scenario, as it's built out of SPMC queues internally. The best way is to try a few and find the one that's fastest for your particular workload (by profiling). Just know that both SPMC and MPSC can be engineered much more efficiently than MPMC.

preshing / cpp11-on-multicore

Various tweaks to semaphore (and makefile) #3