Just confirming: Concurrency issues?

saurabhnanda commented 8 years ago

Do these bindings have any concurrency issues on the lines of https://github.com/bos/mysql/issues/11 ? Just confirming.

lpsmith commented 8 years ago

Not that I'm aware of. libpq is a real joy to work with in comparison to odbc or mysqlclient. All of the state is in the connection object, and libpq does not use any thread-local state.

That said, there probably are issues with postgresql-libpq in the presence of asynchronous exceptions, causing memory leaks and possibly other problems. Also, postgresql-libpq does not attempt to add any kind of concurrency safety; this is the responsibility of a higher-level binding. For example, postgresql-simple stores the connection object pointer in an MVar to provide a modicum of concurrency safety, and that's been used in production by myself and others with no issues that I'm aware of. You can have both multiple connections and multiple haskell threads concurrently using a single connection without any known problems.

Also note that postgresql-libpq is also trivially vulnerable to use-after-free memory faults if you use a connection after it's been explicitly closed. It is also the responsibility of a higher-level binding to fix this problem. postgresql-simple does this by (conceptually) representing a connection by an MVar (Maybe LibPQ.Connection), though in actuality it's an MVar LibPQ.Connection with Nothing represented by a null pointer in order to avoid adding an additional layer of indirection.

saurabhnanda commented 8 years ago

All of the state is in the connection object, and libpq does not use any thread-local state.

Is using thread-local state a problem because of Haskell's IO multi-plexing? That is, every forkIO thread doesn't really map to an single dedicated OS thread, and if the underlying library assumes that, it can very easily get into the following aituation:

FFI call blocks an OS thread, which means the corresponding Haskell thread will block as well
Haskell's RTS schedules another forkOS thread, but it will now end-up scheduling it over a blocked OS thread, or the number of available OS threads for scheduling will reduce. Former will cause a deadlock, latter will degrade concurrent performance.

Is my understanding correct?

saurabhnanda commented 8 years ago

You can have both multiple connections and multiple haskell threads concurrently using a single connection without any known problems.

Shouldn't it be absolutely forbidden to share a connection between Haskell threads? Why not? How will the library (postresql-simple, or this library, or the C library) handle two threads sending queries on the same connection?

lpsmith commented 8 years ago

It sounds like your understanding isn't too far off, but it's unclear enough that you might want to read my brief overview of concurrency in GHC that I wrote up a number of years ago. Roman also wrote up the issue you linked to, here.

Thread local state is not necessarily a show-stopper, but a foreign library that uses it will put non-local constraints on structure of your Haskell program, due to the need to use forkOS to ensure that foreign calls happen in a particular OS thread, and not sharing those foreign resources among haskell threads.

The fact that libpq is explicit about all of its state means that, even if postgresql-libpq is not thread safe, it's easy to make higher level wrappers thread safe. Now, any given higher level wrapper may or may not be thread safe, for varying values of safety, but postgresql-simple is, because it serializes libpq calls to individual connection objects (but not global serialization, of course). Thus, while it's perfectly possible to introduce higher-level race conditions, (e.g. by using transactions or other situations where you need to prohibit certain interleavings of round-trips to postgres) these lower level races are no longer a concern.

saurabhnanda commented 8 years ago

Thank you for pointing me to your blog post. Very information and very well written.

The FFI supports two kinds of calls: safe and unsafe. A “safe” call blocks the calling OS thread and Haskell thread until the call finishes. Blocking these is unavoidable. However, a safe call does not block the capability. When GHC performs a safe call, it performs the call inside the current OS thread, which is a capability. The capability then moves to another OS thread. If no other threads are available, GHC will transparently create a new OS thread. OS threads are pooled to try to avoid creating or destroying them too often.

That was a key-point I was missing. The OS thread AND the Haskell thread, both, are blocked. And if the RTS is low on number of unblocked threads, it will create more transparently. Pretty cool.

An unsafe call blocks the capability in addition to blocking the OS and Haskell threads. This was a pretty big deal when GHC only supported a single capability. In effect, an unsafe call would block not only the Haskell thread that made the call, but every Haskell thread. This gives rise to the myth that unsafe calls block the entire Haskell runtime, which is no longer true. An unsafe foreign call that blocks is still undesirable, but depending on configuration, may not be as bad as it used to be.

And this is where the confusion re-surfaces. From a scheduling/blocking perspective, safe vs unsafe calls seem to have the exact same behaviour. Neither seems to a be a big problem. Or am I missing something?

From the top comment:

Just to make it clearer (for anyone reading) what blocking a capability means: it means all Haskell threads scheduled on that capability are blocked for the duration of the call, not just the one making the foreign call. If the function completes quickly (or if there aren’t any other Haskell threads scheduled on the capability, e.g. if M isn’t (much) larger than N), it won’t matter. Otherwise, it might.

And back to the confusion. I thought we were talking of an M-N model where any Haskell thread may be scheduled on top of any capability. Or, are Haskell threads "pinned" to a single capability and always scheduled on that?

Ok, I have confirmed that that you are absolutely correct in that Haskell threads are assigned a capability, and that those threads will not be rescheduled on a different capability during an unsafe FFI call. So thus, in effect, an unsafe call blocks all the Haskell threads on the capaibility that it is blocking. Thanks for the clarification!

So, is this the final understanding as of 2016 and GHC 7.8+ (the original blog post is from 2011, so may be outdated)?

No FFI -- any thread can be scheduled on any capability (OS thread)
Safe FFI -- more overhead required to preserve the "No FFI" behaviour
Unsafe FFI -- OS thread blocks, and so do all Haskell threads scheduled on top of that OS thread. Also, Haskell threads are "pinned" to OS threads.

lpsmith commented 8 years ago

The difference between safe and unsafe is that safe does not block the capability it's executing on; whereas with unsafe it does. In effect, a safe call blocks only the OS thread and the Haskell thread, whereas an unsafe call blocks the OS thread, the Haskell thread, and all other Haskell threads that happen to be scheduled on the same capability.

lpsmith commented 8 years ago

No, Haskell threads are not pinned to OS threads, or capabilities. A haskell thread will be executed on a single OS thread, but that OS thread will change over time. Capabilities are the OS threads that are executing Haskell code; there's a fixed number of capabilities but the OS threads they use also change over time. And Haskell threads will migrate between capabilities over time as well.

However, haskell threads will not migrate away from a capability while that capability is executing an unsafe FFI call; so that FFI call blocks all the haskell threads currently scheduled on the capability.

saurabhnanda commented 8 years ago

I don't understand why we need the "capabilities" abstraction over underlying threads, but does it matter significantly for the purpose of this discussion?

However, haskell threads will not migrate away from a capability while that capability is executing an unsafe FFI call; so that FFI call blocks all the haskell threads currently scheduled on the capability.

So, the way unsafe-FFI calls will degrade performance is by blocking more than one Haskell thread. It will not impact other threads, which have not been scheduled on the blocked underlying capability/OS-thread. Other Haskell threads will continue running, and if the scheduler falls short, it will spawn more OS threads. Right?

From a webapp point of view (where each incoming request spawns a new Haskell thread, and where each thread needs to do a DB operation), does it basically mean that concurrency will be limited to the number of underlying capabilities/OS-threads that the RTS spawns?

lpsmith commented 8 years ago

Capabilities are not an abstraction; they are an implementation detail.

And no, concurrency is not limited by the number of capabilities; there can be more OS threads than capabilities to deal with blocking FFI calls, and GHC's native IO is non-blocking under the hood. Many Haskell programs will have significantly more Haskell threads than capabilities.

Parallelism (of the computational variety) is limited by the number of capabilities, but it's also limited by the number of CPUs you have. The rule of thumb is that the number of capabilities you set should be less than or equal to the number of CPUs you have, often 1 less, and sometimes even less depending on circumstances

I seem to recall that some people have occasionally found a minor performance benefit to having one or two more capabilities than CPUs, at least in specific circumstances with specific versions of GHC, but maybe my mind is playing tricks on me. Once you have some code in hand, and a reasonable benchmark suite, you can find out what works best for you.

saurabhnanda commented 8 years ago

Any thoughts about the following? This was the main reason why I started investigating this issue.

From a webapp point of view (where each incoming request spawns a new Haskell thread, and where each thread needs to do a DB operation), does it basically mean that concurrency will be limited to the number of underlying capabilities/OS-threads that the RTS spawns?

lpsmith commented 8 years ago

Well, postgresql-simple moved to asynchronous operation of libpq several years ago, so no, postgresql-simple will be using the IO manager, not OS threads. Most (all?) of the other publicly available, higher-level wrappers around postgresql-libpq on the other hand continue to use blocking FFI calls, meaning they are making use of OS threads.

However, the reasons for going async was not performance, and my current projects are not performance sensitive in that way. So I haven't really gotten to test the difference; and while some of the other postgresql-simple users could probably test this, I haven't heard any reports regarding the performance impacts.

Of course, HaSQL is notable in that it supports binary parameters and binary results; this has several less than obvious tradeoffs, but it is certainly a substantial performance boost for transferring some types of values such as integers, floats, and timestamps. It would be interesting to try HaSQL in some of my projects, but unfortunately it continues to be a bit too opinionated on several counts, in ways that are show-stopping issues for any realistic experiment.

saurabhnanda commented 8 years ago

Well, postgresql-simple moved to asynchronous operation of libpq several years ago, so no, postgresql-simple will be using the IO manager, not OS threads. Most (all?) of the other publicly available, higher-level wrappers around postgresql-libpq on the other hand continue to use blocking FFI calls, meaning they are making use of OS threads.

Wouldn't the FFI layer "terminate" at postgresql-libpq (i.e. this project)? Wouldn't everything in postgresql-simple be in pure Haskell and be not bothered about the FFI details? Or is this that postgresql-simple gets to choose whether it wants to do async operations or sync operations (with postgresql-libpq supporting both).

So, if I understand you correctly, are you saying the following:

libpq allows two modes of operation -- sync (which blocks the OS thread), and async (which would be some sort of callback/polling based interface)
postgresql-simple uses the async version of library calls, and therefore, always participates in cooperative multi-tasking.

PS: Sorry to be bugging you about this.

lpsmith commented 8 years ago

Basically, yes, you are correct. Actually, postgresql-simple on windows still uses blocking FFI calls (due to upstream issues with GHC), so whenever it's blocked it's blocked inside a safe FFI call in an OS thread. But on unix, when postgresql-simple blocks, it is using GHC's IO manager to block, not an OS thread.

See the exec function.

Actually, postgresql-simple on unix will in some cases use OS threads to block when calling PQ.sendQuery, I'd have to investigate sending queries asynchronously. But for most use cases, this doesn't happen as often as blocking on the results of queries.

lpsmith / postgresql-libpq

Just confirming: Concurrency issues? #33