Closed SX91 closed 7 years ago
Hi and thanks for the detailed bug description and the test case.
What operating system are you using? The waiting code is platform dependent. The Windows part loops several times introducing a delay with exponential back-off in each iteration. I might have messed up some tail recursion.
Hi! Linux, Fedora 24 with latest updates. GHC 7.10.3 (LTS-6.25).
Maybe you have to bracket unsafeRead/WriteSTM and execute the unregister IO action it comes with?
UPD: Replacing threadWaitReadSTM
with threadWaitRead
fixes the problem.
threadWaitReadSTM :: Fd -> IO (Sync.STM (), IO ())
threadWaitReadSTM fd
#ifndef mingw32_HOST_OS
| threaded = Event.threadWaitReadSTM fd
#endif
| otherwise = do
m <- Sync.newTVarIO False
_ <- Sync.forkIO $ do
threadWaitRead fd
Sync.atomically $ Sync.writeTVar m True
let waitAction = do b <- Sync.readTVar m
if b then return () else retry
let killAction = return ()
return (waitAction, killAction)
https://hackage.haskell.org/package/base-4.9.0.0/docs/src/GHC.Conc.IO.html#threadWaitReadSTM
Just wow! It creates the TVar, forks a new greenthread and defines killAction as return ()
!? So that's the whole reason we have a leak.
Mhh, that is interesting.
It is noticeable that this is only a problem with the non-threaded runtime. Your fix contains a race condition I hoped to have circumvented in the original version (I'm not so sure anymore whether I succeeded for all combinations of platform and runtime environments): The interest in the file descriptor shall be registered while holding the lock on the socket while the waiting itself shall not hold the lock (withMVar
, see below). The whole issue might also have something to do with issue #25. I'd like to properly understand and fix both together.
tryWaitRetryLoop :: Socket f t p -> (Fd -> Int-> IO (IO ())) -> (Fd -> Ptr CInt -> IO CInt) -> IO CInt
tryWaitRetryLoop (Socket mfd) getWaitAction action = loop 0
where
loop iteration = do
ewr <- withMVar mfd $ \fd-> alloca $ \errPtr-> do
when (fd < 0) (throwIO eBadFileDescriptor)
i <- action fd errPtr
if (i < 0) then do
err <- SocketException <$> peek errPtr
unless (err == eWouldBlock || err == eAgain) (throwIO err)
Left <$> getWaitAction fd iteration -- This call is non-blocking!
else
return (Right i)
case ewr of
Left wait -> do
wait -- This call is blocking!
loop $! iteration + 1
Right result -> do
return result
That's pretty bad as you can't really fix unsafeSocketWait{Read,Write}
without modifying the Base
.
I consider threadWaitReadSTM
implementation buggy. I can't see any reason for killAction
to be return ()
instead of threadKill threadID
.
Ah, it's leaking TVar
s (both with the threaded and non-threaded runtime):
Without -threaded
:
With -threaded
:
You're right: My code ignores the second value returned by threadWaitReadSTM
and it shouldn't.
The documentation says:
The second returned value is an IO action that can be used to deregister interest in the file descriptor.
It is rather an obligation than an option. The reason this didn't cause problems earlier is that as soon as the socket receives an event the registrations are released automatically as they are OneShot
. Creating more and more registrations without any socket events happening (as in the test case) is rather unusual but nonetheless reveals a serious bug.
Still, the fallback implementation of threadWaitReadSTM
(that is used with the single-threaded RTS) seems broken as you pointed out: It's leaking threads. Changing the
UPD Bug has already been filed and fixed and the proposed killAction
to threadKill threadID
though might cause another undesired race. I'll think about this a little further and eventually file a bug report for the base
library as soon as I'm sure I haven't overseen anything.threadKill
is a valid solution (see https://ghc.haskell.org/trac/ghc/ticket/12852)
So, I got more familiar with GHCs internals and how the event manager interacts with the waiting mechanism. This is the implementation of threadWait
which uses an MVar
to actually block (it's a very neat idea indeed):
threadWait :: Event -> Fd -> IO ()
threadWait evt fd = mask_ $ do
m <- newEmptyMVar
mgr <- getSystemEventManager_
reg <- registerFd mgr (\_ e -> putMVar m e) fd evt M.OneShot
evt' <- takeMVar m `onException` unregisterFd_ mgr reg -- This one is blocking.
if evt' `eventIs` evtClose
then ioError $ errnoToIOError "threadWait" eBADF Nothing Nothing
else return ()
The STM
mechanism always seemed a bit like overkill to me, but it allowed the separation between registration and waiting. Now that it turns out to be broken, it might be worth to reconsider it all and eventually propose an extended threadWait
implementation. I wonder how all the code around manages to not cause horrible race conditions using these primitives ...?! Will do some research on this.
Okay, here is how I think it should look like:
threadWait' :: Exception e => Event -> ((Fd -> IO a) -> IO a) -> e -> IO ()
threadWait' evt withFd e = do
mgr <- getSystemEventManager_
mevt <- newMVar
bracketOnError
( withFd $ \fd->
registerFd mgr (\_ evt' -> putMVar mevt evt') fd evt M.OneShot
)
( void . unregisterFd_ )
( const $ do
evt' <- takeMVar mevt
unless (evt' .&. evt /= 0) (throwIO e)
)
threadWait :: Event -> Fd -> IO ()
threadWait evt fd = threadWait' evt (\f-> f fd) (errnoToIOError "threadWait" eBADF Nothing Nothing)
@lpeterse, what are the possible race conditions in threadWait
that you are talking about?
Meanwhile threadWaitSTM now returns a correct killAction, but I see no single point in using STM approach and then just block, idea of the STM variant is that you will check condition in complex STM transaction, otherwise it just introduce unnecessary complexity and overheads for no reason.
The race condition I have in mind is not in threadWait
itself, but occurs when using it to wait on a file descriptor that is in an MVar under the assumption that the user closes the descriptor before having closed all threads that are eventually still performing reads and writes.
Here is a solution using threadWait
exemplifying the race. It reads the file descriptor from the MVar
and then waits after releasing the lock:
exampleReceive :: MVar Fd -> IO ByteString
exampleReceive mfd = do
fd <- readMVar mfd
threadWaitRead fd -- (1)
withMVar mfd $ \fd-> unsafeReceive fd 4096
My socket library locks the file descriptor for all operations, especially for closing. In the pseudo-code above in (1) the socket could have been closed in the meantime (that is not too bad as it would just cause an exception) but (extremely unlikely but not impossible) it could also already have been reassigned and designate a completely different socket.
Maybe the whole approach of allowing the user to close the socket without first terminating all depending threads is wrong, but on the other hand operations like closeFdWith
imply this is a valid scenario.
Yesterday evening, I implemented the proposed threadWait
that holds the lock for registration, but not for the waiting itself. This makes the STM mechanism obsolete. It works quite well, although some things needed for a clean solution (eventIs
, evtClose
etc) are not exported from the internal GHC module. Oh, and it does not work with the single threaded RTS, of course.
Possibly I'm oversimplifying a problem, so let me clarify.
The problem appear in case if fd is closed after fd is read from mvar but before threadWaitRead is called, and threadWaitRead
does not provide functionality to make that safe? While threadWaitReadSTM
does, as one can call something like:
join $ withMVar mfd $ \fd -> do
(wait, kill) <- threadWaitSTM fd
return (fd, atomically wait `finally` kill)
but threadWaitReadSTM
have a bug at least until 8.0 that have a memory leak in case if data does not arrive.
As a result you need your own threadWaitRead
(or change ghc's) that is similar to the ghc one, but allow to solve your problem?
UPD corrected code a bit.
Yes, that's essentially it.
I have put some thought into it, but I simply don't come up with a solution using the standard threadWaitRead
. It either leads to the described race condition or I cannot release the lock until the wait terminates.
The STM
solution with the finally kill
does the trick, though, but I suspect the overhead is not really to justify.
For the record, just referencing the corresponding GHC ticket I was not yet aware of: https://ghc.haskell.org/trac/ghc/ticket/12852 and https://phabricator.haskell.org/D2729
Thanks to @qnikst
So here are the next TODOs:
threadWaitReadSTM
and threadWaitWriteSTM
.base
(probably by integrating a bug-free copy of these functions right into the library and using a base version-pragma).That's not easy. As you can see in Base's Conc.hs
, it uses threaded
foreign function that is local to the module, it uses some internal Base modules etc.
As for threaded
it's not a problem as it's a synonym to supportsBoundThreads that is exported from Control.Concurrent
, also same FFI call can be constructed:
libraries/base/GHC/Event/Thread.hs
foreign import ccall unsafe "rtsSupportsBoundThreads" threaded :: Bool
isEvent looks a bit more complex.
I think it could be possible to derive a standalone instance of Enum
to recover the contained Int
value. Beautiful is something different...
@qnikst Do you think I should propose the modified threadWaitX
for inclusion in base
or is this a too specific use case and the STM version would be considered appropriate enough?
P.S. I'm currently working on the first TODO to get at least the multi-threaded RTS case fixed (Windows is not affected at all as it uses another mechanism anyway).
I think standard threadWaitRead
can work, at a cost of some overhead, a bit simplified version as one have to carefully think abut async exceptions, that may arrive here and there:
threadWaitRead' :: MVar Fd -> IO ()
threadWaitRead mfd = mask_ $ do
box <- newEmpty
withMVar_ mfd $ \fd -> do
forkIO $ try (threadWaitRead fd) >>= putMVar box
return fd
either throwIO return =<< takeMVar box
in this case we should not have race condition between events.
@lpeterse I'm not sure I'm the best person for that question, I'm not familiar enough with internals. It would be better to consult on freenode:#ghc about the question.
hm.. seems I'm not correct, we should not exit withMVar_
until threadWaitRead
was actually called, but seems it's not a case here. But threadWaitReadSTM
is affected by the same issue anyway, so possibly your proposal is the only stable way forward. I'll think a bit more about this.
Thanks so far. Any further ideas are always welcome :-)
I think I fixed it for both the single-threaded and the multi-threaded runtime. The test-socket
program does no longer leak memory on my machine. I put up a 0.8.x.x branch for validation and testing.
Feel free to review and find bugs/races in the following relevant part:
waitRead :: Socket f t p -> Int -> IO ()
waitRead s _ = wait s threadWaitRead threadWaitReadSTM
waitWrite :: Socket f t p -> Int -> IO ()
waitWrite s _ = wait s threadWaitWrite threadWaitWriteSTM
wait :: Socket f t p -> (Fd -> IO ()) -> (Fd -> IO (STM (), IO ())) -> IO ()
wait (Socket mfd) threadWait threadWaitSTM
| rtsSupportsBoundThreads = mapException
( const eBadFileDescriptor :: IOError -> SocketException )
( bracketOnError
( withMVar mfd $ \fd -> do
when (fd < 0) (throwIO eBadFileDescriptor)
threadWaitSTM fd
) snd ( atomically . fst )
)
| otherwise = do
m <- newEmptyMVar
bracketOnError
( withMVar mfd $ \fd-> do
when (fd < 0) (throwIO eBadFileDescriptor)
forkIO $ catch
( threadWait fd >> putMVar m True )
( \(SomeException _)-> putMVar m False )
) killThread
( const $ takeMVar m >>= flip unless (throwIO eBadFileDescriptor) )
I'll close this as fixed. It will be part of the 0.8.0.0 release.
I've written some small test program to illustrate heap growing on
unsafeSocketWaitRead
with near 100% CPU load (https://github.com/SX91/haskell-socket-test). Run it withtest-socket 300
(300 microseconds to make heap grow faster).Timeout (task cancellation) itself causes heap and CPU load increasingly grow.