davnils opened 9 years ago
I would suggest enabling tracing on the server node, to see if the messages are actually arriving at all. Look at http://hackage.haskell.org/package/distributed-process-0.5.5.1/docs/Control-Distributed-Process-Debug.html for setting up tracing on either node (or from the other, if you see what I mean).
Another thing you might like to do is monitor the server process from the client for the duration of the program, but tracing/debugging the server process ought to suffice. I suspect either the server dies with some sort of error after one or more calls, or there is a more fundamental issue further down the stack (in the distributed-process or network-transport layers) that leads to the problem.
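Something along these lines is what I have in mind: a minimal sketch against the Debug module linked above, where the choice of flags and the handler body are purely illustrative, so adjust them to whatever you want to capture.

```haskell
import Control.Distributed.Process
import Control.Distributed.Process.Debug

-- Sketch: enable tracing of sends, receives, process deaths and node events
-- on this node, then install a tracer that dumps every management event to
-- the log via 'say'.
traceEverything :: Process ()
traceEverything = do
  setTraceFlags defaultTraceFlags { traceSend  = traceOn
                                  , traceRecv  = traceOn
                                  , traceDied  = traceOn
                                  , traceNodes = True
                                  }
  _ <- startTracer $ \ev -> say (show ev)
  return ()
```

The environment variables described in that module (DISTRIBUTED_PROCESS_TRACE_CONSOLE and friends) should give you much the same output without touching the code.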
Thanks for the information! I've tried tracing using the environment variables on both server and client.
Example from a successful invocation on the server:
...
Sat Dec 5 11:45:37 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:45:45 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\SOH\NUL\NUL\NUL\NUL\NUL\DC2'## 1 \"<stdin>[huge payload removed]
Sat Dec 5 11:45:45 UTC 2015 - MxSent pid://127.0.0.1:8888:0:10 pid://127.0.0.1:9999:0:10 [unencoded message] :: CallResponse [Char]
Sat Dec 5 11:45:45 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
Sat Dec 5 11:45:57 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
The failing case is just missing the receive event (I terminated it manually after some time):
...
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:5
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:4 "tracer.initial"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:5 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SOtracer.initial\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:5 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:6
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:7
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:6 "mx.table.coordinator"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:7 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DC4mx.table.coordinator\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:8
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:7 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:9
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:8 "logger"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:9 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:10
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:10 "build_server"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:42:29 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
On the client side I see the final send event:
....
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:8
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:9
Sat Dec 5 11:44:28 UTC 2015 - MxRegistered pid://127.0.0.1:8888:0:8 "logger"
Sat Dec 5 11:44:28 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:44:28 UTC 2015 - MxProcessDied pid://127.0.0.1:8888:0:9 DiedNormal
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:10
Sat Dec 5 11:44:28 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:9999:0u\246\US\DC4\NUL\NUL\NUL\n" :: (53cceda13633b212,b80146510d3a0638)
Sat Dec 5 11:44:28 UTC 2015 - MxSent pid://127.0.0.1:9999:0:10 pid://127.0.0.1:8888:0:10 [unencoded message] :: Message [Char] [Char]
but it's missing in the failing case, which otherwise looks similar:
...
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:8
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:9
Sat Dec 5 11:45:47 UTC 2015 - MxRegistered pid://127.0.0.1:8888:0:8 "logger"
Sat Dec 5 11:45:47 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:10
Sat Dec 5 11:45:47 UTC 2015 - MxProcessDied pid://127.0.0.1:8888:0:9 DiedNormal
Sat Dec 5 11:45:47 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:9999:0\DLE\206\GS\218\NUL\NUL\NUL\n" :: (53cceda13633b212,b80146510d3a0638)
So it seems non-trivial to solve this.
Ok, I'll take another look this week and see if I can discern why the send isn't taking place...
Thanks for taking the time. I played around with the example on a recent Ubuntu machine with a faster CPU and did not manage to reproduce it, even with larger payloads (1MB -> 100MB). Not sure if the environment might be a factor.
Quite possibly, we could be looking at a bug in library code (network transport, or even network itself, or something else) that is environment specific.
Let me have a think about how best to debug/instrument this...
Okay I've been unable to reproduce on a linux vm, but I still haven't quite nailed the instrumentation issue. I'm thinking it might be worth enabling some debugging features in these libraries that write additional trace information, since that would allow us to see where the failure resides. I'll look into that too, though it might take some time as my development environment needs rebuilding.
So... I think one possibility is that the server dies between invocations of the client, or between the lookup and send. I'm adding some instrumentation to the server framework, so we should be able to see where things are potentially going wrong. I'll keep you in the loop.
So sorry this has taken months to get around to. I've only just started getting back into actively maintaining these libraries and it's been a slow ramp-up.
So do you see error messages printed for the failing call? If you set up a monitor for the remote server, do you see it die? Or is this a failure in the client code? If you run the server and client on the same node, can it be reproduced?
Because if there's a bug in the server code that prevents tail recursion, then this could blow up the server, and without a reconnect we'd not get a send to the other end from the client...
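To make the monitoring suggestion concrete, here's a sketch of what I'd try in the client ("build_server" is the name from your trace output; the timeout is arbitrary):

```haskell
import Control.Distributed.Process

-- Sketch: resolve the registered server on the remote node, monitor it, and
-- report whether it dies while the client is talking to it.
watchServer :: NodeId -> Process ()
watchServer nid = do
  whereisRemoteAsync nid "build_server"
  WhereIsReply _ mPid <- expect
  case mPid of
    Nothing  -> say "build_server is not registered on the remote node"
    Just pid -> do
      ref <- monitor pid
      -- ... issue the call to the server here ...
      notif <- receiveTimeout 5000000  -- 5 seconds, arbitrary
        [ matchIf (\(ProcessMonitorNotification r _ _) -> r == ref) return ]
      say $ "monitor notification: " ++ show notif
```

If the server really does die between the lookup and the send, the ProcessMonitorNotification should say so; if nothing is reported, the problem is more likely elsewhere.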
Currently all message decoding happens in the main node thread, which means that for large payloads this thread will be a bottleneck. @dcoutts had an idea for moving message decoding to the receiving processes' own threads, which should remove the problem, though as far as I'm aware there has been no discussion of whether that's actually easy to do or what the concrete steps forward would be.
@qnikst - @dcoutts and I discussed using binary's buffering features to assist with that at one point, but I'm not sure if that was it. But if we're saying the send might not be happening in the main thread, what would that be indicative of?
Note that the client is being initiated multiple times, each time with a fresh runtime instance, so I'm really bemused as to what's going on here - if we're essentially doing `for i in $(seq 1 10); do stack exec ...` when executing the clients, why would a send not take place?

Note that the calling code in this library relies on `sendTo` from the `Addressable` type class, and the instance for `ProcessId` uses `send`, whilst the instance for `String` uses `nsend`, so this could be a name registration problem?
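To make those two delivery paths concrete (a sketch, not the library's actual sendTo code):

```haskell
import Control.Distributed.Process

-- Addressing by ProcessId goes straight through 'send'...
deliverByPid :: ProcessId -> String -> Process ()
deliverByPid pid req = send pid req

-- ...while addressing by a registered name goes through 'nsend', which is
-- silently dropped if the name is not (or is no longer) registered, so a
-- missing registration would look exactly like a send that never happened.
deliverByName :: String -> String -> Process ()
deliverByName name req = do
  mPid <- whereis name   -- local registry check, purely for illustration
  say $ name ++ " currently resolves to " ++ show mPid
  nsend name req
```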
Also, would the OP mind re-running this with the logging going to the console, so we can see all the events separately? I note that the client logs you've pasted appear to be reporting management/trace events for both the client and server nodes, and it's not clear to me whether the client send events we're seeing are for the initial or secondary send...
Wait... @qnikst don't you mean message encoding? How can message decoding appear in the main thread? Do we mean conversion from bytes to `Message` occurs in the node controller's thread? Is there a good reason why we should do that there?
Based on discussion with @dcoutts today - this happens only for channels, and he is willing to prepare patches for that. We also found a problem in the CQueue implementation: a message will be decoded on each check. I'll prepare an issue for that today and will try to implement a fix as soon as possible.
That sounds sensible, and thank you for all the detailed investigation on these matters. I'm going to concentrate on fixing the tracing/management events, then focus on this ticket when I get a bit more free time.
@qnikst did we make any progress on moving the decoding into the process' own threads?
Hello wonderful people contributing to this project :)
I'm writing a distributed compiler using the client/server abstraction and ran into an issue with calls to safeCall (or other versions of call) not showing up on the server side.
Here is a reduced version (literally made by slicing down the existing project) capable of reproducing the issue: https://github.com/davnils/kool/tree/testcase The pinned dependencies are based on the versions in this post: https://groups.google.com/forum/#!topic/parallel-haskell/x3y6XNpTEFw
Basically, some executions of the client end up stalling on call, which seems to occur with greater frequency for larger payloads. I have reproduced the failing test case on OS X but have seen similar behaviour on Ubuntu.
Any ideas on what could be wrong? Would appreciate any pointers or ideas on debugging the internals.
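For concreteness, the client side of the failing interaction is roughly shaped like this (a sketch rather than the exact code from the linked repo; module names are assumed from distributed-process-client-server and distributed-process-extras, and "build_server" matches the registration visible in the traces above):

```haskell
import Control.Distributed.Process
import Control.Distributed.Process.Extras (ExitReason)
import Control.Distributed.Process.ManagedProcess (safeCall)

-- Sketch: issue a safeCall against the resolved server pid with a
-- (potentially large) String payload, and log how the call went.
runClient :: ProcessId -> String -> Process ()
runClient server payload = do
  res <- safeCall server payload :: Process (Either ExitReason String)
  case res of
    Left err    -> say $ "call failed: " ++ show err
    Right reply -> say $ "got reply of length " ++ show (length reply)
```

In the failing runs the safeCall never returns, and the server never logs a matching MxReceived event.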