davnils opened 9 years ago
I would suggest enabling tracing on the server node, to see if the messages are actually arriving at all. Look at http://hackage.haskell.org/package/distributed-process-0.5.5.1/docs/Control-Distributed-Process-Debug.html for setting up tracing on either node (or from the other, if you see what I mean).
Another thing you might like to do is monitor the server process from the client for the duration of the program, but tracing/debugging the server process ought to suffice. I suspect either the server dies with some sort of error after one or more calls, or there is a more fundamental issue further down the stack (in the distributed-process or network-transport layers) that leads to the problem.
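Something along these lines is what I have in mind: a minimal sketch against the Debug module linked above, where the choice of flags and the handler body are purely illustrative, so adjust them to whatever you want to capture.

```haskell
import Control.Distributed.Process
import Control.Distributed.Process.Debug

-- Sketch: enable tracing of sends, receives, process deaths and node events
-- on this node, then install a tracer that dumps every management event to
-- the log via 'say'.
traceEverything :: Process ()
traceEverything = do
  setTraceFlags defaultTraceFlags { traceSend  = traceOn
                                  , traceRecv  = traceOn
                                  , traceDied  = traceOn
                                  , traceNodes = True
                                  }
  _ <- startTracer $ \ev -> say (show ev)
  return ()
```

The environment variables described in that module (DISTRIBUTED_PROCESS_TRACE_CONSOLE and friends) should give you much the same output without touching the code.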
Thanks for the information! I've tried tracing using the environment variables on both server and client.
Example from a successful invocation on the server:
...
Sat Dec 5 11:45:37 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:45:45 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\SOH\NUL\NUL\NUL\NUL\NUL\DC2'## 1 \"<stdin>[huge payload removed]
Sat Dec 5 11:45:45 UTC 2015 - MxSent pid://127.0.0.1:8888:0:10 pid://127.0.0.1:9999:0:10 [unencoded message] :: CallResponse [Char]
Sat Dec 5 11:45:45 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
Sat Dec 5 11:45:57 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
The failing case is just missing the receive event (I terminated it manually after some time):
...
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:5
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:4 "tracer.initial"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:5 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SOtracer.initial\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:5 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:6
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:7
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:6 "mx.table.coordinator"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:7 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DC4mx.table.coordinator\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:8
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:7 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:9
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:8 "logger"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:40:06 UTC 2015 - MxProcessDied pid://127.0.0.1:9999:0:9 DiedNormal
Sat Dec 5 11:40:06 UTC 2015 - MxSpawned pid://127.0.0.1:9999:0:10
Sat Dec 5 11:40:06 UTC 2015 - MxRegistered pid://127.0.0.1:9999:0:10 "build_server"
Sat Dec 5 11:40:06 UTC 2015 - MxReceived pid://127.0.0.1:9999:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:42:29 UTC 2015 - MxNodeDied nid://127.0.0.1:8888:0 DiedDisconnect
On the client side I see the final send event:
....
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:8
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:9
Sat Dec 5 11:44:28 UTC 2015 - MxRegistered pid://127.0.0.1:8888:0:8 "logger"
Sat Dec 5 11:44:28 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:44:28 UTC 2015 - MxProcessDied pid://127.0.0.1:8888:0:9 DiedNormal
Sat Dec 5 11:44:28 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:10
Sat Dec 5 11:44:28 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:9999:0u\246\US\DC4\NUL\NUL\NUL\n" :: (53cceda13633b212,b80146510d3a0638)
Sat Dec 5 11:44:28 UTC 2015 - MxSent pid://127.0.0.1:9999:0:10 pid://127.0.0.1:8888:0:10 [unencoded message] :: Message [Char] [Char]
but it's missing in the failing case, which otherwise looks similar:
...
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:8
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:9
Sat Dec 5 11:45:47 UTC 2015 - MxRegistered pid://127.0.0.1:8888:0:8 "logger"
Sat Dec 5 11:45:47 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (5c165eac86be1c65,67e664e295ebd8aa)
Sat Dec 5 11:45:47 UTC 2015 - MxSpawned pid://127.0.0.1:8888:0:10
Sat Dec 5 11:45:47 UTC 2015 - MxProcessDied pid://127.0.0.1:8888:0:9 DiedNormal
Sat Dec 5 11:45:47 UTC 2015 - MxReceived pid://127.0.0.1:8888:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\fbuild_server\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:9999:0\DLE\206\GS\218\NUL\NUL\NUL\n" :: (53cceda13633b212,b80146510d3a0638)
So it seems non-trivial to solve this.
Ok, I'll take another look this week and see if I can discern why the send isn't taking place...
Thanks for taking the time. I played around with the example on a recent Ubuntu machine with a faster CPU and did not manage to reproduce it, even with larger payloads (1MB -> 100MB). Not sure if the environment might be a factor.
Quite possibly, we could be looking at a bug in library code (network transport, or even network itself, or something else) that is environment specific.
Let me have a think about how best to debug/instrument this...
Okay I've been unable to reproduce on a linux vm, but I still haven't quite nailed the instrumentation issue. I'm thinking it might be worth enabling some debugging features in these libraries that write additional trace information, since that would allow us to see where the failure resides. I'll look into that too, though it might take some time as my development environment needs rebuilding.
So... I think one possibility is that the server dies between invocations of the client, or between the lookup and send. I'm adding some instrumentation to the server framework, so we should be able to see where things are potentially going wrong. I'll keep you in the loop.
So sorry this has taken months to get around to. I've only just started getting back into actively maintaining these libraries and it's been a slow ramp-up.
So do you see error messages printed for the failing call? If you set up a monitor for the remote server, do you see it die? Or is this a failure in the client code? If you run the server and client on the same node, can it be reproduced?
Because if there's a bug in the server code that prevents tail recursion, then this could blow up the server, and without a reconnect we'd not get a send to the other end from the client...
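To make the monitoring suggestion concrete, here's a sketch of what I'd try in the client ("build_server" is the name from your trace output; the timeout is arbitrary):

```haskell
import Control.Distributed.Process

-- Sketch: resolve the registered server on the remote node, monitor it, and
-- report whether it dies while the client is talking to it.
watchServer :: NodeId -> Process ()
watchServer nid = do
  whereisRemoteAsync nid "build_server"
  WhereIsReply _ mPid <- expect
  case mPid of
    Nothing  -> say "build_server is not registered on the remote node"
    Just pid -> do
      ref <- monitor pid
      -- ... issue the call to the server here ...
      notif <- receiveTimeout 5000000  -- 5 seconds, arbitrary
        [ matchIf (\(ProcessMonitorNotification r _ _) -> r == ref) return ]
      say $ "monitor notification: " ++ show notif
```

If the server really does die between the lookup and the send, the ProcessMonitorNotification should say so; if nothing is reported, the problem is more likely elsewhere.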
Currently all message decoding happens in the main node thread, which means that for large payloads this thread will be a bottleneck. @dcoutts had an idea for moving message decoding to the receiving processes' own threads, which should remove the problem, though as far as I'm aware there has been no discussion of whether that's actually easy to do or what the concrete steps forward would be.
@qnikst - @dcoutts and I discussed using binary's buffering features to assist with that at one point, but I'm not sure if that was it. But if we're saying the send might not be happening in the main thread, what would that be indicative of?
Note that the client is being initiated multiple times, each time with a fresh runtime instance, so I'm really bemused as to what's going on here - if we're essentially doing `for i in $(seq 1 10); do stack exec ...` when executing the clients, why would a send not take place?

Note that the calling code in this library relies on `sendTo` from the `Addressable` type class, and the instance for `ProcessId` uses `send`, whilst the instance for `String` uses `nsend`, so this could be a name registration problem?
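To make those two delivery paths concrete (a sketch, not the library's actual sendTo code):

```haskell
import Control.Distributed.Process

-- Addressing by ProcessId goes straight through 'send'...
deliverByPid :: ProcessId -> String -> Process ()
deliverByPid pid req = send pid req

-- ...while addressing by a registered name goes through 'nsend', which is
-- silently dropped if the name is not (or is no longer) registered, so a
-- missing registration would look exactly like a send that never happened.
deliverByName :: String -> String -> Process ()
deliverByName name req = do
  mPid <- whereis name   -- local registry check, purely for illustration
  say $ name ++ " currently resolves to " ++ show mPid
  nsend name req
```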
Also, would the OP mind re-running this with the logging going to the console, so we can see all the events separately? I note that the client logs you've pasted appear to be reporting management/trace events for both the client and server nodes, and it's not clear to me whether the client send events we're seeing are for the initial or secondary send...
Wait... @qnikst don't you mean message encoding? How can message decoding appear in the main thread? Do we mean conversion from bytes to `Message` occurs in the node controller's thread? Is there a good reason why we should do that there?
Based on discussion with @dcoutts today - this happens only for channels, and he is willing to prepare patches for that. We also found a problem in the CQueue implementation: a message will be decoded on each check. I'll prepare an issue for that today and will try to implement a fix as soon as possible.
That sounds sensible, and thank you for all the detailed investigation on these matters. I'm going to concentrate on fixing the tracing/management events, then focus on this ticket when I get a bit more free time.
@qnikst did we make any progress on moving the decoding into the process' own threads?
Hello wonderful people contributing to this project :)
I'm writing a distributed compiler using the client/server abstraction and ran into an issue with calls to safeCall (or other versions of call) not showing up on the server side.
Here is a reduced version (literally made by slicing down the existing project) capable of reproducing the issue: https://github.com/davnils/kool/tree/testcase The pinned dependencies are based on the versions in this post: https://groups.google.com/forum/#!topic/parallel-haskell/x3y6XNpTEFw
Basically, some executions of the client end up stalling on call, which seems to occur with greater frequency for larger payloads. I have reproduced the failing test case on OS X but have seen similar behaviour on Ubuntu.
Any ideas on what could be wrong? Would appreciate any pointers or ideas on debugging the internals.
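For concreteness, the client side of the failing interaction is roughly shaped like this (a sketch rather than the exact code from the linked repo; module names are assumed from distributed-process-client-server and distributed-process-extras, and "build_server" matches the registration visible in the traces above):

```haskell
import Control.Distributed.Process
import Control.Distributed.Process.Extras (ExitReason)
import Control.Distributed.Process.ManagedProcess (safeCall)

-- Sketch: issue a safeCall against the resolved server pid with a
-- (potentially large) String payload, and log how the call went.
runClient :: ProcessId -> String -> Process ()
runClient server payload = do
  res <- safeCall server payload :: Process (Either ExitReason String)
  case res of
    Left err    -> say $ "call failed: " ++ show err
    Right reply -> say $ "got reply of length " ++ show (length reply)
```

In the failing runs the safeCall never returns, and the server never logs a matching MxReceived event.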