I'm running everything locally with Spark master `local[4]`. Also using Spark 2.0.0 (not that this should make any difference).
This is interesting, I wonder why the actor becomes disassociated... This normally only happens when an actor goes completely dark and becomes unresponsive (e.g. JVM crash, system crash, etc.).
Perhaps when a pull/push request becomes very large, the actor spends so much time processing it that it can't respond to regular system messages (e.g. heartbeats) and is essentially "dark". Although you should really get a `PullFailedException` with a timeout, so there is something interesting going on here...
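One way to test that theory would be to give the failure detector a lot more slack, so a long-running push/pull can't get the actor declared dead. Something roughly like this, using Akka's standard remoting keys; I'm assuming here that the client's Akka settings are nested under `glint.client.akka` (as the `client.akka.loglevel` key suggests), that the same overrides belong under the server config too, and that the client can be built from a custom `Config`, so adjust to however you actually construct things:

```scala
import com.typesafe.config.ConfigFactory
import glint.Client

// Sketch: tolerate long pauses before the remote actor is considered dead,
// and keep the frame size large enough for the biggest single message.
// Mirror the same overrides under glint.server.akka when launching servers.
val config = ConfigFactory.parseString(
  """
    |glint.client.akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 60 s
    |glint.client.akka.remote.watch-failure-detector.acceptable-heartbeat-pause = 60 s
    |glint.client.akka.remote.netty.tcp.maximum-frame-size = 32 MiB
  """.stripMargin).withFallback(ConfigFactory.load())

val client = Client(config) // assumption: the client accepts a custom Config
```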
@rjagerman I got things working by using `GranularBigMatrix` for my large vector. So it does appear that, even though the message size is below the max frame size, the size of the push message is still causing an issue, perhaps overwhelming the actor somehow.
It is strange that we get this disassociation rather than a `PushFailedException`... let me know if you get a chance to take a look. I will try to dig deeper into the code path too.
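For reference, the workaround is just wrapping the model in the granular variant so one logical push/pull gets chunked into many small messages. Roughly like this (the exact constructor arguments are from memory, so double-check them):

```scala
import glint.Client
import glint.models.client.granular.GranularBigMatrix

val client = Client() // connect with the launch config (constructor details may differ)

// A 1 x 1,355,191 matrix standing in for the large weight vector.
val underlying = client.matrix[Double](1, 1355191)

// Wrap it so each push/pull is split into messages of at most ~10k values,
// keeping every individual Akka message small and quick for the server actor
// to process, so it can keep answering heartbeats.
val weights = new GranularBigMatrix[Double](underlying, 10000)

// Pulls/pushes against `weights` have the same semantics as the underlying
// matrix, they are just sent as many smaller requests.
```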
Opened #58 to add `GranularBigVector`.
Closing as the issue can be addressed using a granular vector/matrix.
@rjagerman when doing some localhost testing I've run into timeout issues during push/pull of a vector.
When the vectors are relatively small (~100k elements) everything seems to work fine. When I try larger feature dimensions (1,355,191 elements for the news20.binary libsvm dataset) things start timing out. At the beginning of an iteration in `rdd.foreachPartition` I do a push/pull against the vector (sketched at the end of this post). After turning on more logging via `client.akka.loglevel` I can see the association with the remote server actor being dropped.

Any idea on the cause? I would expect the vector array size to be around 10.8MB. I've set the max frame and send/receive buffer sizes to 32MB and launched 1x Master and 2x Servers with this config, as well as my client app with the same config (based on the one you provided earlier).
Strangely, I'm not seeing exceptions like in #48.
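The per-partition step is roughly along these lines. This is a sketch rather than my exact code: the input data, gradient computation and explicit timeouts are placeholders, and I'm assuming the usual `Client()`/`vector` pull/push API taking key and value arrays and returning Futures, with the vector handle usable inside the Spark closure:

```scala
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.SparkContext
import glint.Client

val sc = new SparkContext("local[4]", "glint-vector-test")
val data = sc.parallelize(0 until 4, 4)        // placeholder: one dummy element per partition

val client = Client()                          // same config as the master/servers
val weights = client.vector[Double](1355191)   // one weight per news20.binary feature

data.foreachPartition { _ =>
  // Pull the full weight vector (~10.8 MB of doubles), compute a local update,
  // and push the deltas back -- this pull/push is what starts timing out.
  val keys = (0L until 1355191L).toArray
  val current = Await.result(weights.pull(keys), 120.seconds)

  val deltas = current.map(_ => 0.0)           // placeholder for the computed gradient update
  Await.result(weights.push(keys, deltas), 120.seconds)
}
```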