rjagerman / glint

Glint: High performance scala parameter server
MIT License

Struggling with data transfer / actor disassociated #56

Closed by MLnick 7 years ago

MLnick commented 7 years ago

@rjagerman when doing some localhost testing I've run into timeout issues during push/pull of a vector.

When the vectors are relatively small (like ~100k elements) everything seems to work fine. When I've tried larger feature dimensions (1,355,191 elements for news20.binary libsvm dataset) things start timing out.

At the beginning of an iteration in rdd.foreachPartition I do:

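import scala.concurrent.Await
import scala.concurrent.duration._

// block for at most 5 seconds while pulling the current values
val values = Await.result(vector.pull(keys), 5.seconds)
// compute gradient and update
...
vector.push(keys, update.data)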
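
To tell a plain timeout apart from another failure, and to see how long the call actually blocked, the blocking wait can be wrapped in a small helper. This is a minimal sketch using only the Scala standard library; `TimedAwait` and `awaitTimed` are hypothetical names and not part of Glint.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Try

object TimedAwait {
  // Blocks on `f` for at most `timeout`.
  // Returns the outcome (value or exception) together with the elapsed time in milliseconds.
  def awaitTimed[T](f: Future[T], timeout: FiniteDuration): (Try[T], Long) = {
    val start = System.nanoTime()
    val outcome = Try(Await.result(f, timeout))
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (outcome, elapsedMs)
  }
}
```

With a helper like this, something along the lines of `TimedAwait.awaitTimed(vector.pull(keys), 5.seconds)` would yield a `Failure` holding a `java.util.concurrent.TimeoutException` when the wait expires, or whatever exception the future itself failed with otherwise.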

After turning on more logging in client.akka.loglevel I see the following:

[WARN] [09/21/2016 10:53:55.915] [glint-client-akka.remote.default-remote-dispatcher-5] [akka.tcp://glint-client@192.168.100.122:50080/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1] Association with remote system [akka.tcp://glint-server@192.168.100.122:63032] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
[INFO] [09/21/2016 10:53:55.921] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/deadLetters] Message [glint.messages.server.request.PullVector] from Actor[akka://glint-client/temp/$f] to Actor[akka://glint-client/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.921] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/deadLetters] Message [glint.messages.server.request.PullVector] from Actor[akka://glint-client/temp/$j] to Actor[akka://glint-client/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.921] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter] Message [akka.remote.EndpointWriter$AckIdleCheckTimer$] from Actor[akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter#1446622162] to Actor[akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter#1446622162] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.923] [glint-client-akka.actor.default-dispatcher-3] [akka://glint-client/deadLetters] Message [akka.remote.RemoteWatcher$Heartbeat$] from Actor[akka://glint-client/system/remote-watcher#-1279985067] to Actor[akka://glint-client/deadLetters] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[WARN] [09/21/2016 10:53:55.925] [glint-client-akka.remote.default-remote-dispatcher-13] [akka.tcp://glint-client@192.168.100.122:50080/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63204-2] Association with remote system [akka.tcp://glint-server@192.168.100.122:63204] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
[INFO] [09/21/2016 10:53:55.926] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/deadLetters] Message [akka.remote.RemoteWatcher$Heartbeat$] from Actor[akka://glint-client/system/remote-watcher#-1279985067] to Actor[akka://glint-client/deadLetters] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.927] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/deadLetters] Message [glint.messages.server.request.PullVector] from Actor[akka://glint-client/temp/$k] to Actor[akka://glint-client/deadLetters] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.928] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63204-2/endpointWriter] Message [akka.remote.EndpointWriter$BackoffTimer$] from Actor[akka://glint-client/deadLetters] to Actor[akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63204-2/endpointWriter#-1245711256] was not delivered. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:55.932] [glint-client-akka.actor.default-dispatcher-2] [akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter] Message [akka.remote.EndpointWriter$BackoffTimer$] from Actor[akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter#1446622162] to Actor[akka://glint-client/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fglint-server%40192.168.100.122%3A63032-1/endpointWriter#1446622162] was not delivered. [8] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:56.292] [glint-client-akka.actor.default-dispatcher-3] [akka://glint-client/deadLetters] Message [akka.remote.RemoteWatcher$Heartbeat$] from Actor[akka://glint-client/system/remote-watcher#-1279985067] to Actor[akka://glint-client/deadLetters] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [09/21/2016 10:53:56.292] [glint-client-akka.actor.default-dispatcher-3] [akka://glint-client/deadLetters] Message [akka.remote.RemoteWatcher$Heartbeat$] from Actor[akka://glint-client/system/remote-watcher#-1279985067] to Actor[akka://glint-client/deadLetters] was not delivered. [10] dead letters encountered, no more dead letters will be logged. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
16/09/21 10:54:01 ERROR Executor: Exception in task 1.0 in stage 28.0 (TID 59)
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]

Any idea what the cause might be? I would expect the vector's value array to be around 10.8 MB (1,355,191 values at 8 bytes each). I've set the max frame size and the send/receive buffer sizes to 32 MB, and launched 1x Master and 2x Servers with this config, as well as my client app with the same config (based on the one you provided earlier).
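
The size estimate can be checked with a few lines of arithmetic. This is a rough sketch that assumes values are 8-byte doubles and keys are 8-byte longs, and ignores any serialization overhead; `PayloadEstimate` is a hypothetical helper, not part of Glint.

```scala
object PayloadEstimate {
  // Raw bytes for n values, assuming 8-byte doubles (assumption).
  def doubleBytes(n: Long): Long = n * 8L

  // Raw bytes for n keys, assuming 8-byte longs (assumption).
  def longBytes(n: Long): Long = n * 8L

  // Decimal megabytes.
  def toMB(bytes: Long): Double = bytes / 1e6

  def main(args: Array[String]): Unit = {
    val n = 1355191L // feature dimension of news20.binary
    val valuesOnly = doubleBytes(n)
    val keysAndValues = valuesOnly + longBytes(n)
    println(s"values only:   $valuesOnly bytes")
    println(s"keys + values: $keysAndValues bytes")
  }
}
```

Under these assumptions the values alone come to 10,841,528 bytes (about 10.8 MB), and a message carrying both keys and values comes to 21,683,056 bytes (about 21.7 MB). Both figures are below the 32 MB frame size.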

Strangely, I'm not seeing exceptions like the ones in #48.

MLnick commented 7 years ago

I'm running everything locally with Spark master local[4]. Also using Spark 2.0.0 (not that this should make any difference).

rjagerman commented 7 years ago

This is interesting, I wonder why the actor becomes disassociated... This normally only happens when an actor goes completely dark and becomes unresponsive (e.g. JVM crash, system crash, etc.).

Perhaps when a pull/push request becomes so large that the actor spends too much time processing it, the actor can't respond to regular system messages (e.g. heartbeats) and is essentially "dark". Even then, you should really get a "PullFailedException" with a timeout, so there is something interesting going on here...

MLnick commented 7 years ago

@rjagerman I got things working by using GranularBigMatrix for my large vector. So it appears that even though the message size is below the max frame size, the size of the push message still causes a problem, perhaps by overwhelming the actor somehow.
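
The general idea of splitting one large request into several smaller ones can be sketched independently of Glint. The code below is an illustration under stated assumptions, not Glint's actual implementation: `ChunkedPull` and `pullInChunks` are hypothetical names, and the `pull` parameter stands in for a call such as `vector.pull(keys)`. Chunks are requested one after another, so only one small message is in flight at a time.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ChunkedPull {
  // Splits `keys` into chunks of at most `chunkSize` and pulls them sequentially.
  // `pull` is a hypothetical stand-in for the real remote call.
  // Results are concatenated in the original key order.
  def pullInChunks(keys: Array[Long], chunkSize: Int)(
      pull: Array[Long] => Future[Array[Double]]
  )(implicit ec: ExecutionContext): Future[Array[Double]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    val start = Future.successful(Vector.empty[Double])
    keys
      .grouped(chunkSize)
      .foldLeft(start) { (acc, chunk) =>
        // The next pull is only issued once the previous chunk has completed.
        acc.flatMap(done => pull(chunk).map(values => done ++ values))
      }
      .map(_.toArray)
  }
}
```

A call might look like `ChunkedPull.pullInChunks(keys, 100000)(vector.pull)`, where the chunk size of 100,000 is only an illustrative value taken from the vector size that worked earlier in this thread.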

It is strange that we get this disassociation rather than a PushFailedException... let me know if you get a chance to take a look. I will try to dig deeper into the code path too.

MLnick commented 7 years ago

Opened #58 to add GranularBigVector.

MLnick commented 7 years ago

Closing as the issue can be addressed using a granular vector/matrix.