rjagerman / glint

Glint: High performance scala parameter server
MIT License

Start ParameterServer threads on workers #1

rjagerman closed this issue 8 years ago

rjagerman commented 8 years ago

The parameter server needs to be started on specific machines in a cluster. The first step is to get an Akka listener running on those machines and to provide a simple, stand-alone way to deploy it.

rjagerman commented 8 years ago

This is a first outline of how to do this:

  1. Each worker machine starts a JVM (e.g. via SSH or a cluster manager) that runs the ParameterServer run method. This method starts an empty actor system.
  2. We start a ParameterManager JVM on the master machine.
  3. On request, the ParameterManager spawns ParameterServer actors on those worker JVMs using Akka remoting.
  4. The ParameterManager returns ActorRef references to the remote parameter servers, which the user's code can then use in any way (e.g. to "pull" and "push" model slices).
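
The four steps above can be mimicked in a single process to make the control flow concrete. This is only a rough sketch using the Scala standard library, not Akka and not Glint's code: `ToyWorker`, `ToyServerRef` and `ToyParameterManager` are hypothetical names, `ToyServerRef` merely stands in for an ActorRef, and the round-robin placement is an assumption that the outline does not specify.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an Akka ActorRef: names one server on one host.
final case class ToyServerRef(host: String, id: Int)

// Step 1: a worker JVM that starts empty and can host servers later.
final class ToyWorker(val host: String) {
  private val hosted = mutable.ArrayBuffer.empty[ToyServerRef]

  // Creates a server on this worker and returns a reference to it.
  def spawn(id: Int): ToyServerRef = {
    val ref = ToyServerRef(host, id)
    hosted += ref
    ref
  }

  def hostedCount: Int = hosted.size
}

// Steps 2-4: the manager on the master spawns servers on request
// (round-robin over the workers, an assumption) and hands the references back.
final class ToyParameterManager(workers: Vector[ToyWorker]) {
  require(workers.nonEmpty, "at least one worker is required")
  private var nextId = 0

  def spawnServers(count: Int): Vector[ToyServerRef] =
    Vector.tabulate(count) { i =>
      val ref = workers(i % workers.size).spawn(nextId)
      nextId += 1
      ref
    }
}
```

With two workers, `new ToyParameterManager(workers).spawnServers(3)` places two servers on the first worker and one on the second, and the caller only ever sees the returned references.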

rjagerman commented 8 years ago

After some consideration this is the current implementation:

  1. We start an instance in master mode on one machine (Master.run).
  2. Each worker machine starts a JVM and runs the application in server mode (Server.run).
  3. The user code constructs a client object by providing the proper configuration (new Client(config)). This client is the entry point to the master and is serializable.
  4. The user code can construct large models distributed over the servers by calling implementations of glint.models.BigModel. The first implementation, which uses a basic Array as its data structure, is glint.models.array.ArrayBigModel.
  5. A user calls ArrayBigModel.create[T](...) to create a large distributed array of type T.
  6. The returned ArrayBigModel object is then used to access the data without the user needing to know the physical location of the parameters. The pull and push methods get and set the parameters, respectively.
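
To illustrate the location transparency described in steps 4 to 6, here is a minimal in-process sketch. It is not Glint's implementation or API: `ToyBigModel`, `ToyArrayBigModel` and their signatures are hypothetical, the "servers" are plain local arrays rather than remote JVMs, the calls are synchronous, and the contiguous block partitioning is an assumption.

```scala
import scala.reflect.ClassTag

// Hypothetical interface, loosely mirroring the role of glint.models.BigModel.
trait ToyBigModel[T] {
  def size: Long
  def pull(keys: Seq[Long]): Seq[T]                // get parameters
  def push(keys: Seq[Long], values: Seq[T]): Unit  // set parameters
}

object ToyArrayBigModel {
  // Splits `size` parameters into contiguous blocks, one block per "server".
  def create[T: ClassTag](size: Long, numServers: Int, default: T): ToyBigModel[T] = {
    require(size > 0 && numServers > 0, "size and numServers must be positive")
    val perServer = ((size + numServers - 1) / numServers).toInt
    val blocks: Vector[Array[T]] = Vector.tabulate(numServers) { i =>
      val start = i.toLong * perServer
      val len = math.max(0L, math.min(perServer.toLong, size - start)).toInt
      Array.fill(len)(default)
    }
    new Impl[T](size, perServer, blocks)
  }

  private final class Impl[T](val size: Long, perServer: Int, blocks: Vector[Array[T]])
      extends ToyBigModel[T] {

    // Maps a global key to (block index, offset inside that block).
    private def locate(key: Long): (Int, Int) = {
      require(key >= 0 && key < size, s"key $key out of range")
      ((key / perServer).toInt, (key % perServer).toInt)
    }

    def pull(keys: Seq[Long]): Seq[T] = keys.map { k =>
      val (b, o) = locate(k)
      blocks(b)(o)
    }

    def push(keys: Seq[Long], values: Seq[T]): Unit = {
      require(keys.length == values.length, "keys and values must have equal length")
      keys.zip(values).foreach { case (k, v) =>
        val (b, o) = locate(k)
        blocks(b)(o) = v
      }
    }
  }
}
```

For example, after `val model = ToyArrayBigModel.create[Double](10L, 3, 0.0)` and `model.push(Seq(0L, 4L, 9L), Seq(1.5, 2.5, 3.5))`, the three keys live in three different blocks, yet `model.pull(Seq(0L, 4L, 9L, 5L))` returns `Seq(1.5, 2.5, 3.5, 0.0)` without the caller ever naming a block.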

Closing this issue since the basic functionality through a stand-alone deployment is effectively done.