twitter-archive / torch-ipc

A set of primitives for parallel computation in Torch
Apache License 2.0
95 stars 28 forks source link

Adding implementation of channels to torch.ipc #42

Closed nybbles closed 7 years ago

nybbles commented 7 years ago

Why?

Channels are a simplification of the workqueues already implemented in torch.ipc. A torch.ipc workqueue has two queues - one that is used to send items to workers and the other that is used to send back results to the single owner thread. A channel just has a single queue that you can enqueue or dequeue from.

Channels can be used to specify concurrent workflows in a far more flexible way than workqueues. For example, a unit test is included that contains an example of a workqueue implemented with N workload generators and M workers that does not use explicit control signals. Instead, the fact that a channel has been closed or has been drained is used to stop workers. There is also a unit test which shows how local model parallelism for forward inference can be implemented using channels.

What?

Channels are a thread synchronization primitive based on message-passing. Threads communicate via channels by writing messages onto them and reading messages out of them, in FIFO order. There is no restriction on which threads or how many threads can read or write from a channel.

Channels can also be closed, which prevents further writes to it. Once all items are read from a closed channel, that channel becomes drained and nothing further can be read from it. DAGs of computation made up for channels can be shut down via cascading closing/draining of channels.

The semantics of ipc.channel is based on:

  1. https://gobyexample.com/channels
  2. https://tour.golang.org/concurrency/2
  3. https://github.com/clojure/core.async/blob/master/examples/walkthrough.clj

Behaviors not implemented

I tried to make this change as non-intrusive as possible while still providing basic functionality. Therefore, there is a bunch of duplicated code between ipc.workqueue and ipc.channel and there are some pieces of functionality that were not implemented, specifically:

  1. It is not possible to specify the max number of items that can be written to a channel. The channel just grows to allow the write, instead of blocking on write. Therefore, just like workqueue, there is no backpressure mechanism.
  2. There is no select call to select between a number of channels.

Next steps

The following are possible next steps:

  1. Make ipc.workqueue use ipc.channel internally, instead of having duplicated code.
  2. Make it possible to specify the max number of items that can be written to a channel.
  3. Implement select call.

Also, another issue that can happen for both workqueues and channels is that you can cause a memory leak by losing a reference to a workqueue or channel that still has items on it. That should be fixed.

clementfarabet commented 7 years ago

Thanks for the nice description/context!

Seems like the build is failling, can you check: https://travis-ci.org/twitter/torch-ipc/jobs/190464716 ?

nybbles commented 7 years ago

Okay now it's green :)

nicholas-leonard commented 7 years ago

@nybbles Can you add some documentation in doc/channel.md? The above is a great preamble. But the doc should include one or two examples.

More generally, ipc.channel is awesome.

nybbles commented 7 years ago

@nicholas-leonard added some docs

nybbles commented 7 years ago

Here's a link to them: https://github.com/nybbles/torch-ipc/blob/channels/doc/channel.md