project-rig / rig

A collection of tools for developing SpiNNaker applications.
GNU General Public License v2.0
4 stars 0 forks source link

Use of N:M (multi-source, multi-sink) nets (possibly a severe issue) #153

Open mossblaser opened 9 years ago

mossblaser commented 9 years ago

The problem

After thinking a little following @mundya's #143 which makes the routing table generator merge routes which share keys and masks. I'm presuming that this above change was made in an effort to support N:M (multi-source, multi-sink) nets within Rig's net model, rather supporting some black-magic post-processing stuff and so I'll just consider N:M nets here.

Rig's "Net" model of multicast routes is that they are 1:M (single-source, multiple sink). Further the assumption is made that keys/masks are globally unique, one per net, and completely unambiguous in all situations (which use of BitFields can help ensure). This model has the nice property that it can straight-forwardly map to valid routing tables in the general case. (As an aside: this set of restrictions isn't clear in the docs and needs emphasising).

N:M nets can clearly be a useful thing to have: consider a case one population of neurons, a, is connected all-to-all with a second population, b. Here an N:M net could be used whose N sources were the processors simulating a and M sinks are the processors simulating b.

Clearly, such a net can exist in SpiNNaker, for example here's an example where the two populations are split across two processors each and reside on separate chips:

nm net

Indeed, this route could be generated by using two 1:M nets with the same key (after #143, anyway). There are plenty of others examples you could concoct where this approach also works correctly.

Unfortunately, N:M nets with one key/mask are not possible in the general case on SpiNNaker. Consider the following concrete example:

nm broken

Here the two populations a and b are split into two processors each but this time there's an a and b processor on each chip. On the top chip, a route going South and to the b core would be created. On the bottom chip, a route going North and to the core with b would be created. The result is that any packet in the net will be infinitely bounced back-and-forth between the two chips creating duplicate packets for the two b cores and generally killing the network.

What next

This is by no means an unknown problem and, as far as I'm aware, something all existing software has had to solve independently, by hand. Unfortunately it is cripplingly easy to fall silently into this trap accidentally with Rig, especially following #143. This duplication of effort and also the ease-of-silent-failure in Rig seems like a Bad-Thing (TM). The following solutions come to mind:

  1. Support general N:M nets as the underlying primitive abstraction in Rig. To do this Rig's Net abstraction would need a major re-work requiring far more involvement in the key-allocation process and also complicating the routing process a little.

    As such I'm rather against this since it adds a lot of complexity for what isn't an equally major expense. It will also substantially break backward compatibility and a huge amount of implementation work. Volunteers are welcome of course...

  2. Support N:M nets where all N sources are placed on the same chip. A new N:M net type would be defined which, in the initial problem description, acts as a general N:M net. The placement and allocation steps should not require any significant work to support such a net. After placement, a new processing step would split any N:M nets with source vertices placed on multiple chips into a new set of N:M nets for which all source vertices are on the same chip. The routing algorithm can then treat these N:M nets with all N sources on a single chip the same way as a 1:M net is currently treated.

    The user-application is then burdened with assigning unique keys/masks to any nets created as a result of splitting up N:M nets, a process which should be straight-forward. An additional side-effect of this process is that a larger number of routing entries may be produced than a 'smart' general N:M router. Though such entries should be very readily compressible using @mundya's algorithm, the routing algorithm would inherently operate with less information than a true N:M router and therefore may generate less efficient routes in general.

    This approach would be a modest amount of work for a pretty good (though not perfect) result. It could even be done in a completely backward-compatible way if required. (It would probably be best done by making the base Net type an N:M net breaking backwards compatibility very slightly.)

  3. Revert #143 and add a set of sanity checks to the build_routing_tables which cause the process to fail where nets with the same key/mask cross. Obviously this doesn't fully enforce the Rig Net model (e.g. it won't catch ambiguous but different keys/masks) but it would help things fail-early. Applications would be required to create N 1:M nets in place of a N:M net.

    This is clearly the easiest approach and doesn't (majorly) break backwards compatibility. It does however come at a relatively significant cost in terms of key space and routing entries.

Opinions are sought from both @mundya and @neworderofjamie on this matter since I imagine this is something you both have had to deal with in some form or another in your respective codebases. In particular I'd be interested to know what your existing solution is and whether you even consider this worth making Rig's problem? Either way, I presume you'd agree that making more noise if you violate Rig's Net model (in places where a Rig utility function like build_routing_tables depends on it) would be a good thing?

mundya commented 9 years ago

My preference would probably be for 2, given that that's basically the process Nengo/SpiNNaker performs. I'd be very unhappy with 3 because it would require major reworking in Nengo. This has a feel, to me at least, of a "this is a fundamental part of using SpiNNaker, developers should think about this" problem rather than a problem that we can just provide a solution for... definitely prepared to be corrected on that!

I definitely think that more documentation would be good, and for the moment until we have a solution that documentation should explicitly describe this problem.

mossblaser commented 9 years ago

My preference is option 2 as well (as you might guess). How strong are your feelings about the potential for lost efficiency over going with option 1? My gut feeling is that if you're hard-core enough to do that, you should be hard-core enough to not use the Rig abstractions.

mundya commented 9 years ago

How strong are your feelings about the potential for lost efficiency over going with option 1? My gut feeling is that if you're hard-core enough to do that, you should be hard-core enough to not use the Rig abstractions.

I'd agree, I think.

(Coincidentally, the Nengo/SpiNNaker code for this is https://github.com/project-rig/nengo_spinnaker/blob/master/nengo_spinnaker/partition_and_cluster.py#L136-L219)

mossblaser commented 9 years ago

(Coincidentally, the Nengo/SpiNNaker code for this is https://github.com/project-rig/nengo_spinnaker/blob/master/nengo_spinnaker/partition_and_cluster.py#L136-L219)

I found this while looking around to make sure #143 hadn't caused something invalid to appear in nengo_spinnaker. I should hasten to add I was not surprised to discover that it hadn't and that the solution was very pleasant! ;)

Do you recon building option 2 will result in further simplification to nengo_spinnaker or would it be a nuisance?

Incidentally, I asked the PACMAN folk what their approach was out of curiosity and theirs apparently boils down to something option 3-like.

mundya commented 9 years ago

Do you recon building option 2 will result in further simplification to nengo_spinnaker or would it be a nuisance?

I would guess that it would make nengo_spinnaker simpler. Which is fine by me!