Loihi is organized into cores with a fixed number of compartments on each, and since the start we've required users to manually break their model into Ensembles that will each fit on one core.
This PR automates splitting larger ensembles to fit across multiple cores. This allows users to create the model in terms of structures that work well conceptually, and worry less about how that is going to map to Loihi.
There are two ways of using this functionality. One is to let nengo-loihi figure things out itself, in which case it simply splits large ensembles sequentially (putting the first N neurons on one core, the next N on the next core, etc.). This works well for NEF or other fully-connected Ensembles that have a fairly uniform structure in terms of input and output connections.
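The sequential strategy can be sketched as follows. This is illustrative only; the actual nengo-loihi implementation works on model objects rather than index slices, and `split_sequentially` is a hypothetical helper:

```python
def split_sequentially(n_neurons, max_compartments):
    """Split a flat range of neurons into consecutive per-core slices.

    Hypothetical helper illustrating the sequential strategy described
    above; not the actual nengo-loihi implementation.
    """
    return [
        slice(i, min(i + max_compartments, n_neurons))
        for i in range(0, n_neurons, max_compartments)
    ]

# 120 neurons with at most 50 compartments per core -> 3 cores
print(split_sequentially(120, 50))
```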
The second way to use block splitting is to provide instructions on how you (as a user) want an ensemble to be split. This is done via the `full_shape` and `block_shape` config options.
```python
with nengo.Network() as net:
    a = nengo.Ensemble(120, 1)
    net.config[a].full_shape = (6, 5, 4)
    net.config[a].block_shape = (3, 2, 3)
```
The `block_shape` specifies the shape that a single block (i.e. one core) will represent. The maximum number of compartments on that core is the product of the elements of that shape. We then tile that shape to fill the full shape. So in the above example, we'll have 2 cores in the first dimension (since `6 \ 3 = 2`, where `\` represents ceiling division, `ceil(a / b)`), 3 cores in the second dimension (`5 \ 2 = 3`), and 2 cores in the third (`4 \ 3 = 2`). The total number of cores is `2 * 3 * 2 = 12`, and the layout of the cores is `(2, 3, 2)`.

We then "rebalance" the `block_shape` so that it is as uniform as possible across cores, given this layout, by taking the ceiling division of each element of the full shape by the corresponding number of cores in that dimension: `(6, 5, 4) \ (2, 3, 2) = (3, 2, 2)`. This is close to the original block shape, but with the last dimension being 2 instead of 3.

Here is what happened. In the first dimension, 2 cores of length 3 go evenly into 6, so everything is already balanced there. In the second dimension, we'll have 2 cores of length 2 and one of length 1 to make up 5; this isn't perfectly balanced, but there's no way to shorten one core and lengthen another to do better. In the last dimension, however, the original block shape would give one core of length 3 and one of length 1 to make up the 4. Two cores of length 2 are more balanced (and don't use any more cores), so we use that instead.
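The layout and rebalancing computation above can be expressed in a few lines. This is a sketch of the rule as described, not the actual nengo-loihi code, and `rebalance` is an illustrative name:

```python
import math

def rebalance(full_shape, block_shape):
    """Compute the core layout and the rebalanced block shape.

    layout[i] is the number of cores along dimension i (ceiling
    division of full by block); the new block shape is the ceiling
    division of full by layout, which evens out core sizes.
    """
    layout = tuple(math.ceil(f / b) for f, b in zip(full_shape, block_shape))
    new_block = tuple(math.ceil(f / c) for f, c in zip(full_shape, layout))
    return layout, new_block

layout, block = rebalance((6, 5, 4), (3, 2, 3))
print(layout)  # (2, 3, 2), i.e. 12 cores
print(block)   # (3, 2, 2)
```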
This PR also adds a number of features that we needed for a recent project.
"Make transform builder pluggable": This allows users to write custom builders for their own transforms. At this point, they still have to rewrite the whole connection builder function (though of course they can use one of ours as a template), just as we have a completely different function for building connections with Convolutional transforms vs. those with Dense/Sparse transforms.
"Move model validation to simulator.py": This allows validation to be turned off if desired, and follows naturally from block splitting. By default, large ensembles will be split and the whole model will be validated. However, this may be detrimental to performance in the emulator, where large blocks do not need to be split. For that to work, validation must be turned off, so that the large blocks do not result in an error.
"Added GreedyChip allocator": This is our second multi-chip allocator. The first---the RoundRobin allocator---alternates between chips when placing blocks, which is great for testing, but results in extra inter-chip communication for models where nearby blocks are more likely to be connected or to get input from a common source (which is typical of most models, particularly when we introduce block splitting). The GreedyChip allocator fills one chip first, before moving to the next chip. It allows a maximum number of cores per chip to be specified, so chips do not need to be fully utilized.
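The greedy strategy can be sketched in one line. This is a simplification for illustration (the real GreedyChip allocator works on Board/Chip/Core objects, not integer indices), with `greedy_allocate` as a hypothetical name:

```python
def greedy_allocate(n_blocks, cores_per_chip):
    """Assign each block index to a chip index, filling each chip
    completely before moving on to the next one."""
    return [i // cores_per_chip for i in range(n_blocks)]

# 10 blocks with 4 cores per chip fill chips 0 and 1, then spill to 2
print(greedy_allocate(10, 4))
```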
"Nengo IO SNIP uses spike packing": Currently, when probing spikes in the IO SNIP, we get voltages for all neurons, and then check on the superhost if this is equal to a magic number to determine if a neuron has spiked. This moves that check to the IO SNIP, so that rather than transmitting back a 32-bit voltage for each neuron, we just send a 1-bit spike, resulting in reduced memory usage in the IO SNIP and reduced data transfer back to the superhost.
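The size savings can be illustrated with a host-side sketch of bit-packing (the real packing happens in the SNIP's C code on-chip; `pack_spikes` is a hypothetical helper):

```python
import numpy as np

def pack_spikes(spiked):
    """Pack a boolean spike vector into bytes, one bit per neuron,
    instead of transmitting a 32-bit voltage per neuron."""
    return np.packbits(np.asarray(spiked, dtype=np.uint8))

spikes = np.zeros(64, dtype=bool)
spikes[[0, 3, 63]] = True
packed = pack_spikes(spikes)
print(packed.nbytes)  # 8 bytes, vs. 64 * 4 = 256 bytes of voltages
```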
"Allow initial synapse index bits to exceed max": We were checking whether the synapse index bits exceeded the maximum when we created the synapse, which is problematic for creating large ensembles that later get split. Now, we put in a placeholder of `-1` if we exceed the max, and check that these are no longer present when we do validation (indicating large blocks have been properly split). This commit could be merged into one of the block splitting commits.
"Better reporting of board connection errors": We currently log board connection errors, but if logging is off (default), the user gets no feedback about why we couldn't connect to the board. So make this part of the exception we send to them. Also, only try to connect 3 times by default instead of 10, since often connection fails because of a problem with the model, which won't get fixed by repeated attempts.
"Any exception fails TensorFlow and NengoDL import": Importing versions of these that we don't support can sometimes result in errors other than ImportErrors (like AttributeErrors). So if we get any error when trying to import, just mark it as not available.
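The defensive import pattern looks roughly like this (a sketch; the variable names are illustrative, not the actual nengo-loihi module attributes):

```python
# Catch any exception, not just ImportError, since unsupported
# versions can raise e.g. AttributeError during import.
try:
    import tensorflow as tf
    HAS_TF = True
except Exception:
    tf = None
    HAS_TF = False
```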
"Represent inputs at board level instead of core": Inputs can project to multiple cores, even across multiple chips. Rather than storing them at the `Core` level in our model, we now store them at the `Board` level. This makes the allocators cleaner.
Based on #261.
TODO:
- [ ] What happens if `np.prod(full_shape) != ensemble.n_neurons`?