nengo / nengo-1.4

Create and run detailed neural simulations
http://www.nengo.ai/nengo-1.4

GradientDescentApproximator doesn't take advantage of quick files #368

Open tcstewar opened 11 years ago

tcstewar commented 11 years ago

The quick/*.nef files store the inverted Gamma matrix, so it's quick to compute decoders for the origins (but the actual decoder values are not stored in the quick file -- it still has to do the inv(Gamma)*Upsilon step). But this only helps for origins computed with WeightedCostApproximator. GradientDescentApproximator can't take advantage of this, so it has to redo the optimization.

This can take a while, and is the majority of the time cost for making a basal ganglia. So even with quick files, we spend an extra 0.2 seconds per rule (on my machine) re-solving for these decoders every time we re-run a script.

I'm not sure what the right solution is, but it would be nice to have something that actually stores the decoders in the quick file, so it would work for any sort of approximator.
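
Roughly, the fast path the quick file enables looks like this (an illustrative NumPy sketch, not Nengo's actual code; A is the matrix of neuron activities at the evaluation points and targets holds the function values at those points):

import numpy as np

def decoders_from_cached_inverse(inv_gamma, A, targets):
    # Illustrative sketch only: with inv(Gamma) loaded from the quick file,
    # all that is left is forming Upsilon and doing one matrix product.
    upsilon = A @ targets        # Upsilon = A * f(X), cheap to recompute
    return inv_gamma @ upsilon   # decoders = inv(Gamma) * Upsilon

A gradient-descent solver has no equivalent reusable piece: nothing it computes plays the role of inv(Gamma), so it has to run its full iterative optimization on every re-run.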

codemercenary commented 11 years ago

Does this need to be done on disk? A quick solution might be to implement a cache of some kind that can store these results. You could attach the cache right to whatever class is computing the solution, so that the class can check the cache for a result before solving the inputs.

tcstewar commented 11 years ago

Hmm, good question. The quick files are a bit of an odd thing in Nengo -- the intent is purely to speed up re-creating a neural component that's identical to something we've already created in the past. We use this a lot when trying out building different models -- we have a script that creates a model, and then we make one change to the script and re-run it. We don't want to have Nengo re-do all the computation and optimization for the 99% of the model that hasn't changed.

So the idea is that whenever we create an ensemble, we can save a copy of it to disk with a unique filename that specifies all its parameters. These are put into the "quick" subdirectory. Importantly, this only happens if we're also specifying a random number seed, since we only want to use this file in the future if we're specifying that exact seed again. So if I do this:

import nef
net=nef.Network('Test', seed=1)
net.make('A', 100, 1)
net.make('B', 100, 1)
net.connect('A', 'B')
net.add_to_nengo()

Nengo will not only create the ensembles, but also save two files in the quick directory, one for each ensemble with its own seed (setting seed=1 on the Network tells the network to generate seeds for the sub-ensembles; see http://www.nengo.ca/docs/html/quick.html).

Now, when I re-run the script, instead of re-creating those ensembles from scratch, Nengo will just load the files from the quick directory instead. For a small model like this, there's no significant time savings, but for larger ensembles it's quite significant (try changing 100 to 1000 or 5000, for example).
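
In other words, the quick mechanism is just a load-or-build pattern keyed on the ensemble's parameters plus its seed. A rough sketch of the idea (hypothetical helper names, not the actual nef_core code):

import os
import pickle

QUICK_DIR = 'quick'

def load_or_make(key, build):
    # Hypothetical sketch: 'key' encodes all the ensemble parameters plus the
    # seed, and 'build' is a function that creates the ensemble from scratch.
    os.makedirs(QUICK_DIR, exist_ok=True)
    path = os.path.join(QUICK_DIR, key + '.nef')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)      # same parameters and seed: reuse the saved copy
    ensemble = build()                 # otherwise pay the full creation cost once
    with open(path, 'wb') as f:
        pickle.dump(ensemble, f)
    return ensemble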

The difficulty is that, right now, this saving is done when the make command is called, not when connect is called, and connect sometimes has to do a lot of computation. For example, if I do this:

import nef
net=nef.Network('Test', seed=1)
net.make('A', 100, 1)
net.make('B', 100, 1)
def square(x):
    return [x[0]*x[0]]
net.connect('A', 'B', func=square)
net.add_to_nengo()

the connect call has to do some optimization to figure out the decoder weights needed to get the neurons to approximate the square() function. Now, Nengo's standard way of doing this optimization is to use a WeightedCostApproximator, because it turns out that if we pre-compute the inverse Gamma matrix we can very quickly do that optimization for any function. So that gets computed right when we call make, and gets saved into the quick file.

But, that's not the only way to do the optimization. The GradientDescentApproximator is quite handy if we want to do things like impose constraints on the optimization, such as restricting all the connection weights to be positive (or negative). We use this right now in the basal ganglia model, by doing things like net.make('A', 100, 1, decoder_sign=1). This forces all future connect() calls from that ensemble to use a GradientDescentApproximator with only positive connections. But now the quick files don't help, and we need to re-do that exact same optimization every time we run the script.
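
For reference, the constrained solve itself is roughly a projected gradient descent (an illustrative NumPy sketch, not the actual GradientDescentApproximator; A is the activity matrix, targets the sampled function values, and decoder_sign=1 means clipping to non-negative values):

import numpy as np

def positive_decoders(A, targets, steps=2000, lr=1e-4):
    # Illustrative sketch of a sign-constrained least-squares solve:
    # gradient steps on the squared error, projecting the decoders back
    # onto the non-negative orthant after every step.
    d = np.zeros((A.shape[0], targets.shape[1]))
    for _ in range(steps):
        error = A.T @ d - targets    # error at the evaluation points
        d -= lr * (A @ error)        # gradient of the squared error
        d = np.clip(d, 0.0, None)    # enforce the positive-sign constraint
    return d

Every step depends on the particular targets and on the constraint, so there is no function-independent piece to precompute and stash in the quick file the way inv(Gamma) is.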

Interestingly, we know what approximator we're using when we call make(), so it might be possible to set something up right then for any future connect() calls, but I'm not sure what the right approach would be. I'm not sure what the trade-offs would be for an in-memory versus on-disk cache -- I do think we would still want an on-disk cache, to support re-running a script the next day, for example.

Overall, this is a pretty low-priority item, since, as far as I'm aware, the basal ganglia is the only model currently using the GradientDescentApproximator. (Admittedly, the basal ganglia is a key part of all of our large models.) That said, one of the big things we're finding is that our larger models are taking a long time to re-create, and we're not sure why. With things like the quick files, it should be pretty much instantaneous to re-run a script and have it just load up all the components. But it actually can take a long time, so I think it'd be useful to figure out where that time is being spent. I hadn't realised that the basal ganglia models had to do this recomputation -- I was quite surprised to find that this simple script takes 10 seconds to re-run:

N=50
D=50

import nef

net=nef.Network('func test', fixed_seed=1)
net.make_array('A', N, D, decoder_sign=1)
net.make_array('B', N, D)
def function(x):
    return [-x[0]]
net.connect('A','B',func=function)

and the time needed increases linearly with D. If we were doing it right, D shouldn't matter at all, since every single one of those connections between those arrays is identical to the previous one, so it should just have to do the optimization once.

(Note that the above example is another use case for quick files: 'A' is an array of 50 ensembles, each of which has 50 neurons in it, and since there's a fixed_seed, each of those ensembles has the exact same seed, so consists of neurons with identical properties. So Nengo should only need to do the heavy computation for one of them, and the rest are just copies. This works in the current system because make_array() just calls make() a whole bunch of times, so it makes use of the quick file.)

codemercenary commented 11 years ago

An on-disk cache seems like the smart thing to do, too; an in-memory cache wouldn't survive between script runs, so it's unlikely to save much time. Using the file system as a database may cause problems down the road, but because each cache item is referred to by exactly one key (the type and parameterization of the original model) and the keys are reasonably short, it doesn't seem to be an issue for now.

A few related questions, then:

1) Is it correct to say that the size of this cache is related to the number of ensembles in the network?
2) Is it possible that the keys used to index the cache could become too large for the file system to store?
3) Is it possible for a single cache entry to be referred to by multiple keys?

If we're going to fix this piece, we might consider adding other fixes that may prevent problems in the future.

tcstewar commented 11 years ago

1) Most of the time, yes, there's going to be one file in the cache for each ensemble in the network. Doing things like setting a fixed_seed (the same seed for every ensemble) will reduce that, since there may then be many ensembles in the network with the same parameters, but we generally only ever do that when testing. We don't want to use fixed_seed in a real model, since having identical neurons in different ensembles is completely biologically implausible.

However, right now, we only have one cache for all of Nengo, so that same cache gets used if I suddenly start working on a different model. That does mean that the quick directory starts getting very large (right now mine has 2,000 files using 2.3GB), and now and then we just manually delete it. That's also part of why we don't use it when creating things in the Nengo GUI -- we don't want novice users to have to worry about their hard drive slowly filling up.
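
One stopgap for that growth would be to prune the directory by age instead of wiping it by hand -- a hypothetical helper, not something in the codebase:

import os

def prune_quick_dir(path='quick', max_bytes=1024**3):
    # Hypothetical sketch: delete the least-recently-touched quick files
    # until the directory is back under max_bytes.
    files = [os.path.join(path, name) for name in os.listdir(path)]
    files = [f for f in files if os.path.isfile(f)]
    files.sort(key=os.path.getatime)               # oldest access time first
    total = sum(os.path.getsize(f) for f in files)
    for f in files:
        if total <= max_bytes:
            break
        total -= os.path.getsize(f)
        os.remove(f)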

2) Hmm... hadn't thought of that. I think right now, no. The keys are computed in python/nef_core.py with this code:

            storage_name='quick_%s_%d_%d_%1.3f_%1.3f'%(storage_code,neurons,dimensions,tau_rc,tau_ref)
            if type(max_rate) is tuple and len(max_rate)==2:
                storage_name+='_%1.1f_%1.1f'%max_rate
            else:
                storage_name+='_%08x'%hash(tuple(max_rate))
            if type(intercept) is tuple and len(intercept)==2:
                storage_name+='_%1.3f_%1.3f'%intercept
            else:
                storage_name+='_%08x'%hash(tuple(intercept))
            if isinstance(radius,list):
                storage_name+='_(%s)_%1.3f'%(''.join(['%1.3f'%x for x in radius]),decoder_noise)
            else:
                storage_name+='_%1.3f_%1.3f'%(radius,decoder_noise)
            if encoders is not None:
                storage_name+='_enc%08x'%hash(tuple([tuple(x) for x in encoders]))
            if decoder_sign is not None:
                storage_name+='_sign%d'%decoder_sign
            if eval_points is not None:
                storage_name+='_eval%08x'%hash(tuple([tuple(x) for x in eval_points]))
            if node_factory is not None:
                storage_name+='_node%s'%node_factory.__class__.__name__                
            if seed is not None:
                storage_name+='_seed%08x'%seed

I guess there is a possibility that, if someone has Nengo in some deep directory, we might hit the Windows 260-character path limit (I can never remember which situations that's a problem for), but right now it'd be hard to get above 100 characters in the name. I could imagine adding parameters to this list, though, as we get more complex neuron models.

One advantage of not using a file name, though, would be that we wouldn't have to do the hashing that's going on for the intercept, max_rate, encoders, and eval_points parameters. Basically, if you manually specify these as a particular value for each neuron, then instead of putting all that data into the filename we compute a hash of the data and put that in the filename, and assume that there won't be any hash collisions -- which could definitely be a problem (that said, these are all rare things to do).

Oh, I guess another problem, looking at this, is that the degree of precision of some of those parameters is pretty arbitrary... with the current system we'd mistake max_rate=200 for max_rate=200.001 -- they'd have the same key, even though they're technically different models.
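
Concretely, with the '%1.1f' formatting used for max_rate above:

print('%1.1f_%1.1f' % (200, 400))      # -> 200.0_400.0
print('%1.1f_%1.1f' % (200.001, 400))  # -> 200.0_400.0, same key fragment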

3) I don't think so, other than the hash collision problem noted above.

Hmm... the more I look at this, the more I think you're right that it'd be a good idea to see if we can future-proof this whole system a bit. This is a general problem for our other implementations as well (the C/GPU implementation, the FPGA implementation, SpiNNaker, and so on). We do a lot of optimization when we create the networks, and it'd be nice to save as much of that as possible, but we don't want the cache to cause problems, and there are cases where it might.

One situation that's come up in other implementations (not Nengo, but an old Python implementation I fiddled with a few years ago) is function definitions. For example, consider this model:

import nef

def function(x):
    return [x[0]**3]

net=nef.Network('Changing functions')
net.make_input('input',[0])
net.make('A', 100, 1)
net.make('B', 100, 1)    
net.connect('input','A')
net.connect('A','B',func=function)

net.add_to_nengo()

If I change the function and re-run the script, Nengo needs to re-compute the connection:

def function(x):
    return [x[0]**4]

Right now this works great in Nengo, since the majority of the optimization time is spent creating the inverted Gamma matrix (which doesn't depend on the function). When the script is run, Nengo takes the given function and generates the Upsilon matrix (by calling the function a few hundred times at different x values), which is pretty quick, but does take some time for larger networks.

But, in the old Python system I was playing with, I wanted to cache the decoder value itself, so no re-computation would be needed at all. For this to work, I needed to be able to detect when the function itself changes. This is a pretty hard problem, and I don't think there's a correct solution to it, but I got pretty close by doing some introspection and making a hash based on the code string for the function, combined with the set of memory objects referenced by the code. So that way I could detect this change:

power=3   # change this to 4, and the function should be re-computed

def function(x):
    return [x[0]**power]

but it'd still be difficult to detect situations where the behaviour of the function is dependent on the contents of some other file, for example.
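
For what it's worth, the introspection was along these lines (a simplified reconstruction of the idea, not the original code): hash the function's compiled byte code together with the current values of the globals it references.

import hashlib

def function_fingerprint(func):
    # Simplified sketch: combine the byte code with the repr of any globals
    # the code refers to, so changing either one changes the fingerprint.
    code = func.__code__
    parts = [code.co_code, repr(code.co_consts).encode()]
    for name in code.co_names:                 # global names referenced by the code
        parts.append(repr(func.__globals__.get(name)).encode())
    return hashlib.sha1(b''.join(parts)).hexdigest()

power = 3    # changing this (or the function body) changes the fingerprint

def function(x):
    return [x[0]**power]

print(function_fingerprint(function))

Anything outside the code and its globals (the contents of another file, the system clock, and so on) still slips through, which is exactly the limitation just noted.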

So, what that means is that the cache key really shouldn't be dependent on the function itself -- rather, it should be dependent on the set of data produced by sampling that function. All of the optimization methods (I think) work by taking a bunch of random x values and calling the function for each one, and making a big matrix of the results (which we call Upsilon). This is only 500 to 5000 x values, so it's not too bad to be generating each time, but it would be very nice to not have to do that.

codemercenary commented 11 years ago

We should probably consider revamping the whole caching subsystem to make use of some kind of key/value store rather than trying to use the file system directly. Infinispan seems like a good first choice: it's written in Java and designed to be very fast. It also supports an LRU eviction strategy, so we don't have to worry about a growth-without-bounds situation.
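
To make the policy concrete, the shape of a bounded LRU store is something like this (a plain-Python sketch, not Infinispan itself):

from collections import OrderedDict

class LRUCache:
    # Minimal bounded key/value store: once max_entries is exceeded, the
    # least-recently-used entry is evicted, so the cache can't grow forever.
    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least-recently-used entry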

Predicting whether two invocations of a stateful function will be identical is at least as computationally complex as invoking those functions yourself, so there's probably nothing we can do to help that situation. Another problem is that Python has terrible referential transparency, so even determining what might be part of a function's state is likely to be a very hard problem. Case in point: consider a function that makes heavy use of getattr, and try to figure out statically what variables that function references. Yuck.

Also, here's another proposal: suppose that, instead of caching upsilon, you compute a much larger set called super-upsilon and cache that. Then, if upsilon requires n members, you take (p-1)n/p of them from super-upsilon and sample the remaining n/p fresh from the function, adding those back into super-upsilon. 1/p is therefore the fraction of samples that still has to be computed on each run, allowing the user to tune "p" according to how much randomness they want.

codemercenary commented 11 years ago

I mean, using the filesystem directly is going to become a problem for a lot of reasons--the pathname length is one example where you're definitely going to see issues, but also the number of files you can usefully keep in a single directory is practically bounded to a few thousand on most systems, and a filesystem gives you basically zero distribution capability in clustered settings. What would be really awesome is if precomputed results could be shared in a clustered environment; depending on the cache hit rate, the result could be a linear performance improvement in the number of simultaneous compute nodes over an unshared caching policy.

tcstewar commented 11 years ago

I like where you're going with this. With a decent cache system like this (and Infinispan looks great), the cache could easily be spread across the machines in a lab, so that everyone gets the benefit of it. It's not all that important for the smaller ensembles (which take milliseconds to recompute), but the larger ensembles (5000+ neurons) do take tens of minutes to compute (the main computation is the inversion of an NxN matrix -- we do have people looking at faster ways of doing that, but all of these algorithms take more and more time as N increases, so a caching system will always be useful).

I would want to make sure that setting up Infinispan isn't required to run Nengo, since we want people to be able to easily install it. But I think it'd be quite handy to have as an option as people work on more complex models.

My initial vision is that it'd have a slightly different interface than our current caching. Instead of the key being a string built out of the neural parameters plus a seed and the returned value being a serialized Java NEFEnsembleImpl object, the key would be a bunch of neural parameters, a seed, and an array of returned function values (Upsilon), and the returned value would be the decoder itself (an NxD matrix of floats). Then we would query that cache whenever we try to create a DecodedOrigin. If it's not in the cache, we compute it ourselves, and then add the resulting decoder into the cache.
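
In pseudocode, that lookup would be something like this (hypothetical names, with an ordinary dict standing in for whatever store ends up being used):

import hashlib
import numpy as np

decoder_cache = {}   # stands in for Infinispan / an on-disk store in this sketch

def cache_key(neuron_params, seed, upsilon):
    # Key on the neural parameters, the seed, and the sampled function values
    # themselves, so the function's source code never enters into it.
    h = hashlib.sha1(repr(sorted(neuron_params.items())).encode())
    h.update(str(seed).encode())
    h.update(np.asarray(upsilon).tobytes())
    return h.hexdigest()

def get_decoders(neuron_params, seed, upsilon, solve):
    # Return the cached NxD decoder matrix if this exact problem has been
    # solved before; otherwise solve it (the expensive step) and remember it.
    key = cache_key(neuron_params, seed, upsilon)
    if key not in decoder_cache:
        decoder_cache[key] = solve(upsilon)
    return decoder_cache[key]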

I think this is the most generic way to do it. We might have a special case for the identity function, since we make that all the time (where upsilon is exactly the same as the set of points being sampled). I'm not sure about the super-upsilon approach you mentioned above, though -- there shouldn't be any randomness at all, since the sample points themselves are based on the seed.

One other interesting side effect of this sort of cache is that we could use the same cache for the Java version as for a Python or FPGA or whatever other implementation...

codemercenary commented 11 years ago

I completely agree that we should not require the user to configure a database before using Nengo. Adding configuration overhead is sure to just cause frustration; therefore, here's what I propose regarding Infinispan: we make use of an embedded Infinispan instance. Nengo will be in charge of configuring it, and if there is no external configuration file specified by the user, Infinispan runs in a standalone setting. If there is an external configuration file, that file can allow Infinispan to join a cluster for computation sharing. That's one of the joys of using a pure Java implementation: embedding is far simpler.

Your caching policy seems sound, especially if we can guarantee that every function is deterministic (a pure function, to use the usual term): calling it with the same parameters always returns the same value. If users can specify PRNGs as functions then this assumption might be broken.

The reason I suggested super-upsilon is that I assumed your strategy came from the need to approximate the function by a series of samples, and that those samples were chosen pseudorandomly on each run. If the sampling is deterministic, then you're right: there is no need to sample a superset.