htm-community / comportex

Hierarchical Temporal Memory in Clojure
153 stars 27 forks source link

lazy-growing synapse graph #21

Open floybix opened 9 years ago

floybix commented 9 years ago

HTM model creation can be extremely slow. The time goes into creating the huge proximal synapse graphs containing all potential connections.

The problem of explicitly representing full potential synapse graphs is more acute in higher level layers because their input -- from cell layers -- is extremely sparse: column activation of 2% with depth 20 = 0.1% (except when bursting). With such sparsity, each column needs a lot of synapses in order to reach a reasonable stimulus threshold: to reach 5 active synapses, an average of 5000 random synapse connections are needed.

This is mitigated to some extent by the learning mechanism which can grow additional synapses directly to the active inputs (same mechanism as on distal dendrite segments), but we still need a reasonable degree of initial connectivity to activate columns in the first place.

First proposal

Lazy creation would only happen while previously unseen input bits continued to appear. But random growth and death of synapses could also continue indefinitely (either eagerly or lazily), giving a boosting effect.

cogmission commented 9 years ago

I know Jeff once described it as having the consistency of tapioca, but are there any papers which describe biologically what happens with interregional communication that could perhaps provide a hint?

Sent from my iPhone

On Jun 26, 2015, at 11:42 PM, Felix Andrews notifications@github.com wrote:

HTM model creation can be extremely slow. The time goes into creating the huge proximal synapse graphs containing all potential connections.

The problem of explicitly representing full potential synapse graphs is more acute in higher level layers because their input -- from cell layers -- is extremely sparse: column activation of 2% with depth 20 = 0.1% (except when bursting). With such sparsity, each column needs a lot of synapses in order to reach a reasonable stimulus threshold: to reach 5 active synapses, an average of 5000 random synapse connections are needed.

This is mitigated to some extent by the learning mechanism which can grow additional synapses directly to the active inputs (same mechanism as on distal dendrite segments), but we still need a reasonable degree of initial connectivity to activate columns in the first place.

First proposal

Lazy creation of the proximal synapse graph: synapses are only created upon the first activation of each source bit. That would be equivalent to the current behaviour except that lazy synapses would not be decremented until they come into existence. We could bias the new synapses towards neglected columns, achieving boosting and also partially adjusting for the above point. Second proposal

Lazy creation would only happen while previously unseen input bits continued to appear. But random growth and death of synapses could also continue indefinitely (either eagerly or lazily), giving a boosting effect.

— Reply to this email directly or view it on GitHub.

floybix commented 8 years ago

Boosting causes representations to be unstable, and to the extent they are unstable they are meaningless. I usually turn it off. I wonder if instead we could use the mechanism that we have for distal synapses (selecting winner cells in a column), but applied to proximal synapses (selecting columns):

mrcslws commented 8 years ago

Thinking of a tall hierarchy, it's interesting to think about how this would change things. Starting with an untrained model, the first region would start activating. Then the second. Then the third. And so on.

Currently, with random initial connections, the entire hierarchy might light up on the first input. Every region will do proximal/distal/apical learning right from the start, shaping a pile of random connections into something meaningful. With this new approach, it'd be more of a blank slate.

floybix commented 8 years ago

Thinking of a tall hierarchy, it's interesting to think about how this would change things. Starting with an untrained model, the first region would start activating. Then the second. Then the third. And so on.

Actually that's not obvious to me. I thought all layers should activate cells even if they don't have pre-existing proximal synapses - that is, even if those columns/cells are chosen randomly (and will then grow new synapses).

Maybe you mean, should we grow proximal synapses to bursting cells, or only to predicted cells? I'm leaning toward the former, given that first-level layers do not have predicted input (only sense input), but they still grow proximal synapses. The learning rate to predicted cells could be higher though.

On the other hand, it may not make much sense to learn a bursting signal since once the stimulus is learned/predicted in a lower layer it will have a different representation. But I think that could be OK if we have a low/slow learning rate. This paper (via Joseph Rocca) describes cortex as slowly learning to capture statistical properties of the world, in contrast with, and complementing, Hippocampus learning much faster: http://psych-www.colorado.edu/~oreilly/papers/OReillyRudy00_hippo.pdf

floybix commented 8 years ago

My experiments so far have shown that it is fatal to grow new proximal synapses directly to active sources. It results in column sets taking over -- masking -- multiple inputs particularly if there are subset / overlap relationships between inputs. I guess a solution would be to enforce a unique sub-sampling of "potential synapses" on each column; i.e. some sort of local topographic radius, even if the inputs are not meaningfully topographic: even if the inputs bits are in fact randomly shuffled.

Here's a completely different approach to the problem of boosting / decorrelating representations. Leabra's XCAL BCM rule is based on comparing the short term and long term average activations to apply a homeostatic stabilisation: https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Learning/Leabra

the BCM contrast or normalization is all about the receiver long-term average activity y_l, with the sending activity serving as the "conditioning" variable -- you only update the weights if the sending unit is active, and conditioned on that, compare the current receiver activity relative to the long-term average.