According to Leerink et al. (1995), page 2, the initialization is important and should not be done with small values around 0.0, otherwise the local-minima problem gets worse. Let's see.
Python + numpy implementations of the product unit layer are done, but they are probably much slower than the equivalent "sum" layers. We might soon need SWIG here. Now writing a simple test that "learns" the multiplication of 2 numbers...
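For reference, the core of such a product unit is tiny; a minimal numpy sketch of the forward pass (illustrative names, not the actual layer class in the repo) could look like this:

```python
import numpy as np

def product_unit_forward(inputs, exponents, bias):
    """One product unit: prod_i(inputs[i] ** exponents[i]) + bias,
    i.e. a weighted product instead of the usual weighted sum."""
    return np.prod(inputs ** exponents) + bias

# With exponents [1, 1] and zero bias, the unit computes x * y.
print(product_unit_forward(np.array([3.0, 2.0]), np.array([1.0, 1.0]), 0.0))  # 6.0
```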
Nets can now include product layers; training remains unchanged so far. In the layer specification, [-3, 3] means that the first hidden layer is composed of 3 product units. Now working on that test...
Learning z = x * y with 1000 examples. The prediction error is shown as color.
with 2 layers of 5 conventional neurons (51 parameters), 30 seconds of training:
and with a single hidden product neuron, randomly initialized, 0.5 seconds of training:
Note the smaller color scale of the latter. Seems to work, and we can also confirm that sum-units are badly suited for multiplications.
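For completeness, the training set for this test is trivial to generate. A minimal sketch (not the actual demo file; the [-1, 1] input range is just an assumption here):

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
n = 1000
x = rng.uniform(-1.0, 1.0, n)
y = rng.uniform(-1.0, 1.0, n)
inputs = np.column_stack([x, y])    # shape (1000, 2)
targets = (x * y).reshape(-1, 1)    # z = x * y, shape (1000, 1)
```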
The "report" is also automatically adapted, after training it looks like
[2|1/*iden|1/iden=5]
Layer 'h0', mode mult, ni 2, nn 1, actfct iden:
output 0 = iden ( prod (input ** [ 0.99999963 0.99999962]) + 0.134301267119 )
Layer 'o', mode sum, ni 1, nn 1, actfct iden:
output 0 = iden ( input * [ 0.99999999] + -0.134301381374 )
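Reading the two learned layers off this report, one can check numerically that the network indeed reduces to z ≈ x * y, since the two bias terms essentially cancel (quick check with the numbers copied from the report; positive test inputs only, as the exponents are not exactly integer):

```python
import numpy as np

def learned_net(x, y):
    # hidden product unit 'h0' (values copied from the report above)
    h = np.prod(np.array([x, y]) ** np.array([0.99999963, 0.99999962])) + 0.134301267119
    # output sum unit 'o'
    return h * 0.99999999 + -0.134301381374

print(learned_net(2.0, 3.0))  # ~6.0: the two biases cancel almost exactly
```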
That does look nice!
I pushed the demo code, demo/test_mult_learn.py
One big issue to think about is the "normalization": the problems with negative values fed into "product units", and the choice of activation functions that get "fed" into later product-unit layers. It's a whole new world to explore, but I would love to not explore it for too long :) I guess it would be good to somehow keep the "signs" of the ellipticity components, i.e. use a "-11" norm, and find a hack so that it works with the product units. Using a "01" normer in front of a product unit would be a shame: we precisely want those units to be able to rescale numbers that might be positive or negative by multiplying their amplitude. Maybe some power-raising that artificially keeps the sign could work (and reinventing maths is fun).
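The sign-preserving power-raising could be as simple as an odd extension of the power function; just a sketch of the idea, not code that exists in the repo:

```python
import numpy as np

def signed_power(x, p):
    """Odd extension of x**p: raises |x| to the power p but keeps the sign of x,
    so it stays well defined for negative x and non-integer p."""
    return np.sign(x) * np.abs(x) ** p

print(signed_power(-0.5, 2.0))  # -0.25 (instead of +0.25, and no nan for p = 0.5)
```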
Routes I'm trying:
We want to be able to handle negative inputs, because the non-linear characteristics of product units, which we want to exploit computationally, are centered on the origin.
Approaches:
1) A bit of a hack: make the product units always return a value with the same sign as their first input, using np.sign. All the exponentiation and product stuff works on np.fabs(inputs). This makes their behavior "odd" around a first input of 0.0. But it's a bad idea, as outputs from all neurons will have the same sign (i.e., that of the first input). This direction can only be explored when mixing layers with sum and prod neurons (so that the sum neurons can carry around non-sign-polluted information). We need the possibility, for some neurons, to pass a simple identity to the next layer.
2) Using the product of the signs of all inputs is not a good idea, as useless noisy inputs (which would get zero power) would still mess up the sign. Somehow, only the signs of inputs with "significant weight" should matter. Maybe this can be coded; both approaches are sketched just below. [edit: did this! It will now need some "identity" initial settings.]
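A sketch of the two sign conventions, assuming the unit's exponents are available as a weight vector (the 0.1 threshold in approach 2 is an illustrative choice, not the value used in the code):

```python
import numpy as np

def product_core(inputs, exponents):
    # exponentiation and product act on absolute values only
    return np.prod(np.abs(inputs) ** exponents)

def approach_1(inputs, exponents):
    # output sign follows the sign of the first input
    return np.sign(inputs[0]) * product_core(inputs, exponents)

def approach_2(inputs, exponents, threshold=0.1):
    # only inputs with a "significant" exponent contribute their sign,
    # so near-zero-power (useless) inputs cannot flip the output;
    # if no exponent is significant, np.prod over the empty selection is 1.0
    significant = np.abs(exponents) > threshold
    return np.prod(np.sign(inputs[significant])) * product_core(inputs, exponents)
```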
Later: add a "sa1"-type to Normer (for "scale abs to 1"). What it does is to scale the data using only a positive multiplicative factor, so that the absolute value of the output is always smaller or equal to 1: x /= max(abs(x))
. This puts "g1" between -1 and 1, while avoiding to put the "mean rad" or "mean flux" close 0.0. In other words it preserves the sign. Aim is to avoid the need for a "sum"-layer (with bias offsets) in front of the "product"-layer.
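A minimal sketch of such an "sa1" normalization (the actual Normer interface may look different):

```python
import numpy as np

def sa1(data):
    """'Scale abs to 1': divide each feature column by its maximum absolute value.
    Purely multiplicative, so every value ends up in [-1, 1] and keeps its sign;
    nothing gets shifted towards 0.0."""
    scale = np.max(np.abs(data), axis=0)
    return data / scale, scale

features = np.array([[0.8, 120.0],
                     [-0.3, 450.0],
                     [0.1, 300.0]])
normed, scale = sa1(features)
# first column (g1-like) stays in [-1, 1] with its signs intact;
# second column (flux-like) is rescaled but not shifted
```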
Later: change the use of biases. Without an activation function, they are useless anyway [edit: not true, unless the network is deep; additive biases could be useful for small networks]. With an activation function, multiplicative biases might be way better than additive ones, as product units cannot use their weights to nicely scale the product within their activation function. Do not forget to update setidentity in this case!
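To make the multiplicative-bias idea concrete, the two options for a product unit with activation function act would be something like this (illustrative only):

```python
import numpy as np

act = np.tanh  # any activation function

def prod_unit_additive_bias(inputs, exponents, bias):
    # act(prod(x**w) + b): the bias shifts the product before the activation
    return act(np.prod(inputs ** exponents) + bias)

def prod_unit_multiplicative_bias(inputs, exponents, bias):
    # act(b * prod(x**w)): the bias rescales the product, which the exponents
    # alone cannot do independently of the amplitude of the inputs
    return act(bias * np.prod(inputs ** exponents))
```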
Todo in order to test point 2 above:
Todo next:
About the initial "identity" settings, there is another interesting aspect: so far, for a network with a single output, each layer was only set up to "transport" the first input, using its first neuron. All other neurons started from zero. It could be much more interesting for the i-th neuron of each layer to transport the i-th input. Given that we train layers starting from the last one, the last layer would then directly see "all" inputs. Of course, if you have more hidden nodes than inputs, you'll still have some "joker" neurons starting from zero. Will implement this in a configurable way, to leave the choice of "transporting" only the first n inputs or transporting all inputs as far as possible into the network.
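A sketch of such a configurable "transport" initialization for one layer, assuming weights of shape (n_neurons, n_inputs) and additive biases (the function name and the n_transport argument are illustrative, not the actual setidentity signature):

```python
import numpy as np

def set_identity_transport(weights, biases, n_transport=None):
    """Make neuron i of a layer 'transport' input i unchanged.

    weights: (n_neurons, n_inputs) array -- exponents for a product layer,
             ordinary weights for a sum layer.
    biases:  (n_neurons,) additive biases (a multiplicative bias would have
             to be set to 1.0 instead, cf. the note above).
    n_transport: how many inputs to transport; default is as many as possible.
    """
    n_neurons, n_inputs = weights.shape
    if n_transport is None:
        n_transport = min(n_neurons, n_inputs)
    weights[:] = 0.0
    biases[:] = 0.0
    for i in range(n_transport):
        weights[i, i] = 1.0  # exponent/weight 1 for input i, 0 for the rest
    # any remaining "joker" neurons keep all-zero weights and start from scratch
```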
About initial noise in the "weights" (== exponents) of mult-layers:
I'll experiment with restricting the multiplication layers to positive powers, i.e., avoiding divisions.
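Both of these last two points could be handled along these lines; the noise scale and the clipping are illustrative choices, not what the code currently does:

```python
import numpy as np

rng = np.random.RandomState(0)

def noisy_identity_exponents(exponents, scale=0.1):
    # small random perturbation around the identity-like starting exponents
    return exponents + rng.normal(0.0, scale, size=exponents.shape)

def clip_to_positive_powers(exponents):
    # forbid negative exponents, i.e. no divisions by inputs
    return np.clip(exponents, 0.0, None)
```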
Closing this, it works. Next upgrades will be done in a new issue and branch.
There is quite a bit of literature on multiplicative networks, and it sounds very good so far, nice! One paper I like a lot is: http://sci2s.ugr.es/keel/pdf/specific/articulo/Schmidtt%20on-the-complexity-of.pdf
Will work on this in a dedicated branch #14