According to Leerink et al. (1995), page 2, the initialization is important and should not be done with small values around 0.0, otherwise the local-minima problem gets worse. Let's see.
Python + numpy implementations of the product unit layer are done, but they are probably much slower than the equivalent "sum" layers. We might soon need SWIG here. Now writing a simple test that "learns" the multiplication of 2 numbers...
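For reference, the core of such a product unit is tiny; a minimal numpy sketch of the forward pass (illustrative names, not the actual layer class in the repo) could look like this:

```python
import numpy as np

def product_unit_forward(inputs, exponents, bias):
    """One product unit: prod_i(inputs[i] ** exponents[i]) + bias,
    i.e. a weighted product instead of the usual weighted sum."""
    return np.prod(inputs ** exponents) + bias

# With exponents [1, 1] and zero bias, the unit computes x * y.
print(product_unit_forward(np.array([3.0, 2.0]), np.array([1.0, 1.0]), 0.0))  # 6.0
```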
Nets can now include product layers; training remains unchanged so far. In the layer specification, [-3, 3] means that the first hidden layer is composed of 3 product units. Now working on that test...
Learning z = x * y with 1000 examples. The prediction error is shown as color.
with 2 layers of 5 conventional neurons (51 parameters), 30 seconds of training:
and with a single hidden product neuron, randomly initialized, 0.5 seconds of training:
Note the smaller color scale of the latter. Seems to work, and we can also confirm that sum-units are badly suited for multiplications.
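For completeness, the training set for this test is trivial to generate. A minimal sketch (not the actual demo file; the [-1, 1] input range is just an assumption here):

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
n = 1000
x = rng.uniform(-1.0, 1.0, n)
y = rng.uniform(-1.0, 1.0, n)
inputs = np.column_stack([x, y])    # shape (1000, 2)
targets = (x * y).reshape(-1, 1)    # z = x * y, shape (1000, 1)
```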
The "report" is also automatically adapted, after training it looks like
[2|1/*iden|1/iden=5]
Layer 'h0', mode mult, ni 2, nn 1, actfct iden:
output 0 = iden ( prod (input ** [ 0.99999963 0.99999962]) + 0.134301267119 )
Layer 'o', mode sum, ni 1, nn 1, actfct iden:
output 0 = iden ( input * [ 0.99999999] + -0.134301381374 )
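Reading the two learned layers off this report, one can check numerically that the network indeed reduces to z ≈ x * y, since the two bias terms essentially cancel (quick check with the numbers copied from the report; positive test inputs only, as the exponents are not exactly integer):

```python
import numpy as np

def learned_net(x, y):
    # hidden product unit 'h0' (values copied from the report above)
    h = np.prod(np.array([x, y]) ** np.array([0.99999963, 0.99999962])) + 0.134301267119
    # output sum unit 'o'
    return h * 0.99999999 + -0.134301381374

print(learned_net(2.0, 3.0))  # ~6.0: the two biases cancel almost exactly
```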
That does look nice!
I pushed the demo code, demo/test_mult_learn.py
One big issue to think about is the "normalization": the problems with negative values fed into "product units", and the choice of activation functions that get "fed" into later product-unit layers. It's a whole new world to explore, but I would love to not explore it for too long :) I guess it would be good to somehow keep the "signs" of the ellipticity components, i.e. use a "-11" norm, and find a hack so that it works with the product units. Using a "01" normer in front of a product unit would be a shame: we precisely want those units to be able to rescale numbers that might be positive or negative by multiplying their amplitude. Maybe some power-raising that artificially keeps the sign could work (and reinventing maths is fun).
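The sign-preserving power-raising could be as simple as an odd extension of the power function; just a sketch of the idea, not code that exists in the repo:

```python
import numpy as np

def signed_power(x, p):
    """Odd extension of x**p: raises |x| to the power p but keeps the sign of x,
    so it stays well defined for negative x and non-integer p."""
    return np.sign(x) * np.abs(x) ** p

print(signed_power(-0.5, 2.0))  # -0.25 (instead of +0.25, and no nan for p = 0.5)
```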
Routes I'm trying:
We want to be able to handle negative inputs, because the non-linear characteristics of product units, which we want to exploit computationally, are centered on the origin.
Approaches:
1) A bit of a hack: make the product units always return a value with the same sign as their first input, using np.sign. All the exponentiation and product stuff works on np.fabs(inputs). This makes their behavior "odd" around a first input of 0.0. But it's a bad idea, as outputs from all neurons will have the same sign (i.e., that of the first input). This direction can only be explored when mixing layers with sum and prod neurons (so that the sum neurons can carry around non-sign-polluted information). We need the possibility, for some neurons, to pass a simple identity to the next layer.
2) Using the product of the signs of all inputs is not a good idea, as useless noisy inputs (which would get zero power) would still mess up the sign. Somehow, only the signs of inputs with "significant weight" should matter. Maybe this can be coded; both approaches are sketched just below. [edit: did this! It will now need some "identity" initial settings.]
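A sketch of the two sign conventions, assuming the unit's exponents are available as a weight vector (the 0.1 threshold in approach 2 is an illustrative choice, not the value used in the code):

```python
import numpy as np

def product_core(inputs, exponents):
    # exponentiation and product act on absolute values only
    return np.prod(np.abs(inputs) ** exponents)

def approach_1(inputs, exponents):
    # output sign follows the sign of the first input
    return np.sign(inputs[0]) * product_core(inputs, exponents)

def approach_2(inputs, exponents, threshold=0.1):
    # only inputs with a "significant" exponent contribute their sign,
    # so near-zero-power (useless) inputs cannot flip the output;
    # if no exponent is significant, np.prod over the empty selection is 1.0
    significant = np.abs(exponents) > threshold
    return np.prod(np.sign(inputs[significant])) * product_core(inputs, exponents)
```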
Later: add a "sa1"-type to Normer (for "scale abs to 1"). What it does is to scale the data using only a positive multiplicative factor, so that the absolute value of the output is always smaller or equal to 1: x /= max(abs(x))
. This puts "g1" between -1 and 1, while avoiding to put the "mean rad" or "mean flux" close 0.0. In other words it preserves the sign. Aim is to avoid the need for a "sum"-layer (with bias offsets) in front of the "product"-layer.
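A minimal sketch of such an "sa1" normalization (the actual Normer interface may look different):

```python
import numpy as np

def sa1(data):
    """'Scale abs to 1': divide each feature column by its maximum absolute value.
    Purely multiplicative, so every value ends up in [-1, 1] and keeps its sign;
    nothing gets shifted towards 0.0."""
    scale = np.max(np.abs(data), axis=0)
    return data / scale, scale

features = np.array([[0.8, 120.0],
                     [-0.3, 450.0],
                     [0.1, 300.0]])
normed, scale = sa1(features)
# first column (g1-like) stays in [-1, 1] with its signs intact;
# second column (flux-like) is rescaled but not shifted
```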
Later: change the use of biases. Without an activation function, they are useless anyway [edit: not true, unless the network is deep; additive biases could be useful for small networks]. With an activation function, multiplicative biases might be way better than additive ones, as product units cannot use their weights to nicely scale the product within their activation function. Do not forget to update setidentity in this case!
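To make the multiplicative-bias idea concrete, the two options for a product unit with activation function act would be something like this (illustrative only):

```python
import numpy as np

act = np.tanh  # any activation function

def prod_unit_additive_bias(inputs, exponents, bias):
    # act(prod(x**w) + b): the bias shifts the product before the activation
    return act(np.prod(inputs ** exponents) + bias)

def prod_unit_multiplicative_bias(inputs, exponents, bias):
    # act(b * prod(x**w)): the bias rescales the product, which the exponents
    # alone cannot do independently of the amplitude of the inputs
    return act(bias * np.prod(inputs ** exponents))
```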
Todo in order to test point 2 above:
Todo next:
About the initial "identity" settings, there is another interesting aspect: so far, for a network with a single output, each layer was only set up to "transport" the first input, using its first neuron. All other neurons started from zero. It could be much more interesting for the i-th neuron of each layer to transport the i-th input. Given that we train layers starting from the last one, the last layer would then directly see "all" inputs. Of course, if you have more hidden nodes than inputs, you'll still have some "joker" neurons starting from zero. Will implement this in a configurable way, to leave the choice of "transporting" only the first n inputs or transporting all inputs as far as possible into the network.
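A sketch of such a configurable "transport" initialization for one layer, assuming weights of shape (n_neurons, n_inputs) and additive biases (the function name and the n_transport argument are illustrative, not the actual setidentity signature):

```python
import numpy as np

def set_identity_transport(weights, biases, n_transport=None):
    """Make neuron i of a layer 'transport' input i unchanged.

    weights: (n_neurons, n_inputs) array -- exponents for a product layer,
             ordinary weights for a sum layer.
    biases:  (n_neurons,) additive biases (a multiplicative bias would have
             to be set to 1.0 instead, cf. the note above).
    n_transport: how many inputs to transport; default is as many as possible.
    """
    n_neurons, n_inputs = weights.shape
    if n_transport is None:
        n_transport = min(n_neurons, n_inputs)
    weights[:] = 0.0
    biases[:] = 0.0
    for i in range(n_transport):
        weights[i, i] = 1.0  # exponent/weight 1 for input i, 0 for the rest
    # any remaining "joker" neurons keep all-zero weights and start from scratch
```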
About initial noise in the "weights" (== exponents) of mult-layers:
I'll experiment with restricting the multiplication layers to positive powers, i.e., avoiding divisions.
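Both of these last two points could be handled along these lines; the noise scale and the clipping are illustrative choices, not what the code currently does:

```python
import numpy as np

rng = np.random.RandomState(0)

def noisy_identity_exponents(exponents, scale=0.1):
    # small random perturbation around the identity-like starting exponents
    return exponents + rng.normal(0.0, scale, size=exponents.shape)

def clip_to_positive_powers(exponents):
    # forbid negative exponents, i.e. no divisions by inputs
    return np.clip(exponents, 0.0, None)
```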
Closing this, it works. Next upgrades will be done in a new issue and branch.
There is quite a bit of literature on multiplicative networks, and it sounds very good so far, nice! One paper I like a lot is: http://sci2s.ugr.es/keel/pdf/specific/articulo/Schmidtt%20on-the-complexity-of.pdf
Will work on this in a dedicated branch #14