varunranga / zorb-numpy

ZORB: A Derivative-Free Backpropagation Algorithm for Neural Networks

questions #1

Closed. LifeIsStrange closed this issue 3 years ago.

LifeIsStrange commented 3 years ago

Hi, I'm not an expert but I have a few questions:

While ZORB is impressive performance-wise:

- How much can the accuracy gap vs. Adam vary? Extensive testing is needed.
- Can transformers such as BERT or XLNet be ported to ZORB?
- Does ZORB work as-is with current activation functions, or do they need to be manually tweaked?
- Can the state-of-the-art activation function Mish be used with ZORB? Would it bring the same benefits, or, because ZORB works differently, would it yield no benefit over ReLU?
- Can gradient centralization be ported to ZORB, and does it make sense there?
- Can optimizers such as AdaBelief and stochastic weight averaging be ported to ZORB?
- Can meta-optimizers such as Lookahead be ported?

What about the memory usage (VRAM/RAM) difference between ZORB and BP for training? And what are the ZORB/BP differences in computational/memory cost at inference?

What kind of data/task would be the worst-case scenario for ZORB accuracy vs. BP? Since ZORB updates each layer only once rather than iteratively, I feel like the repeated gradient updates must carry valuable information that ZORB loses. If so, what is that information, and when does it matter most?

Thanks for this groundbreaking work.

varunranga commented 3 years ago

Hi @LifeIsStrange ,

I do agree. Extensive testing is required, especially with more datasets. From our experiments, though, the variance in ZORB's results is generally lower than the variance in Adam's performance. For fully connected networks, ZORB-trained networks perform comparably to Adam-trained networks, and the variance in the gap is also low.

Since transformer-based models essentially use matrix multiplications, ZORB can be applied to these architectures.

If the activation function has a range of (-inf, +inf), you need not tweak it. Otherwise, the tweak is to scale the feedback matrix to the range of the activation function. I have not tried out Mish with ZORB. Networks that contain only ReLU activation functions do not perform well, but using Tanh and ReLU in two consecutive layers shows better performance.

We have also run experiments where we train a network initially using ZORB and then fine-tune it with gradient descent, in order to train it with another loss function (such as cross-entropy).
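For a bounded activation such as sigmoid, here is a minimal NumPy sketch of the kind of rescaling I mean; the helper name and the min-max transform are illustrative, not the exact code in this repository:

```python
import numpy as np

def scale_to_range(F, lo, hi, eps=1e-6):
    """Min-max rescale a feedback matrix F into (lo, hi), shrunk by eps
    so the inverse activation stays finite at the boundaries."""
    F01 = (F - F.min()) / (F.max() - F.min() + 1e-12)   # map into [0, 1]
    return lo + eps + F01 * (hi - lo - 2 * eps)          # map into (lo, hi)

# Example: a feedback matrix that has to pass through a sigmoid's inverse.
rng = np.random.default_rng(0)
F = rng.standard_normal((32, 10))
F_scaled = scale_to_range(F, 0.0, 1.0)                   # sigmoid's range is (0, 1)
pre_activation = np.log(F_scaled / (1.0 - F_scaled))     # logit = inverse sigmoid
```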

ZORB uses the pinv operation from NumPy, which internally uses SVD; this operation is a significant source of memory usage. The datatype I used for the weights is float64. There are no differences in inference computational/memory cost; you only need a small amount of extra memory to store the scaling/shifting parameters for the activation functions.
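As a rough illustration of where the memory goes, here is a hand-rolled pseudoinverse via SVD, mirroring what np.linalg.pinv does internally; the matrix size and rcond value are only examples:

```python
import numpy as np

def pinv_via_svd(A, rcond=1e-15):
    # Same idea as np.linalg.pinv: SVD, drop tiny singular values, recompose.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # the memory-heavy step
    cutoff = rcond * s.max()
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)         # truncate small singular values
    return Vt.T @ (s_inv[:, None] * U.T)

A = np.random.randn(4096, 512)                         # float64 by default
print(A.nbytes / 2**20, "MiB for A alone")             # 16 MiB; float16 would be 4x smaller
print(pinv_via_svd(A).shape)                           # (512, 4096)
```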

Due to memory requirements (and the lack of optimization), large amounts of data would cause some problems for ZORB if you do not have enough memory; a quick fix could be to use float16. ZORB works well in the small-to-medium-sized dataset regime and is well suited for regression tasks. ZORB is not set up for online or temporal learning. I don't believe there is any information loss, other than in the truncated-SVD procedure, which drops small singular values. ZORB uses the pseudoinverse operation, which directly solves for the values in a system of equations.
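To make the "directly solves a system of equations" point concrete, here is a toy least-squares solve with the pseudoinverse; the shapes and names are illustrative, not the actual ZORB interface:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 64))                        # layer inputs (one batch)
W_true = rng.standard_normal((64, 10))
T = A @ W_true + 0.01 * rng.standard_normal((1000, 10))    # noisy targets

# One pseudoinverse gives the least-squares weights in a single shot,
# with no gradients and no iterative updates.
W = np.linalg.pinv(A) @ T
print(np.abs(W - W_true).max())                            # small recovery error
```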

Thank you for your questions!