mikegashler / waffles

A toolkit of machine learning algorithms.
http://gashler.com/mike/waffles/

Neural network documentation #39

Open albertouri opened 7 years ago

albertouri commented 7 years ago

For a novice like me, some of these concepts are hard to follow:

  1. The logistic regression example defines the NN as

    GNeuralNet nn;
    nn.add(new GBlockLinear(3));
    nn.add(new GBlockLogistic());

    But the attribute selection defines a logistic regression as

    GNeuralNetLearner nn;
    nn.nn().add(new GBlockLinear((size_t)0));
    nn.nn().add(new GBlockTanh());
    • Why do they have different activations? It looks like a scaled tanh (which isn't implemented) is recommended.
    • Why is GBlockLinear's output size set to 3 in the example but 0 in the attribute selection?
  2. In the API docs, almost all activation blocks share the same description: "An element-wise nonlinearity block." I don't think this is informative enough.

  3. The relationship between the input and output size parameters of a GBlock is unclear. None of the examples set an input size, and there is only one note: "The layers may be resized as needed when the enclosing neural network is resized to fit the training data." It looks like the input size is always resized to match the previous GBlock's output size (or the number of data attributes if it is the first GBlock), but this is only my assumption since there is no clarification about it (see the sketch after this list).

  4. The concept of a GLayer is unclear from the documentation. It states that a GLayer has one or more GBlocks, but later the examples add more "layers" by just adding more GBlocks. Maybe it would be a good idea to rename the method GNeuralNet::add() to GNeuralNet::addLayer() to reflect that you are adding a new GBlock into a new GLayer...

  5. I haven't seen any example with more than one GBlock in the same GLayer; why have this functionality?

  6. It looks like the first layer you add is the "input layer", but if the second layer is a GBlockActivation (as in the logistic regression example), it is "still" the input layer. If this is the case, it would be great to have a clear distinction between the input layer and the hidden layers...
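To make my assumption in point 3 concrete, here is the kind of behavior I am imagining (just a sketch; the constructors are copied from the examples above, and the resizing behavior is my guess rather than something the documentation states):

    GNeuralNet nn;
    nn.add(new GBlockLinear(8)); // hidden weights: 8 outputs; inputs presumably resized to the number of data attributes
    nn.add(new GBlockTanh());    // element-wise activation, so no size needed
    nn.add(new GBlockLinear(1)); // output weights: 1 output; inputs presumably resized to 8 (the previous block's output size)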

I'm sure I'll have more questions, but this is a good start ;) Also, I would be more than happy to modify/extend the documentation if someone clarifies all my doubts.

mikegashler commented 7 years ago

(1a) GNeuralNet implements a neural network. GNeuralNetLearner is a wrapper around GNeuralNet that makes it compatible with the GSupervisedLearner interface, making it easy to compare with the other learning algorithms in Waffles.

(1b) Some activation functions yield better results with some problems than others. The logistic activation function is often used in an output layer when predicting probabilities for categorical values. The tanh activation function is often used in hidden layers; it has the desirable property of approximating identity when the weights are initialized with small random values. Many recent publications seem to favor rectifiers, and many other activation functions are in various stages of experimental research.

(1c) I would like to add scaled tanh. This is currently a deficiency of Waffles. The advantage it offers is a small improvement in training time (usually), since the weights have to adapt less.

(1d) The attribute selector resizes its outputs to fit the problem. The outputs are initialized to a size of zero because this involves no weights, and thus avoids a superfluous allocation. (This is just an OCD performance optimization that probably saves about a nanosecond of run-time in the long run.)

(2) I agree.

(3) In the vast majority of use cases, the number of inputs for a layer will match the number of outputs in the preceding layer. For special purposes, however, it is possible to concatenate blocks such that they work together to form a single layer. In such configurations, one might want the outputs of the previous layer to redundantly feed into both blocks, or set each block to process a different portion of the outputs. In these cases, the user needs to specify the number of inputs for each block, since there is no way the model could determine it automatically.

(4) Yes, "addLayer" would be more descriptive.

(5) GNeuralDecomposition uses it, and I am also using it in some of my ongoing research experiments. Waffles tends to evolve in whatever direction fits my current research. I agree that few users will ever use this feature.

(6) Historically, the weights and the activation function were considered together to be a single layer. Recent trends seem to view them more as two separate layers. I have grown to like the newer paradigm better: it keeps the layers simpler, and it is more accommodating of special-purpose layers (such as batch normalization layers, mask layers for drop connect or other purposes, skip-connection layers, etc.).
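To tie (1a) and (6) back to your two examples: the same network can be built bare or wrapped, with the weights and the activation added as two separate layers (an untested sketch; the layer sizes are just placeholders):

    // A bare GNeuralNet: a weight layer followed by an activation layer
    GNeuralNet nn;
    nn.add(new GBlockLinear(3)); // weights
    nn.add(new GBlockTanh());    // element-wise activation

    // The same thing wrapped in GNeuralNetLearner, so it can be used
    // anywhere a GSupervisedLearner is expected
    GNeuralNetLearner learner;
    learner.nn().add(new GBlockLinear(3));
    learner.nn().add(new GBlockTanh());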

Thanks for playing with our code! Any efforts you feel inclined to make toward improving the code or documentation would be most welcome!

albertouri commented 7 years ago

Thanks for your answer! Some thoughts:

  1. Then the logistic regression defined in the "attribute selection" should use a logistic activation function, right?

  2. But the output size should be the number of classes in the dataset, right? For the "logistic regression example" it should be 1 (since we only have 1 class with 3 categories), and in the "attribute selection" it should be labels.cols(). I don't see in the code how the "attribute selection" resizes the output size (though I do see how it resizes the input size to fit the problem).

  3. Mmmm, I see now this trend of considering weights and activation functions as separate layers. Maybe it would be a good idea to clarify in the documentation that to form a "hidden layer" or an "output layer" you need at least both of them.

I will modify the documentation and create a pull request ;) How do you feel about migrating the documentation to a markup language like Markdown for faster editing? And maybe hosting it on Read the Docs?

mikegashler commented 7 years ago
  1. Yes, that should be better in theory. In my experience, the logistic activation function seems to take longer to train unless I am careful about how I initialize the weights. I usually default to tanh because I'm too lazy to figure out the optimal weight initialization for the logistic function.
  2. GAttributeSelector::trainInner calls GNeuralNetLearner::beginIncrementalLearning, which calls GNeuralNet::resize, which contains a block of code headed by this comment: "// Resize the outputs of the last non-elementwise layer, and all subsequent element-wise layers, to fit the outputs". That is where the outputs are resized to fit the data; see the sketch after this list. (Admittedly, this code could be cleaner. I hope to get there eventually.)
  3. Good idea.
  4. I definitely like the idea of making the docs easier to edit. I don't like the idea of becoming dependent on an additional cloud service. Is that something that would work off-line just as well?
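Regarding 2, here is roughly what that flow looks like from the caller's side (a sketch from memory, not compiled; features and labels stand for whatever GMatrix data the caller has already prepared):

    GNeuralNetLearner nn;
    nn.nn().add(new GBlockLinear((size_t)0)); // zero outputs for now, so no weights are allocated yet
    nn.nn().add(new GBlockTanh());
    // Training through the GSupervisedLearner interface calls
    // beginIncrementalLearning, which calls GNeuralNet::resize. That is
    // where the last non-elementwise layer (and any element-wise layers
    // after it) is grown to match labels.cols(), and the input size is
    // set to fit the features.
    nn.train(features, labels);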
albertouri commented 7 years ago

Got it, I think now I'm ready to improve the documentation ;)

About the documentation, my suggestion is to use Sphinx, with Breathe as a bridge between Sphinx and Doxygen. The output will be an HTML site with both the documentation and the API reference, so yes, it is possible to work "off-line".

The only advantage of Read the Docs is that you can configure a "webhook" on GitHub to build the documentation on every commit and update it on Read the Docs. This way you don't have to worry about having an outdated online copy. It also detects git tags, so it keeps a documentation version for each tag. In any case, this is optional ;)