Open henryliangt opened 1 year ago
Perceptron learning rule
Perceptron learning algorithm
Checking the stopping condition
Perceptron - example
Solution (1)
Solution (2)
Perceptron - capabilities
Perceptron - history
Perceptron for AND gate
Perceptron for OR gate
Perceptron for NAND gate
Perceptron for XOR gate
Perceptron for XOR gate 2
Adding more layers
Multi-layer perceptrons and backpropagation algorithm Backpropagation algorithm • An algorithm that is used to train multi-layer perceptron neural networks • Proposed by Paul Werbos in 1974; later re-discovered by David Rumelhart, Geoffrey Hinton, Ronald Williams 1986; David Parker 1985; Yann LeCun 1985
Network architecture • Layers of neurons – input layer, output layer, 1 or mode hidden layers • Feedforward NN – each neuron receives input only from the previous layer • Fully connected network – each neuron in the current layer is connected with all neurons from the previous
Neuron • A single neuron calculates the weighed sum of the outputs of the previous layer, which is passed through a transfer function • The transfer function needs to be differentiable • Most widely used transfer function: sigmoid
Backpropagation algorithm - idea
• For each training example {x,t}, x={x1 , x2 , ..xn } • Propagate x through the network and calculate the output o. Compare it with the target output t and calculate the error. • Update weights of the network to reduce the error • Until error over all examples < threshold • Adjusts the weights backwards - from the output to the input neurons by propagating the weight change to minimize the error
• How to calculate the weight change? • By defining an error function and using the gradient descent algorithm to calculate it
Steepest gradient descent • We define an error function (also called cost or loss function), e.g. MSE over all training examples • We can minimize it by using the steepest gradient descent algorithm (standard optimization method) • The NN weights are iteratively updated by moving downhill - in the direction that reduces the error the most – this is the direction of the negative of the gradient
Gradient descent (2)
Local and global minimum
Weight update formulas
Weight update for sigmoid transfer function
Training NN with the backpropagation algorithm
Backpropagation algorithm - example
Backpropagation algorithm – example (2)
Backpropagation algorithm – example (3)
Backpropagation algorithm – example (4)
Stochastic gradient descent • The standard gradient descent algorithm sums the error of all training examples and then updates the weights • The Stochastic Gradient Descent (SGD) updates the weights after each example – this is the algorithm we used; SGD is usually faster • Mini-batch SGD (also called batch gradient descent) – sum the error of mini-batches of training examples, update the weights
Universality of multi-layer perceptrons • Multi-layer perceptrons trained with the backpropagation algorithm are universal approximators – theorems: • Any continuous function can be approximated with arbitrary small error by a network with one hidden layer (Cybenko 1989, Hornik et al. 1989): • Any function (including discontinuous) can be approximated to arbitrary small error by a network with two hidden layers (Cybenko 1988) • These are existence theorems – they don’t say how to choose the network architecture and hyperparameters
Design issues - architecture • Number of neurons in the input layer • Numerical attributes - 1 neuron per attribute • Categorical attributes with k values – k neurons per attribute, one-hot encoding • One-hot encoding - represent a k-valued attribute with k binary attributes, only 1 of which is set to 1 (“hot”) and all the others are 0 • E.g. outlook had 3 values – sunny, overcast, rainy; one-hot encoding: 1 0 0 for outlook=sunny, 0 1 0 for outlook=overcast and 0 0 1 for outlook=rainy • Number of neurons in the output layer • 1 neuron for binary class problem • k neurons for k-class problem, one-hot encoding
Design issues – architecture (2)
Presentation of training examples • Standard approach: • Each epoch runs through the same training examples, one by one, always in the same sequence • Alternatives: • 1. For each epoch, permutate the training examples • 2. Present more often the examples with higher error and less often the examples with lower error • 3. Present the examples not one by one, but in batches of N examples, summing up their individual errors and updating after each batch (minibatch)
Learning rate
Learning rate - constant
Learning rate – time-dependent • Alternatives: variable learning rate – time-dependent • Start with a high learning rate, gradually decrease it – motivation: • Big learning rate initially -> greater weight change - can reduce the number of training epochs and may even help to jump over some local minima • Later - decrease the learning rate to prevent overshooting the global minimum • Different decay schedules: • 0 = 𝑛−1 1+𝑑+𝑛 (time-based) 𝑛= 0𝑒 −𝑑𝑛 (exponential) Irena Koprinska, irena.koprinska@sydney.edu.au COMP5318 ML&DM, week 7, 2022 n – epoch number, d - decay rate (hyperparameter); e.g. if 0=0.3, values in the first 5 epochs: