perceptron ML7 - Githubissues

henryliangt commented 1 year ago

Perceptron learning rule

Perceptron learning algorithm

Checking the stopping condition

henryliangt commented 1 year ago

Perceptron - example

Solution (1)

Solution (2)

Perceptron - capabilities

Perceptron - history

henryliangt commented 1 year ago

Perceptron for AND gate

Perceptron for OR gate

Perceptron for NAND gate

Perceptron for XOR gate

henryliangt commented 1 year ago

Perceptron for XOR gate 2

Adding more layers

henryliangt commented 1 year ago

Multi-layer perceptrons and backpropagation algorithm Backpropagation algorithm • An algorithm that is used to train multi-layer perceptron neural networks • Proposed by Paul Werbos in 1974; later re-discovered by David Rumelhart, Geoffrey Hinton, Ronald Williams 1986; David Parker 1985; Yann LeCun 1985

Network architecture • Layers of neurons – input layer, output layer, 1 or mode hidden layers • Feedforward NN – each neuron receives input only from the previous layer • Fully connected network – each neuron in the current layer is connected with all neurons from the previous

Neuron • A single neuron calculates the weighed sum of the outputs of the previous layer, which is passed through a transfer function • The transfer function needs to be differentiable • Most widely used transfer function: sigmoid

Backpropagation algorithm - idea

• For each training example {x,t}, x={x1 , x2 , ..xn } • Propagate x through the network and calculate the output o. Compare it with the target output t and calculate the error. • Update weights of the network to reduce the error • Until error over all examples < threshold • Adjusts the weights backwards - from the output to the input neurons by propagating the weight change to minimize the error

• How to calculate the weight change? • By defining an error function and using the gradient descent algorithm to calculate it

Steepest gradient descent • We define an error function (also called cost or loss function), e.g. MSE over all training examples • We can minimize it by using the steepest gradient descent algorithm (standard optimization method) • The NN weights are iteratively updated by moving downhill - in the direction that reduces the error the most – this is the direction of the negative of the gradient

henryliangt commented 1 year ago

Gradient descent (2)

Local and global minimum

Weight update formulas

Weight update for sigmoid transfer function

henryliangt commented 1 year ago

Training NN with the backpropagation algorithm

Initialize all weights (biases incl.) to small random values, e.g. [-1,1] • 2. Repeat until stopping condition is satisfied: • (forward pass) • --Present a training example and compute the network output • (backward pass) • --Compute the  values for the output neurons and update the weights to the output layer • --Starting with output layer, repeat for each layer in the network: • propagate the  values back to the previous layer • update the weights between the two layers • 3. Check the stopping condition at the end of each epoch – e.g. the error on the training set is below a threshold or a maximum number of epochs has been reached

Backpropagation algorithm - example

Backpropagation algorithm – example (2)

Backpropagation algorithm – example (3)

Backpropagation algorithm – example (4)

henryliangt commented 1 year ago

Stochastic gradient descent • The standard gradient descent algorithm sums the error of all training examples and then updates the weights • The Stochastic Gradient Descent (SGD) updates the weights after each example – this is the algorithm we used; SGD is usually faster • Mini-batch SGD (also called batch gradient descent) – sum the error of mini-batches of training examples, update the weights

Universality of multi-layer perceptrons • Multi-layer perceptrons trained with the backpropagation algorithm are universal approximators – theorems: • Any continuous function can be approximated with arbitrary small error by a network with one hidden layer (Cybenko 1989, Hornik et al. 1989): • Any function (including discontinuous) can be approximated to arbitrary small error by a network with two hidden layers (Cybenko 1988) • These are existence theorems – they don’t say how to choose the network architecture and hyperparameters

henryliangt commented 1 year ago

Design issues - architecture • Number of neurons in the input layer • Numerical attributes - 1 neuron per attribute • Categorical attributes with k values – k neurons per attribute, one-hot encoding • One-hot encoding - represent a k-valued attribute with k binary attributes, only 1 of which is set to 1 (“hot”) and all the others are 0 • E.g. outlook had 3 values – sunny, overcast, rainy; one-hot encoding: 1 0 0 for outlook=sunny, 0 1 0 for outlook=overcast and 0 0 1 for outlook=rainy • Number of neurons in the output layer • 1 neuron for binary class problem • k neurons for k-class problem, one-hot encoding

Design issues – architecture (2)

Presentation of training examples • Standard approach: • Each epoch runs through the same training examples, one by one, always in the same sequence • Alternatives: • 1. For each epoch, permutate the training examples • 2. Present more often the examples with higher error and less often the examples with lower error • 3. Present the examples not one by one, but in batches of N examples, summing up their individual errors and updating after each batch (minibatch)

Learning rate

Learning rate - constant

Demo from https://towardsai.net/p/machine-learning/analysis-of-learning-rate-in-gradient-descent-algorithm-using-python

Learning rate – time-dependent • Alternatives: variable learning rate – time-dependent • Start with a high learning rate, gradually decrease it – motivation: • Big learning rate initially -> greater weight change - can reduce the number of training epochs and may even help to jump over some local minima • Later - decrease the learning rate to prevent overshooting the global minimum • Different decay schedules: • 0 = 𝑛−1 1+𝑑+𝑛 (time-based) 𝑛= 0𝑒 −𝑑𝑛 (exponential) Irena Koprinska, irena.koprinska@sydney.edu.au COMP5318 ML&DM, week 7, 2022 n – epoch number, d - decay rate (hyperparameter); e.g. if 0=0.3, values in the first 5 epochs:

henryliangt / usyd

perceptron ML7 #32