nowol79 / MOOC

edx, coursera...

week 2. Basics of Neural Network Programming #12

nowol79 opened 6 years ago

nowol79 commented 6 years ago

Copyright by deeplearning.ai, Coursera Deep Learning

Binary Classification

We will look at the basics of neural network programming. It turns out that when you implement a neural network there are some techniques that are going to be really important. For example, if you have a training set of m training examples, you might be used to processing the training set by having a for loop step through your m training examples.
But it turns out that when you're implementing a neural network, you usually want to process your entire training set without using an explicit for loop to loop over it. Another idea: when you organize the computation of a neural network, you usually have what's called a forward pass or forward propagation step, followed by a backward pass or what's called a backward propagation step. You will also get an introduction to why the computations in learning a neural network can be organized into this forward propagation and a separate backward propagation.

For this week's materials I want to convey these ideas using logistic regression, in order to make the ideas easier to understand.

Logistic regression is an algorithm for binary classification.

So let's start by setting up the problem. Here's an example of a binary classification problem. You might have an input of an image, like that, and want to output a label to recognize this image as either being a cat, in which case you output 1, or not-cat in which case you output 0, and we're going to use y to denote the output label. Let's look at how an image is represented in a computer. To store an image your computer stores three separate matrices corresponding to the red, green, and blue color channels of this image.

[image: the red, green, and blue pixel intensity matrices for an input image]

So if your input image is 64 pixels by 64 pixels, then you would have three 64 by 64 matrices corresponding to the red, green, and blue pixel intensity values for your image. Although to make this little slide I drew these as much smaller matrices, so these are actually 5 by 4 matrices rather than 64 by 64.

So to turn these pixel intensity values into a feature vector, what we're going to do is unroll all of these pixel values into an input feature vector x. That is, we define a feature vector x corresponding to this image as follows. We're just going to take all the pixel values, 255, 231, and so on, until we've listed all the red pixels, and then eventually 255, 134, and so on, until we get a long feature vector listing out all the red, green, and blue pixel intensity values of this image.

If this image is a 64 by 64 image, the total dimension of this vector x will be 64 by 64 by 3, because that's the total number of values we have in all of these matrices. Which in this case turns out to be 12,288; that's what you get if you multiply all those numbers. And so we're going to use nx = 12288 to represent the dimension of the input features x. And sometimes for brevity, I will also just use lowercase n to represent the dimension of this input feature vector. So in binary classification, our goal is to learn a classifier that can take as input an image represented by this feature vector x, and predict whether the corresponding label y is 1 or 0, that is, whether this is a cat image or a non-cat image.
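As a minimal sketch of this unrolling in numpy (the random image and array names are illustrative, not from the lecture):

```python
import numpy as np

# A hypothetical 64 x 64 RGB image: pixel intensities in [0, 255].
image = np.random.randint(0, 256, size=(64, 64, 3))

# List all red values, then all green, then all blue, as in the lecture:
# move the channel axis to the front, then unroll into one column vector.
x = image.transpose(2, 0, 1).reshape(-1, 1)

print(x.shape)  # (12288, 1) -- nx = 64 * 64 * 3 = 12288
```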

Expressed in notation:

- x: an nx-dimensional feature vector
- y: the label (0 or 1)
- m: the number of training examples

The training set is:

(x1, y1): training example 1
(x2, y2): training example 2
...
(xm, ym): training example m

Finally, to put all of the training examples into a more compact notation, we're going to define a matrix, capital X, by taking the training set inputs x1, x2, and so on and stacking them in columns. So we take x1 and put that as the first column of this matrix, x2 as the second column, and so on down to xm; then this is the matrix capital X.

So this matrix X will have m columns, where m is the number of training examples, and the number of rows, or the height of this matrix, is nx.

[image: the matrix X built by stacking the training inputs x1 through xm as columns]
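A small sketch of this column stacking in numpy (m = 3 and the random inputs are illustrative, not from the lecture):

```python
import numpy as np

nx, m = 12288, 3  # illustrative dimensions

# Hypothetical training inputs, each a column vector of shape (nx, 1).
examples = [np.random.rand(nx, 1) for _ in range(m)]

# Stack the examples side by side as the columns of capital X.
X = np.hstack(examples)

print(X.shape)  # (12288, 3) -- nx rows, m columns
```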

Notice that in other courses, you might see the matrix capital X defined by stacking up the training examples in rows instead, x1 transpose down to xm transpose. It turns out that when you're implementing neural networks, using the column convention I have on the left will make the implementation much easier.
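For contrast, a toy sketch of the two conventions (tiny illustrative arrays only):

```python
import numpy as np

nx, m = 4, 3  # tiny illustrative dimensions
examples = [np.random.rand(nx, 1) for _ in range(m)]

# Convention used in this course: examples stacked in columns.
X_cols = np.hstack(examples)                 # shape (nx, m)

# Convention seen in other courses: x transposed and stacked in rows.
X_rows = np.vstack([x.T for x in examples])  # shape (m, nx)

print(X_cols.shape, X_rows.shape)  # (4, 3) (3, 4)
```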

So just to recap, X is an nx by m dimensional matrix, and when you implement this in Python, you see that X.shape, the Python command for finding the shape of a matrix, gives (nx, m). That just means it is an nx by m dimensional matrix. So that's how you group the training examples, input x, into a matrix. How about the output labels Y? It turns out that to make your implementation of a neural network easier, it will be convenient to also stack Y in columns. So we're going to define capital Y to be equal to y1, y2, up to ym, like so.

So Y here will be a 1 by m dimensional matrix. And again, using this notation, Y.shape will be (1, m), which just means this is a 1 by m matrix. And as you implement your neural network later in this course, you'll find that a useful convention is to take the data associated with different training examples, and by data I mean either x or y, or other quantities you see later, and stack it in different columns, like we've done here for both x and y.
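Continuing the sketch above, the labels can be stacked the same way (hypothetical values):

```python
import numpy as np

# Hypothetical labels for m = 3 training examples, stacked in columns.
Y = np.array([[1, 0, 1]])

print(Y.shape)  # (1, 3) -- a 1 by m matrix
```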

So that's the notation we'll use for logistic regression and for neural networks later in this course. If you ever forget what a piece of notation means, like what is m or what is n or what is something else, we've also posted on the course website a notation guide that you can use to quickly look up what any particular piece of notation means. So with that, let's go on to the next video, where we'll start to flesh out logistic regression using this notation.

nowol79 commented 6 years ago

Logistic Regression

This is a learning algorithm that you use when the output labels y in a supervised learning problem are all either zero or one, so for binary classification problems. Given an input feature vector x, maybe corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, which we'll call y hat, which is your estimate of y. More formally, you want y hat to be the probability, or the chance, that y is equal to one given the input features x.

So in other words, if x is a picture, as we saw in the last video, you want y hat to tell you: what is the chance that this is a cat picture? x, as we said in the previous video, is an nx-dimensional vector, and the parameters of logistic regression are w, which is also an nx-dimensional vector, together with b, which is just a real number. So given an input x and the parameters w and b, how do we generate the output y hat? Well, one thing you could try, that doesn't work, would be to have y hat be w transpose x plus b, a linear function of the input x. And in fact, this is what you would use if you were doing linear regression. But this isn't a very good algorithm for binary classification, because you want y hat to be the chance that y is equal to one.

So y hat should really be between zero and one, and it's difficult to enforce that, because w transpose x plus b can be much bigger than one, or it can even be negative, which doesn't make sense for a probability that you want to be between zero and one. So in logistic regression, our output is instead going to be y hat equals the sigmoid function applied to this quantity. If on the horizontal axis I plot z, then sigmoid of z goes smoothly from zero up to one, crossing the vertical axis at 0.5. And we're going to use z to denote this quantity, w transpose x plus b.

Here's the formula for the sigmoid function: sigmoid(z) = 1 / (1 + e^(-z)), where z is a real number. So notice a couple of things. If z is very large, then e^(-z) will be close to zero, so sigmoid of z will be approximately one over one plus something very close to zero, which is close to 1. And indeed, if you look at the plot, if z is very large, the sigmoid of z is very close to one. Conversely, if z is very small, or a very large negative number, then e^(-z) becomes a huge number, so sigmoid of z becomes one over one plus a number that is very, very big, which is close to zero. And indeed, you see that as z becomes a very large negative number, sigmoid of z gets very close to zero.

So when you implement logistic regression, your job is to try to learn parameters w and b so that y hat becomes a good estimate of the chance of y being equal to one.

Before moving on, just another note on the notation. When we program neural networks, we'll usually keep the parameter w and the parameter b separate, where b corresponds to an intercept term. In some other courses, you might have seen a notation that handles this differently. In some conventions, you define an extra feature called x0 that equals one, so that now x is in R^(nx+1), and you define y hat to be equal to sigma of theta transpose x. In this alternative notational convention, you have a vector of parameters theta: theta zero, theta one, theta two, down to theta nx. Theta zero plays the role of b, that's just a real number, and theta one down to theta nx play the role of w. It turns out that when you implement your neural network, it will be easier to just keep b and w as separate parameters.

[image: the sigmoid curve and the logistic regression notation, y hat = sigmoid(w transpose x + b)]

And so, in this class, we will not use any of the notational convention that I just wrote in red. If you've not seen this notation before in other courses, don't worry about it; it's just that for those of you who have seen it, I wanted to mention explicitly that we're not using it in this course. So you have now seen what the logistic regression model looks like. Next, to train the parameters w and b, you need to define a cost function. Let's do that in the next video.
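Putting the pieces of this video together, a minimal numpy sketch of the model (the parameter values and the input are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid of z: 1 / (1 + e^(-z)).

    Close to 1 for large positive z, close to 0 for large negative z.
    """
    return 1.0 / (1.0 + np.exp(-z))

nx = 12288  # dimension of the input feature vector

# Parameters: w is an nx-dimensional column vector, b is a real number.
w = np.zeros((nx, 1))
b = 0.0

# A hypothetical input image, already unrolled into a feature vector.
x = np.random.rand(nx, 1)

# y hat = sigmoid(w^T x + b), the estimated probability that y = 1.
z = np.dot(w.T, x) + b
y_hat = sigmoid(z)

print(y_hat.item())  # 0.5 for all-zero parameters, since sigmoid(0) = 0.5
```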

Reference: https://www.youtube.com/watch?v=kHLqMsN7yao

nowol79 commented 6 years ago

http://sacko.tistory.com/40?category=630831 http://gnujoow.github.io/ml/2016/01/29/ML3-Logistic-Regression/