mml-book / mml-book.github.io

Companion webpage to the book "Mathematics For Machine Learning"

The intention of formula 7.42? #408

Open zhfkt opened 5 years ago

zhfkt commented 5 years ago

This request is not about a mistake; it is a mathematical question.

On p. 239 (version 2019-08-20), it says:

"Taking the derivative of \mathfrak{L}(x, \lambda) with respect to x and setting it to zero gives us c + A^{\top}\lambda = 0 (formula 7.42)."

I would like to ask why we need to "take the derivative of \mathfrak{L}(x, \lambda) with respect to x and set it to zero". \mathfrak{L}(x, \lambda) is a linear function of x, so it has no minimum value, right? Why, then, do we take the derivative and set it to zero? The intention of formula 7.42 is not very clear in the book.

Sorry for my weak mathematics background.

Thank you,

aliunluml commented 5 years ago

The dual Lagrangian is \mathfrak{D}(\lambda)=\min_{x \in \mathbb{R}^{d}}\mathfrak{L}(x,\lambda). This is given in equation 7.27.

To get \min_{x \in \mathbb{R}^{d}}\mathfrak{L}(x,\lambda), we need to take the derivative of \mathfrak{L}(x,\lambda) with respect to x, equate it to 0, solve for x, and then substitute that x into \mathfrak{L}(x,\lambda) to get \mathfrak{D}(\lambda). Our aim here is to fix the x that minimizes \mathfrak{L}(x,\lambda) and then maximize the result, \mathfrak{D}(\lambda), over \lambda.

We can do this because we know \mathfrak{L}(x,\lambda) is convex, so it attains its minimum at the x that makes its derivative equal to 0.
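If it helps, here is the calculation written out for the linear program in that section (a sketch from memory, so the displayed form may differ slightly from the book's equations 7.39-7.41):

```latex
% Lagrangian of  \min_x c^\top x  subject to  Ax \leq b
\mathfrak{L}(x,\lambda)
  = c^{\top}x + \lambda^{\top}(Ax - b)
  = (c + A^{\top}\lambda)^{\top}x - \lambda^{\top}b,
\qquad
\nabla_{x}\mathfrak{L}(x,\lambda)
  = c + A^{\top}\lambda \overset{!}{=} 0.
```

The last equality is exactly what becomes equation 7.42.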

The part below may use very hasty notation; just take the intuition from it.

In this example, we don't really solve for x, since no x remains. It instead cancels x indirectly:

f(x) = ax + b (imagine this to be 7.41)
f'(x) = a
a = 0 => f_{new}(x) = b

Since it cancels x in the end, it "fixes f(x) for x". I may be wrong, but this is how I got it. Hope it helps.
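Writing the toy example out as a display (same content as above, just spelled out):

```latex
\min_{x \in \mathbb{R}} f(x)
  = \min_{x \in \mathbb{R}} (ax + b)
  = \begin{cases}
      b       & \text{if } a = 0, \\
      -\infty & \text{otherwise.}
    \end{cases}
```

With a = 0 the minimum no longer depends on x, which is what "cancelling x" means here.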

Edit: Added spacing

zhfkt commented 5 years ago

Hi Aliunlu97 ,

Thank you for joining the discussion and sharing your understanding of this question. I agree with the first part of your reply, i.e. the following:


The dual Lagrangian \mathfrak{D}(\lambda)=\min_{x \in \mathbb{R}^{d}}\mathfrak{L}(x,\lambda). This is given in equation 7.27.

To get \min_{x \in \mathbb{R}^{d}}\mathfrak{L}(x,\lambda), we would need to take a derivative of \mathfrak{L}(x,\lambda) and equate it to 0, solve for x and then substitute that x into \mathfrak{L}(x,\lambda) to get \mathfrak{D}(\lambda). Our aim here is to fix x that minimizes \mathfrak{L}(x,\lambda) and then maximize that (\mathfrak{D}(\lambda)) for \lambda.

We can do this because we know \mathfrak{L}(x,\lambda) is convex and so it will be minimum for the x that makes its derivative equal to 0.


But I didn't clearly understand your example. For "f(x) = ax + b", if we want to find an x to minimize f(x) as much as possible, we may set x to negative infinity (if a > 0) or to positive infinity (if a < 0), right? Why do we need to set a = 0 to eliminate the x in f(x)? I didn't understand much of the saying "Since it cancels x in the end, it fixes f(x) for x".

Sorry for my shaky grasp of this area of mathematics.

Thank you.

aliunluml commented 5 years ago

Hi,

if we want to find an x to minimize the f(x) as much as possible, we may set the x to the negative infinity (if a>0 ), or set the x to the positive infinity ( if a<0), right ?

We could only do that if there were no constraints. I forgot to mention the constraints in my example, so thanks for that good point; my example, as I said, was very hasty. In equation 7.43, for example, we have the two constraints we are subject to:

\lambda \geq 0 (this is always the case; see the part preceding 7.20a)
c + A^{\top}\lambda = 0 (this comes from taking the derivative and equating it to 0; think of it as a consequence of that)

Given these constraints, -b^{\top}\lambda cannot reach positive infinity; the two constraints prevent that. See Figure 7.9 for example.

By the way, we want to eliminate the dependence on x ("fix x") because we want to use the minimax inequality (see equation 7.27). We need to first minimize over x and then maximize over \lambda.
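For reference, the minimax inequality being invoked (as I remember it from that part of the book, so the exact equation number may differ):

```latex
\max_{\lambda \geq 0} \; \min_{x \in \mathbb{R}^{d}} \mathfrak{L}(x,\lambda)
  \;\leq\;
\min_{x \in \mathbb{R}^{d}} \; \max_{\lambda \geq 0} \mathfrak{L}(x,\lambda).
```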

Why do we need to set a = 0 to eliminate the x in the f(x) ?

What I meant was just to show how we get c + A^{\top}\lambda = 0 in the transition from 7.41 to 7.42. It is just a derivation: we set f'(x) = 0, and since f'(x) = a, we get a = 0. We do not set a = 0 directly; we set it to 0 because it is the derivative of f(x).

I didn't understand much in the saying "Since it cancels x in the end, it fixes f(x) for x".

What I meant was that we found the condition that minimizes f(x) (namely a = 0), which makes f(x) = ax + b = 0x + b = b. Hence it cancels x; it eliminates the dependency on x. My example was not a good one.

Edit: Added the last paragraph Edit2: Answered questions.

zhfkt commented 5 years ago

Hi Aliunlu97,

My concern with the reply above is "taking the derivative and equating it to 0" for a linear function. If f(x) is a quadratic function such as ax^2, we can get the min/max value by taking the derivative and equating it to 0. But f(x) here is a linear function, so I guess we cannot apply the same reasoning to it.

I found another description for this problem on the web https://www.princeton.edu/~chiangm/ele539l3a.pdf

(screenshot of a slide from the linked PDF)

We can see that the purpose of equating to 0 is not derived from taking a derivative. It is that the max-min function is meaningful only in the zero case; in every other case it produces negative infinity. I guess this PDF explains it more clearly than the book does.
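The slides' argument can also be checked numerically. Here is a minimal sketch in NumPy; the matrix A, vector b, and multipliers below are made up for illustration, matching the linear program min c^T x subject to Ax <= b:

```python
import numpy as np

# For the Lagrangian L(x, lambda) = (c + A^T lambda)^T x - b^T lambda,
# the minimum over x is -inf unless c + A^T lambda = 0, in which case
# it is the constant -b^T lambda. All data below are invented.

rng = np.random.default_rng(0)
m, d = 4, 3
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
lam_good = rng.uniform(0.1, 1.0, size=m)
c = -A.T @ lam_good                      # makes c + A^T lam_good = 0 exactly

def lagrangian(x, lam):
    return c @ x + lam @ (A @ x - b)

# Case 1: c + A^T lambda = 0 -> L is constant in x, equal to -b^T lambda.
for _ in range(5):
    x = rng.normal(size=d) * 100.0
    assert np.isclose(lagrangian(x, lam_good), -b @ lam_good)

# Case 2: c + A^T lambda != 0 -> L decreases without bound along
# the direction -(c + A^T lambda).
lam_bad = lam_good + 1.0                 # violates c + A^T lambda = 0
g = c + A.T @ lam_bad
vals = [lagrangian(-t * g, lam_bad) for t in (1.0, 10.0, 100.0)]
assert vals[0] > vals[1] > vals[2]       # heading to -infinity
print("constant case:", -b @ lam_good, "unbounded case samples:", vals)
```

So "equating the coefficient to 0" is exactly the condition that keeps the inner minimum finite.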

Thank you.

aliunluml commented 5 years ago

Hi,

We could see that the purpose of equating to 0 is not derived by taking derivative. It is because the max-min function will be meaningful only in 0 case.

Our aim is to eliminate the dependency on x (cancel x) as I stated in my answer:

By the way, we want to eliminate the dependence on x ("fix x") because we want to use the minimax inequality (see equation 7.27).

I get what you mean, but they are actually not different things. In the slides, aside from being the coefficient of x, c+A^{\top}\nu-\lambda is the slope of L(x,\lambda,\nu), and he sets it to 0. For this example, you can read that as setting the coefficient of x to zero in order to cancel x. Once he sets it to zero, it becomes a constraint.

However, this approach is specific to this particular linear programming example; it does not carry over to the quadratic programming example. We cannot always set Q to 0, for instance, even though it is the "coefficient" of x^{2}: if we are given a Q other than 0, we would simply fail to satisfy the constraint Q = 0. This is why the general approach is differentiating and setting the derivative to 0 to solve for x. It applies to a much wider range of cases.
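For contrast, a sketch of the quadratic case (assuming Q is symmetric positive definite; I am writing the notation from memory, so it may differ from the book's treatment around 7.45):

```latex
\mathfrak{L}(x,\lambda)
  = \tfrac{1}{2}\, x^{\top} Q x + c^{\top} x + \lambda^{\top}(Ax - b),
\qquad
\nabla_{x}\mathfrak{L}
  = Qx + c + A^{\top}\lambda \overset{!}{=} 0
\;\Longrightarrow\;
x^{*} = -Q^{-1}\!\left(c + A^{\top}\lambda\right).
```

Here the stationarity condition determines x rather than turning into a constraint, which is why differentiating generalizes while the set-the-coefficient-to-zero trick does not.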

But I agree with you that the slides are clearer for this particular example. The book's aim in that part, however, is to demonstrate the general approach in two cases (one linear, the other quadratic), which is better since it shows a generally applicable methodology.

Sincerely,

Edit: Clarification made.

zhfkt commented 5 years ago

Thank you. I agree that the book's aim is "to demonstrate the general approach in 2 cases". The discussion with you helped me a lot.

aliunluml commented 5 years ago

You are welcome.

mpd37 commented 5 years ago

Thanks for your contribution. Can this issue be closed?

zhfkt commented 5 years ago

The admin can close the issue if you want, or keep it open if you would like to attract more people to the discussion.