mml-book / mml-book.github.io

Companion webpage to the book "Mathematics For Machine Learning"

Clarification request for 5.113, general form of trace w.r.t. matrix gradient #212

Closed Antymon closed 5 years ago

Antymon commented 5 years ago

Describe the mistake
It seems odd to me that 5.113 claims that the derivative of a trace w.r.t. a matrix is the trace of the derivative, since the dimensions obviously don't seem to match. Either this is some counterintuitive form or a mistake. It also differs from the general form in The Matrix Cookbook, which the MML book seems to refer to:

[screenshot of the corresponding identity from The Matrix Cookbook]

*The Matrix Cookbook* [http://matrixcookbook.com], Kaare Brandt Petersen, Michael Syskind Pedersen, version: November 15, 2012.

Side note
A similar issue applies to 5.114. The Matrix Cookbook seems to use a scalar gradient for that one, which makes the dimensions obviously hold.

[screenshot of the corresponding Matrix Cookbook identity for 5.114]

Location

  1. Version: 2018-08-30
  2. Chapter 5
  3. Page 156
  4. Equations 5.113 and 5.114

mpd37 commented 5 years ago

The derivative of the trace is the trace of the derivative: Both the trace and the derivative are linear operators, so we can exchange them.
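
As a minimal sketch of what this exchange means in the scalar-argument case (an illustration, not a quote from the book): for a matrix-valued function $\boldsymbol{A}(x)$ of a scalar $x$,

$$
\frac{\partial}{\partial x}\,\mathrm{tr}\big(\boldsymbol{A}(x)\big)
= \frac{\partial}{\partial x}\sum_{i} A_{ii}(x)
= \sum_{i}\frac{\partial A_{ii}(x)}{\partial x}
= \mathrm{tr}\!\left(\frac{\partial \boldsymbol{A}(x)}{\partial x}\right),
$$

which uses only the definition of the trace and the linearity of the derivative.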

One of the things that we don't discuss here (and that we also don't really want to get into) is that the trace is only defined for matrices. If we operate with tensors, the trace turns into a tensor contraction (where we sum out two dimensions).

Antymon commented 5 years ago

Clearly I have a problem understanding this interchangeability of linear operators in this setting. As I understand it, the trace of whatever is put in as an argument is a scalar. Is that incorrect? Is the trace of a derivative not necessarily a scalar? Derivatives of traces of matrices are of course matrices, hence the contradiction I run into when trying to apply this general rule.

mpd37 commented 5 years ago

That's not necessarily correct. The trace of a (DxDxExE) tensor is 1xExE (if you generalize the trace to a tensor contraction along the first two dimensions). Again, this is stuff I really don't want to get into. And that's why the trace of a derivative may not be a scalar.
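
As a concrete (illustrative, not from the book) sketch of that shape claim, using NumPy and reading the generalized trace as a contraction over the first two dimensions:

```python
import numpy as np

D, E = 3, 2
rng = np.random.default_rng(0)
T = rng.standard_normal((D, D, E, E))  # a (D x D x E x E) tensor

# Generalized "trace": contract (sum out) the first two dimensions,
# i.e. sum_i T[i, i, :, :].
contracted = np.einsum('iijk->jk', T)

print(contracted.shape)  # (E, E) -- a matrix, not a scalar
```

With a singleton leading axis kept, this is the 1xExE object mentioned above; either way it is not a scalar.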

Regarding exchangeability: https://en.wikipedia.org/wiki/Trace_%28linear_algebra%29

[screenshot of the relevant excerpt from the Wikipedia article]

mpd37 commented 5 years ago

Similar arguments apply to the transpose (which is also only defined for matrices, not for general tensors).
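
A tiny illustration of that point with NumPy (my own example, not from the book): a matrix has a single well-defined transpose, whereas a higher-order tensor only admits axis permutations, and there are several inequivalent ones.

```python
import numpy as np

A = np.ones((2, 3))
print(A.T.shape)                            # (3, 2): the matrix transpose is unambiguous

T = np.ones((2, 3, 4, 5))
print(np.transpose(T, (1, 0, 2, 3)).shape)  # (3, 2, 4, 5): one possible axis permutation
print(np.transpose(T, (3, 2, 1, 0)).shape)  # (5, 4, 3, 2): a different, equally valid one
```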

Antymon commented 5 years ago

Fair. Even with matrices, without going into tensors, I don't understand how this interchangeability can hold. But well, that's clearly not an issue with the book's correctness.

mpd37 commented 5 years ago

Effectively, the question is what to do about it. I agree that this issue causes confusion. I would like to keep this section in the book because it's useful. I'm considering adding a comment that points out these issues, referring to other sources for further clarification.

Would this be a solution for you as well?

Antymon commented 5 years ago

Ah, I was never after the whole section. It's just properties 5.113 and 5.114 that I don't get. If you have any sources that explain those in more detail, then surely there is a chance they would be helpful. But it seems I would personally need something much more elaborate than the likes of The Matrix Cookbook. You mention more books in that chapter: would any of those be helpful with the issues I am facing?

mpd37 commented 5 years ago

I'm not sure whether they will be more helpful, to be honest. I need to find one that makes sense and is semi-comprehensible.

Antymon commented 5 years ago

Btw, can we make any use of rule 5.113 at all without knowing the definition of the trace for tensors?

mpd37 commented 5 years ago

Not really. That's why I want to add a comment.

I could alternatively formulate things just for the scalar case, but the equations do hold for the vector-valued cases, too.
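
For what it's worth, here is a minimal numerical sanity check of the scalar-argument reading of 5.113 (a sketch with NumPy; the particular A(x) below is an arbitrary choice for illustration):

```python
import numpy as np

B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[0.5, -1.0], [2.0, 0.0]])

def A(x):
    # An arbitrary matrix-valued function of a scalar x.
    return x**2 * B + np.sin(x) * C

def dA(x):
    # Its elementwise derivative with respect to x.
    return 2 * x * B + np.cos(x) * C

x, h = 0.7, 1e-6
lhs = (np.trace(A(x + h)) - np.trace(A(x - h))) / (2 * h)  # d/dx tr(A(x)), central difference
rhs = np.trace(dA(x))                                      # tr(dA(x)/dx)
print(np.isclose(lhs, rhs))                                # True
```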

Antymon commented 5 years ago

Personally, if there were a star (footnote) next to 5.113 and 5.114 explaining that using them directly w.r.t. a matrix requires tensor-related definitions, that would probably be sufficient/less confusing for me. I was initially thinking I could apply one or the other when calculating derivatives without needing tensor-related definitions, and the fact that such definitions kept being needed led me to the conclusion that I must be doing something wrong/misunderstanding what is presented in the chapter.

mpd37 commented 5 years ago

Good point. What do you think about this remark:

[screenshot of the proposed remark]

Would this be helpful?

Antymon commented 5 years ago

Yes, I think it would be, at least for me.

mpd37 commented 5 years ago

Great! I'll fix this in the next revision and close this issue. Thanks a lot for making this a bit more clear.

Antymon commented 5 years ago

Likewise, thank you for your patience.