mantasu / cs224n

Solutions for CS224n (2022)

Assignment2Q1b)1): When is the gradient zero? #1

Open AlexSalinas99 opened 1 year ago

AlexSalinas99 commented 1 year ago

Hey Mantas!

I was looking at your solution for the assignment 2. Specifically question 1, b) 1) -> When is the gradient zero?

Your answers look good, since they can indeed yield a zero gradient. Nonetheless, I believe this won't hold in every case, as the gradient depends on all the outside words. I think a more accurate answer would cover the following three scenarios:

a) Trivial one: all outside word embeddings are zero vectors.

b) The error vector $(\hat{\mathbf{y}} - \mathbf{y}) = \mathbf{0}$, i.e., the predicted conditional probabilities exactly match the true distribution of outside words.

c) Rare one: the error vector, although non-zero, is orthogonal to the subspace spanned by all outside vectors.

In all three scenarios no update will ever occur, as the gradient will always be the zero vector. Happy to continue the discussion!
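To see that scenario c) is possible at least as a linear-algebra condition, here is a small NumPy sketch. It assumes the standard skip-gram setup from the assignment, where the gradient w.r.t. the center word vector is $U(\hat{\mathbf{y}} - \mathbf{y})$ with outside embeddings stored as columns of $U$; the dimensions and the constructed error vector are hypothetical, and the sketch ignores the constraint that the error vector must be a difference of probability vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 3, 5                    # hypothetical: embedding dim < vocab size
U = rng.normal(size=(d, V))    # columns are outside word embeddings

# Since d < V, the map delta -> U @ delta has a non-trivial null space.
# Pick a non-zero delta from it via the SVD: the trailing rows of vh
# span the null space of U.
_, _, vh = np.linalg.svd(U)
delta = vh[-1]                 # non-zero "error vector" with U @ delta = 0

grad = U @ delta               # gradient w.r.t. the center word vector
print(np.allclose(grad, 0))   # True: no update despite delta != 0
```

So whenever the embedding dimension is smaller than the vocabulary size, non-zero error vectors annihilated by $U$ exist; whether softmax can actually produce such a $\hat{\mathbf{y}}$ is a separate question.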

Alex

mantasu commented 1 year ago

Hi Alex,

Thanks for pointing this out. I believe I only missed c) in my answer, but indeed it can be made clearer and improved. Before I update it, there are a few things I need to clear up, so let me know what you think:

  1. I think a) can be generalized (similar to what I already have): some of the outside word embeddings are zero vectors (including the $o^{\text{th}}$ column), and, for those that are not, the corresponding components of $\hat{\mathbf{y}}$ are either zero (a rare softmax output of exactly 0) or specific probability values that cause each row of $U$, dotted with the error vector, to equal 0 (implying a kind of orthogonality).
  2. For c), are you sure it is okay to say the error vector is orthogonal to the outside vectors? Let's say $\boldsymbol{\delta}$ is the error vector and $U$ holds the outside word embeddings (each column is an outside word embedding). Since the gradient computes dot products not with the outside vectors themselves, i.e., not $U^{\top}\boldsymbol{\delta}$, but with their components, i.e., $U\boldsymbol{\delta}$, is it still okay to say that $\boldsymbol{\delta}$ is orthogonal to the subspace of the outside vectors, rather than to the "subspace of their components" (the rows of $U$)?
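To make the distinction concrete, here is a small sketch (hypothetical dimensions, random $U$). A zero gradient $U\boldsymbol{\delta} = \mathbf{0}$ says exactly that $\boldsymbol{\delta}$ is orthogonal to every *row* of $U$, i.e., $\boldsymbol{\delta} \in \operatorname{null}(U)$; the outside vectors themselves are the *columns* of $U$ and live in $\mathbb{R}^d$, a different space from $\boldsymbol{\delta} \in \mathbb{R}^{|V|}$ whenever $d \neq |V|$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 3, 5                      # hypothetical: d != V, so delta and the
U = rng.normal(size=(d, V))      # outside vectors live in different spaces

_, _, vh = np.linalg.svd(U)
delta = vh[-1]                   # non-zero vector with U @ delta = 0

# U @ delta stacks the dot products of delta with each *row* of U, so
# "zero gradient" means delta is orthogonal to every row of U:
row_dots = np.array([row @ delta for row in U])
print(np.allclose(row_dots, 0))  # True

# The outside vectors are the columns of U and have length d = 3, while
# delta has length V = 5, so "delta is orthogonal to the outside vectors"
# is not even well-typed here -- the precise statement is delta in null(U).
print(U[:, 0].shape, delta.shape)  # (3,) (5,)
```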