I also find this very weird. Can you explain how the trace norm actually gets minimized in "stage 1" of this idea?
I can't claim to intuitively understand the reason for Lemma 1, but it's a mathematical fact. It's actually what my previous PR does already since it has L2 regularization on both matrices. I think we may just have to add a truncated SVD on top of that?
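To make "L2 regularization on both matrices" concrete, here's a generic per-entry SGD update of the kind used in factored matrix completion (a sketch, not the PR's actual code; constant factors are folded into `lr`, and all names and values are illustrative):

```python
import numpy as np

def sgd_step(X, U, V, i, j, lr=0.01, lam=0.1):
    """One SGD step on observed entry (i, j): squared error plus
    L2 penalties on the corresponding row of U and column of V."""
    err = X[i, j] - U[i] @ V[:, j]
    u_i = U[i].copy()                     # keep the old row for V's update
    U[i]    += lr * (err * V[:, j] - lam * u_i)
    V[:, j] += lr * (err * u_i     - lam * V[:, j])
```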
One difference in the PR I submitted is that there are also "bias" terms.
Here's a better summary: https://pdfs.semanticscholar.org/presentation/8947/d651bb2fc5a1f7b4a30e9867624b2e478c4c.pdf
But more or less this supports the paper I posted before. One thing I don't quite get is how to decide on the actual rank to use. When you minimize the nuclear norm directly, the optimization picks the rank for you, but the SGD approach requires you to specify a regularization parameter, which only indirectly controls the rank. One option is to read an effective rank off the learned factors after training, as in the sketch below.
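A minimal sketch, assuming `U` and `V` are the trained factors (the relative cutoff is a judgment call, not something from the paper):

```python
import numpy as np

def effective_rank(U, V, rel_tol=1e-3):
    """Count singular values of U @ V above a relative cutoff.

    The nuclear-norm-style penalty only drives trailing singular values
    toward zero, not exactly to zero, so some threshold is needed.
    """
    s = np.linalg.svd(U @ V, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))
```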
Another paper that gets us low-rank solutions via deep linear networks:
https://arxiv.org/abs/1905.13655 https://www.offconvex.org/2019/06/03/trajectories/
One day I'll have enough time/energy to actually experiment with this...
https://arxiv.org/pdf/1710.09026v1.pdf
See Lemma 1. If we want to optimize something like:
|X - W|_F^2 + lambda * |W|_T,
where F is the Frobenius norm (taken only over the known entries) and T is the trace/nuclear norm, we can instead optimize:
|X - UV|_F^2 + (lambda/2) * (|U|_F^2 + |V|_F^2).
The identity behind this is |W|_T = min over factorizations W = UV of (1/2) * (|U|_F^2 + |V|_F^2), so note that the factor norms are squared.
I find this extremely weird, but there you have it. If W is sized m by n, then U is sized m by min(n, m), and V is sized min(n, m) by n. Once you're done training, you can compute the truncated SVD of
W = UV.
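Here's a rough NumPy sketch of fitting that factored objective by full-batch gradient descent on synthetic data (all hyperparameters and names are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, true_rank = 50, 40, 3
X_full = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
mask = (rng.random((m, n)) < 0.5).astype(float)   # 1 where an entry is observed
X = mask * X_full

k = min(m, n)                  # full inner dimension, as the lemma allows
U = 0.01 * rng.standard_normal((m, k))
V = 0.01 * rng.standard_normal((k, n))
lam, lr = 0.1, 0.002

for _ in range(5000):
    R = mask * (U @ V - X)     # residual on the known entries only
    # Gradients of |X - UV|_F^2 + (lam/2) * (|U|_F^2 + |V|_F^2)
    U, V = (U - lr * (2 * R @ V.T + lam * U),
            V - lr * (2 * U.T @ R + lam * V))
```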
One thing I don't get is how stage 2 actually works, but I don't think we would need it here. We could just do a single-layer neural network, optimize over U and V, and then take the truncated SVD of
W = UV
and see if it has good reconstruction properties? @iskandr what do you think?
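For concreteness, a minimal sketch of that truncated-SVD step, assuming `U` and `V` come out of the fit (the kept rank `r` is illustrative):

```python
import numpy as np

def truncate(U, V, r):
    """Truncated SVD of W = U @ V, keeping the top r singular values."""
    P, s, Qt = np.linalg.svd(U @ V, full_matrices=False)
    return P[:, :r] @ np.diag(s[:r]) @ Qt[:r]
```

You could then compare the truncated reconstruction against held-out entries to see whether it actually holds up.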