I also find this very weird. Can you explain how the trace norm actually gets minimized in "stage 1" of this idea?
I can't claim to intuitively understand the reason for Lemma 1, but it's a mathematical fact. It's actually what my previous PR does already since it has L2 regularization on both matrices. I think we may just have to add a truncated SVD on top of that?
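To make "L2 regularization on both matrices" concrete, here's a generic per-entry SGD update of the kind used in factored matrix completion (a sketch, not the PR's actual code; constant factors are folded into `lr`, and all names and values are illustrative):

```python
import numpy as np

def sgd_step(X, U, V, i, j, lr=0.01, lam=0.1):
    """One SGD step on observed entry (i, j): squared error plus
    L2 penalties on the corresponding row of U and column of V."""
    err = X[i, j] - U[i] @ V[:, j]
    u_i = U[i].copy()                     # keep the old row for V's update
    U[i]    += lr * (err * V[:, j] - lam * u_i)
    V[:, j] += lr * (err * u_i     - lam * V[:, j])
```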
One difference in the PR I submitted is that there are also "bias" terms.
Here's a better summary: https://pdfs.semanticscholar.org/presentation/8947/d651bb2fc5a1f7b4a30e9867624b2e478c4c.pdf
But more or less this supports the paper I posted before. One thing I don't quite get is how to decide on the actual rank to use. When you minimize the nuclear norm directly, the optimization picks the rank for you, but the SGD approach requires you to specify a regularization parameter, which only indirectly controls the rank. One option is to read an effective rank off the learned factors after training, as in the sketch below.
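A minimal sketch, assuming `U` and `V` are the trained factors (the relative cutoff is a judgment call, not something from the paper):

```python
import numpy as np

def effective_rank(U, V, rel_tol=1e-3):
    """Count singular values of U @ V above a relative cutoff.

    The nuclear-norm-style penalty only drives trailing singular values
    toward zero, not exactly to zero, so some threshold is needed.
    """
    s = np.linalg.svd(U @ V, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))
```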
Another paper that gets us low-rank solutions via deep linear networks:
https://arxiv.org/abs/1905.13655 https://www.offconvex.org/2019/06/03/trajectories/
One day I'll have enough time/energy to actually experiment with this...
https://arxiv.org/pdf/1710.09026v1.pdf
See Lemma 1. If we want to optimize something like:
|X - W|_F^2 + lambda * |W|_T,
where F is the Frobenius norm (taken only over the known entries) and T is the trace/nuclear norm, we can instead optimize:
|X - UV|_F^2 + (lambda/2) * (|U|_F^2 + |V|_F^2).
The identity behind this is |W|_T = min over factorizations W = UV of (1/2) * (|U|_F^2 + |V|_F^2), so note that the factor norms are squared.
I find this extremely weird, but there you have it. If W is sized m by n, then U is sized m by min(n, m), and V is sized min(n, m) by n. Once you're done training, you can compute the truncated SVD of
W = UV.
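Here's a rough NumPy sketch of fitting that factored objective by full-batch gradient descent on synthetic data (all hyperparameters and names are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, true_rank = 50, 40, 3
X_full = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
mask = (rng.random((m, n)) < 0.5).astype(float)   # 1 where an entry is observed
X = mask * X_full

k = min(m, n)                  # full inner dimension, as the lemma allows
U = 0.01 * rng.standard_normal((m, k))
V = 0.01 * rng.standard_normal((k, n))
lam, lr = 0.1, 0.002

for _ in range(5000):
    R = mask * (U @ V - X)     # residual on the known entries only
    # Gradients of |X - UV|_F^2 + (lam/2) * (|U|_F^2 + |V|_F^2)
    U, V = (U - lr * (2 * R @ V.T + lam * U),
            V - lr * (2 * U.T @ R + lam * V))
```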
One thing I don't get is how stage 2 actually works, but I don't think we would need it here. We could just do a single-layer neural network, optimize over U and V, and then take the truncated SVD of
W = UV
and see if it has good reconstruction properties? @iskandr what do you think?
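For concreteness, a minimal sketch of that truncated-SVD step, assuming `U` and `V` come out of the fit (the kept rank `r` is illustrative):

```python
import numpy as np

def truncate(U, V, r):
    """Truncated SVD of W = U @ V, keeping the top r singular values."""
    P, s, Qt = np.linalg.svd(U @ V, full_matrices=False)
    return P[:, :r] @ np.diag(s[:r]) @ Qt[:r]
```

You could then compare the truncated reconstruction against held-out entries to see whether it actually holds up.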