madeleineudell / LowRankModels.jl

LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models.

Applying model to new data #115

Open gdbeck opened 3 years ago

gdbeck commented 3 years ago

Hi there, this looks like a great package. I'm particularly interested in the ability to fit LRMs to datasets with missing data (or, in my case, outliers that need to be masked). I have a quick question that may be pretty basic, but an answer would help me apply the code to my own data. Apologies if I've missed something in the documentation. I'm also fairly new to Julia.

If I fit a PCA model to a set of training data A (following your example):

loss        = QuadLoss()
r           = ZeroReg()
n_comp      = 1
glrm        = GLRM(A,loss,r,r,n_comp)
X, Y, ch    = fit!(glrm)

how do I then apply the same model to a new set of data B? I would like to keep X fixed and obtain new values Y_b that give the best fit of X to B. That is, I would like to project the observations in B onto the PCA components found from A.

There are other PCA packages in Julia that will do this (e.g., the reconstruct function in MultivariateStats), but they don't seem to be able to handle missing data or sparse arrays.

Thanks in advance! Any help is appreciated!

mihirparadkar commented 3 years ago

Hi!

I want to first clarify the intent of the question. Let's say A is a matrix (or DataFrame/sparse matrix) of m rows by n columns. The GLRM (assuming real-valued or boolean-valued data for simplicity) produces a factor X holding one k-dimensional vector per row of A, and a factor Y of k rows by n columns, where k is the rank. (LowRankModels stores X as k rows by m columns, so the model is A ≈ X'Y.)

It sounds like you have another dataset B, of size p rows by n columns. B's projection onto the PCA components from A would be a matrix of size p rows by k columns. In PCA with no missing values and centered data, this would be a matrix multiplication (B * Y' * <a diagonal matrix>). However, that shortcut doesn't carry over to GLRMs in general, because the formula is only correct for a quadratic (least-squares) loss with every entry observed.
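To make the closed form concrete, here is a sketch of the least-squares projection, assuming quadratic loss, no regularization, no missing entries, and a Y with full row rank (the symbol X_B is introduced here for illustration):

```latex
X_B \;=\; \arg\min_{X \in \mathbb{R}^{p \times k}} \lVert B - X Y \rVert_F^2
    \;=\; B \, Y^\top \left( Y Y^\top \right)^{-1}
```

When the rows of Y are mutually orthogonal, as PCA components are, the k-by-k matrix (Y Yᵀ)⁻¹ is diagonal, which is the diagonal factor mentioned above.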

With LowRankModels, the easiest way to do this is to fit another GLRM while holding Y constant. You can do this like so:

loss         = QuadLoss()   # Or whatever loss you chose before
r_x          = ZeroReg()    # Or whichever regularizer you want on X
r_y          = [FixedLatentFeaturesConstraint(Y[:, i]) for i=1:size(Y, 2)]  # Pin each column of Y to its value from the first fit
n_comp       = 1            # Must match the rank used when fitting A
glrm_b       = GLRM(B, loss, r_x, r_y, n_comp)
X_b, Y, ch   = fit!(glrm_b)

If you want to calculate a new Y matrix instead of a new X matrix, keep r_y as whatever regularizer you used before (r in your example), and define r_x = [FixedLatentFeaturesConstraint(X[:, i]) for i=1:size(X, 2)].
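For intuition about what fitting with Y held constant computes in the simplest case: with quadratic loss and no regularization on X, the problem should separate into one least-squares solve per row of B, restricted to that row's observed entries. The sketch below uses only Julia's standard library and does not call LowRankModels. The helper `project_rows` and the Boolean `mask` argument are hypothetical names introduced here, and the result is laid out as p rows by k columns, following the description above.

```julia
using LinearAlgebra

# Hypothetical helper (not part of LowRankModels): for each row of B, solve a
# least-squares problem against the fixed k x n factor Y, using only the
# entries marked as observed in `mask`.
function project_rows(B::AbstractMatrix, Y::AbstractMatrix, mask::AbstractMatrix{Bool})
    p, n = size(B)
    k = size(Y, 1)
    size(Y, 2) == n || throw(DimensionMismatch("Y must have as many columns as B"))
    size(mask) == size(B) || throw(DimensionMismatch("mask must have the same size as B"))
    Xb = zeros(Float64, p, k)
    for i in 1:p
        obs = findall(mask[i, :])      # column indices observed in row i
        isempty(obs) && continue       # nothing observed: leave the row at zero
        # Least-squares solution of Y[:, obs]' * x ≈ B[i, obs]
        # (the solution is unique only with at least k informative observed entries)
        Xb[i, :] = permutedims(Y[:, obs]) \ B[i, obs]
    end
    return Xb
end

# Toy example: rank-1 data where one outlier is masked out.
Y = [1.0 2.0 3.0 4.0]                  # 1 x 4 fixed factor
B = [2.0 4.0 6.0   8.0;
     1.0 2.0 100.0 4.0]                # row 2 has an outlier in column 3
mask = trues(2, 4)
mask[2, 3] = false                     # treat the outlier as missing
Xb = project_rows(B, Y, mask)          # 2 x 1 matrix of coefficients
```

In this toy case the masked outlier has no effect on the second row's coefficient, which is the behavior the original question was after. For other losses or regularizers there is no such shortcut, and refitting a GLRM with the fixed-factor constraint remains the general route.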

gdbeck commented 3 years ago

That works perfectly! Thank you very much for your help, and for replying so quickly!