madeleineudell / LowRankModels.jl

LowRankModels.jl is a julia package for modeling and fitting generalized low rank models.

Penalties/regularization of imputed values #37

Closed ahwillia closed 8 years ago

ahwillia commented 9 years ago

In many experimental contexts (RNA-sequencing is a good example), ground truth data that are below a detection threshold are not observed due to technical error. Thus, we have a data matrix A with many NA entries. However, we suspect many of these NA entries to be small, though not necessarily zero.

For each observed entry A_ij we have a loss function: L[A_ij, dot(x_i,y_j)]

Maybe for each unobserved entry we could add some regularization: R_a[dot(x_i,y_j)]

Where R_a is a function defined by the user... Perhaps R_a[z] = z^2 or R_a[z] = abs(z)

Aside: For the RNA sequencing application, something like R_a[z] = sqrt(z), z > 0 would be interesting (though not convex). Actually even something non-monotonic would be interesting for reasons I won't go into. These are probably too weird/specialized to include, but I would be curious.
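The proposed objective — the usual loss on observed entries plus a penalty R_a on the model's fit to unobserved entries — could be sketched as follows. This is just an illustration in Python (not the package's API), with a tiny hypothetical matrix and R_a(z) = z^2 as suggested:

```python
import numpy as np

def objective(X, Y, A, observed, loss, na_reg):
    """Loss on observed entries plus a penalty R_a on unobserved ones."""
    Z = X @ Y  # low-rank estimate, Z[i, j] = dot(x_i, y_j)
    total = 0.0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if observed[i, j]:
                total += loss(A[i, j], Z[i, j])
            else:
                total += na_reg(Z[i, j])  # e.g. R_a(z) = z**2 pulls unobserved fits toward 0
    return total

# tiny made-up example
A = np.array([[1.0, 0.0], [0.0, 2.0]])
observed = np.array([[True, False], [False, True]])
X = np.array([[1.0], [2.0]])   # m x k row factors
Y = np.array([[1.0, 0.5]])     # k x n column factors
quad_loss = lambda a, z: (a - z) ** 2
na_reg = lambda z: z ** 2
print(objective(X, Y, A, observed, quad_loss, na_reg))  # prints 5.25
```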

madeleineudell commented 9 years ago

Interesting. I'd call that a loss function rather than a regularization on unobserved entries. I'd say your observations take values in the set $S = \mathbb{R} \cup \{\mathrm{NA}\}$ and that your loss function is a function $L: S \times \mathbb{R} \to \mathbb{R}$. Then $L(\mathrm{NA}, z)$ has some functional form and $L(a, z)$ has a different functional form for $a \ne \mathrm{NA}$.

This happens frequently in a variety of settings (recommender systems come to mind): missing entries are not missing iid; rather, missingness tells you something about the unseen value. In the recommender system context, someone who didn't rate a movie probably didn't watch the movie (or rate it) because they didn't think they'd like the movie. And often, they're right.

Putting together an example illustrating how to deal with this would be super useful.


ahwillia commented 9 years ago

Exactly -- the key piece is that we would want a loss function with a different functional form for NA entries. Otherwise, you could just impute A appropriately.

Some thoughts/ideas:

type quadratic<:DiffLoss
    scale::Float64
    domain::Domain
    na_val::Float64 # what to replace NA values with for this column
end
# Add another grad function (without a::Number given)
grad(l::quadratic, u::Float64) = (u-l.na_val)*l.scale
type ProxGradParams<:AbstractParams
    ...
    use_na_loss::Bool
end
# compute gradient of L with respect to Yⱼ over unobserved values:
if params.use_na_loss
    for e in setdiff(1:m,glrm.observed_examples[f])
        axpy!(grad(na_losses[f],XY[e,f]), ve[e], g)
    end
end

Happy to implement something along these lines if you think it makes sense and fits within the scope of the project.

madeleineudell commented 9 years ago

That seems more complex than necessary. We can do it without touching anything in the code other than defining a new loss.

It might look like this:

type NALoss<:Loss
  na_val
  na_loss::Loss
  obs_loss::Loss
end

function evaluate(l::NALoss, u, a)
  if isna(a)
    return evaluate(l.na_loss, u, l.na_val)
  else
    return evaluate(l.obs_loss, u, a)
  end
end

etc.

It might not even need to go into the codebase but just into an example... What do you think?
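The wrapper idea above — one loss type that delegates to an "observed" loss or an "NA" loss depending on the entry — can be demonstrated in a few lines. This is a Python analogue of the Julia sketch, with NA encoded as NaN and all names illustrative:

```python
import math

class NALoss:
    """Wraps two losses: obs_loss is applied to observed entries; na_loss
    is applied at a fixed na_val when the entry is missing (NaN here)."""
    def __init__(self, na_val, na_loss, obs_loss):
        self.na_val = na_val
        self.na_loss = na_loss
        self.obs_loss = obs_loss

    def evaluate(self, u, a):
        if isinstance(a, float) and math.isnan(a):  # NA encoded as NaN
            return self.na_loss(u, self.na_val)
        return self.obs_loss(u, a)

quad = lambda u, a: (u - a) ** 2
l1 = lambda u, a: abs(u - a)
loss = NALoss(0.0, l1, quad)  # pull fits to unobserved entries toward 0 via |u|
print(loss.evaluate(1.5, 2.0))           # observed: (1.5 - 2)^2 = 0.25
print(loss.evaluate(1.5, float("nan")))  # missing: |1.5 - 0| = 1.5
```

The appeal of this design, as noted above, is modularity: the fitting code never needs to know about NA handling, since it is encapsulated entirely in the loss.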

ahwillia commented 8 years ago

Very nice! Much cleaner than what I suggested. I think it would be nice to put it into the codebase, but that call is above my pay grade.

madeleineudell commented 8 years ago

Sure, I think it's fine to put this loss-modifying version in the codebase since it's still essentially modular. But an example / documentation will be critical. Would you be able to take the lead on that?


ahwillia commented 8 years ago

Sure, I'll have something ready by Sunday. I'll just add a bit of documentation to the README? I could also make a quick Jupyter notebook...

madeleineudell commented 8 years ago

I think it would be great to have both, particularly the notebook.

ahwillia commented 8 years ago

Interesting... I have the basic concept working, but I'm having a bit of trouble coming up with a good example for when this produces a better model (see notebook below). Do you know any relevant papers I might refer to?

https://github.com/ahwillia/LowRankModels.jl/blob/na_loss/examples/NALoss.ipynb

madeleineudell commented 8 years ago

No papers that I know of, but if we find something good we could write one :)

Try a problem with much higher sparsity. The extreme case is an example where observations are Boolean but only (some of) the +1s are observed, like the Censored PCA example (figure 6 of the Low Rank Models paper, http://arxiv.org/abs/1410.0342). I'm quite curious how the NALoss compares with extreme regularization in this case.
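Data for that extreme case — a Boolean matrix where only a subset of the +1 entries are ever observed — could be simulated along these lines (a Python sketch with made-up dimensions, not tied to the package):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 15, 2
X_true = rng.normal(size=(m, k))
Y_true = rng.normal(size=(k, n))
A = np.sign(X_true @ Y_true)  # Boolean (+1/-1) ground-truth matrix, low rank

# Censored observations: only some of the +1 entries are seen;
# every -1 entry (and the remaining +1s) shows up as NA.
observed = (A > 0) & (rng.random((m, n)) < 0.8)
print(observed.sum(), "of", (A > 0).sum(), "positive entries observed")
```

On such data an NA-aware loss can exploit the fact that a missing entry is more likely to be a -1, whereas a model fit only to the observed entries sees nothing but +1s.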


ahwillia commented 8 years ago

It works now. Here is an example for the Boolean situation you suggested:

http://nbviewer.ipython.org/github/ahwillia/LowRankModels.jl/blob/na_loss/examples/NALoss_boolean.ipynb

ahwillia commented 8 years ago

nbviewer seems to be having issues. You can also see the notebook here: https://github.com/ahwillia/LowRankModels.jl/blob/na_loss/examples/NALoss_boolean.ipynb

madeleineudell commented 8 years ago

Did you want to put this notebook in the repo? eg, create a jupyter_notebook_examples directory and put it in there, then make a PR on master? I'd like to close this issue.

ahwillia commented 8 years ago

Let's close this for now. I have a collection of notebooks I can show you at some point in January. Not sure what makes the most sense to do with them, but we can either put them in here or have a separate repo.

madeleineudell commented 8 years ago

Sounds good.
