joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License

implement sparse PCA/dictionary-encoding #254

Open ExpandingMan opened 1 year ago

ExpandingMan commented 1 year ago

I'm interested in implementing sparse dictionary learning and sparse PCA. These techniques are implemented in scikit-learn; see here and the references therein.

I might take this on if I deem it a sufficiently short project.

joshday commented 1 year ago

Looks cool! If you go for it, let me know if you have any questions.

ExpandingMan commented 1 year ago

Alright, I've read those papers this morning and have some thoughts.

Both of these methods essentially amount to a matrix factorization of the training data, $X \approx UV$, with various constraints on the form of these matrices, in particular that they be sparse (or just that $V$ be sparse). As far as I can tell, the only real difference between the two methods is that "sparse PCA" constrains $V^{T}V=1$ whereas "dictionary learning" provides no such guarantee. The authors of the latter imply this is an advantage, though at first glance it seems to me like a disadvantage. Dictionary learning, at least in that particular paper, is explicitly an online method, whereas it's not obvious to me how to do sparse PCA without a full batch. My first guess would probably be to see whether I could adapt the dictionary learning algorithm presented in that paper to the orthogonality constraint, but I've not thought this through.
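To make that concrete, my reading of both papers is that the shared objective is roughly

$$
\min_{U,\,V}\; \tfrac{1}{2}\,\lVert X - UV \rVert_F^2 \;+\; \lambda\,\lVert V \rVert_1 ,
$$

where $X$ is the matrix of training data and $\lambda$ controls the sparsity, together with some norm constraint on $U$ to remove the scaling ambiguity; sparse PCA then additionally imposes the $V^{T}V=1$ condition mentioned above. (The notation and the exact constraint set here are mine, not lifted verbatim from either paper.)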

So, I'm still seriously considering implementing dictionary learning, as it should be relatively straightforward. The only caveat, as far as a PR to OnlineStats.jl goes, is that it necessarily involves an $\ell_1$-regularized regression for the coding step. I think it would be wiser to use Lasso.jl or something similar for that rather than attempt to implement it from scratch, though that would mean adding a dependency.
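Just to make that subproblem concrete, here is a minimal sketch of the $\ell_1$-regularized coding step via plain proximal gradient (ISTA). In an actual PR this is exactly the piece Lasso.jl (or similar) would replace; the function name, penalty value, and sizes below are just placeholders.

```julia
using LinearAlgebra

# Soft-thresholding operator: the proximal map of t * ‖·‖₁.
soft_threshold(z, t) = sign(z) * max(abs(z) - t, zero(z))

# Solve  minimize over α:  0.5‖x - Dα‖² + λ‖α‖₁  by plain proximal gradient (ISTA).
function sparse_code(D::AbstractMatrix, x::AbstractVector, λ::Real; iters::Integer = 200)
    L = opnorm(D)^2                      # Lipschitz constant of the smooth part's gradient
    α = zeros(size(D, 2))
    for _ in 1:iters
        g = D' * (D * α - x)             # gradient of 0.5‖x - Dα‖²
        α = soft_threshold.(α .- g ./ L, λ / L)
    end
    return α
end

# Toy usage: code one 20-dimensional sample against a random 50-atom dictionary.
D = randn(20, 50)
foreach(normalize!, eachcol(D))          # unit-norm atoms (columns)
x = randn(20)
α = sparse_code(D, x, 0.5)
count(!iszero, α)                        # number of active atoms
```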

What dictionary learning would offer over anything that's currently available in OnlineStats.jl is a dimensionality reduction technique with a sparse transformation matrix, which would be useful since one can easily wind up with a fairly enormous matrix for high-dimensional training data.

Therefore, I think, assuming I follow through and do this, I will create a new package which has OnlineStats.jl as a dependency. Then, if and when I finish, I will post a link here and you can review, and if you'd like to merge it into OnlineStats.jl I'd be happy to oblige.

joshday commented 1 year ago

Sounds like a good plan to me! FYI you may be able to get away with just OnlineStatsBase as a dependency.
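Roughly, all a new stat needs from OnlineStatsBase is to subtype `OnlineStat` and implement `_fit!` (with `value` and `nobs` falling back to fields). A rough skeleton, where the type name, fields, and the dictionary update are placeholders and `sparse_code` refers to the sketch above:

```julia
using OnlineStatsBase, LinearAlgebra

mutable struct DictLearner <: OnlineStat{AbstractVector}
    D::Matrix{Float64}   # dictionary, one atom per column (placeholder layout)
    λ::Float64           # ℓ₁ penalty used in the coding step
    n::Int               # observation count
end
DictLearner(d::Integer, k::Integer; λ = 0.1) = DictLearner(randn(d, k), λ, 0)

function OnlineStatsBase._fit!(o::DictLearner, x)
    o.n += 1
    α = sparse_code(o.D, x, o.λ)            # hypothetical ℓ₁ solver from the sketch above
    o.D .+= ((x .- o.D * α) * α') ./ o.n    # crude rank-one dictionary step, placeholder only
    foreach(normalize!, eachcol(o.D))       # keep atoms at unit norm
end

OnlineStatsBase.value(o::DictLearner) = o.D

# Hypothetical usage: stream 10-dimensional observations into a 25-atom dictionary.
# o = DictLearner(10, 25)
# for x in eachcol(randn(10, 1000))
#     fit!(o, x)
# end
```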

ExpandingMan commented 1 year ago

Alright, I've lost a ton of interest in this method, as I've realized that it probably really sucks for dimensionality reduction in most cases. It looks like a really good compression method, but since it achieves its compression only through sparsity, you tend to end up with more dimensions than you started with. It might work well for dimensionality reduction in some cases if the initial conditions are chosen appropriately, but I haven't discovered how to do that yet.

Anyway, I thought I should note here that I have OnlineSparseLearning.jl up, which seems to be basically working but is certainly not usable yet. I may or may not finish this depending on my needs.