rai-llc / LanguageModels.jl

Load nanoGPT-style transformers in Julia. Code ported from @karpathy's llama2.c

Implement Fisher-weighted SVD compression #8

Open jiahao opened 1 year ago

jiahao commented 1 year ago

Paper references (Thanks @evanmiller)

An explicit formula for the Fisher weight matrix for the cross-entropy loss (the loss used in LLaMA) is given in §A.2 of
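
For orientation, the standard empirical diagonal Fisher (which the cited appendix presumably specializes to cross-entropy; the exact form there may differ) averages squared per-example gradients of the loss $\mathcal{L}$ over $N$ training examples $d_n$:

$$\hat{I}_{w} = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \mathcal{L}(d_n)}{\partial w} \right)^{2},$$

giving the weighted factorization objective $\min_{A,B} \lVert \sqrt{\hat{I}} \odot (W - AB) \rVert_F$, where $\odot$ is the elementwise (Hadamard) product.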

evanmiller commented 1 year ago

See also their patent application https://patents.google.com/patent/US20230106213A1/en

jiahao commented 1 year ago

It looks like the Hadamard-weighted generalization of the SVD has already been solved in the nearest correlation matrix literature, for which a Newton method is available. That should be much more efficient than gradient descent or projected gradient descent.
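
For concreteness, here is a minimal Julia sketch of the closed-form baseline that such iterative methods would improve on: the row-sum relaxation used in Fisher-weighted SVD, which collapses the elementwise weights to one weight per row so an ordinary SVD solves the problem exactly. The function name and the damping term are illustrative, not part of this package:

```julia
using LinearAlgebra

# Rank-r Fisher-weighted SVD via the row-sum relaxation: replace the
# elementwise weights sqrt.(F) with per-row weights d[i] = sqrt(sum(F[i, :])),
# so min ‖D * (W - A*B)‖_F has a closed-form solution from the SVD of D*W.
function fisher_weighted_svd(W::AbstractMatrix, F::AbstractMatrix, r::Integer)
    d = sqrt.(vec(sum(F; dims = 2)) .+ eps())  # per-row weights; eps() keeps D invertible
    D = Diagonal(d)
    U, S, V = svd(D * W)                       # ordinary SVD of the reweighted matrix
    A = D \ (U[:, 1:r] * Diagonal(S[1:r]))     # undo the row scaling
    B = V[:, 1:r]'
    return A, B                                # W ≈ A * B with rank ≤ r
end
```

The fully elementwise-weighted problem has no closed form, which is where the Newton method above would come in.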

jiahao commented 1 year ago

Adding also the related GPTQ method for quantization, which uses the data Hessian, $XX'$, in effect computing the same empirical Fisher-based weight matrix, except for the least-squares loss instead. They use an ad hoc modified Cholesky factorization that could probably be improved upon with a more robust implementation.

https://github.com/IST-DASLab/gptq
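
To make the Hessian's role concrete, a rough OBQ-style sketch of the per-row update (hedged: `round` stands in for GPTQ's actual grid quantizer, and GPTQ's blocked Cholesky-of-the-inverse implementation is replaced here by a plain damped Cholesky):

```julia
using LinearAlgebra

# Quantize one weight row w (length d) against calibration inputs X (d × n),
# propagating each quantization error through the inverse data Hessian.
function quantize_row!(w::Vector{Float64}, X::Matrix{Float64}; λ = 1e-2)
    H = X * X' + λ * I                   # damped data Hessian for the least-squares loss
    Hinv = inv(cholesky(Hermitian(H)))   # inverse via Cholesky factorization
    q = similar(w)
    for j in eachindex(w)
        q[j] = round(w[j])               # stand-in for the real quantizer
        err = (w[j] - q[j]) / Hinv[j, j]
        # push the quantization error onto the weights not yet quantized
        w[j+1:end] .-= err .* Hinv[j, j+1:end]
    end
    return q
end
```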