Open jiahao opened 1 year ago
See also their patent application https://patents.google.com/patent/US20230106213A1/en
It looks like the Hadamard-weighted generalization of the SVD has already been addressed in work on the nearest correlation matrix problem, for which a Newton method is available. That should be much more efficient than gradient descent or projected gradient descent.
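For concreteness, here is a minimal NumPy sketch of the Hadamard-weighted low-rank objective and the plain gradient-descent baseline mentioned above. It is not the Newton method from the nearest correlation matrix literature. The function names (`weighted_loss`, `truncated_svd_factors`, `gradient_step`), the random test data, and the step size are all assumptions made for illustration.

```python
import numpy as np

def weighted_loss(A, W, U, V):
    """Hadamard-weighted squared error sum_ij W_ij * (A - U V^T)_ij^2, with W >= 0."""
    R = A - U @ V.T
    return float(np.sum(W * R**2))

def truncated_svd_factors(A, r):
    """Rank-r factors (U, V) with U V^T equal to the truncated SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r].T

def gradient_step(A, W, U, V, lr):
    """One simultaneous gradient-descent step on (U, V)."""
    R = W * (A - U @ V.T)  # elementwise-weighted residual
    # d(loss)/dU = -2 R V and d(loss)/dV = -2 R^T U
    return U + lr * 2.0 * (R @ V), V + lr * 2.0 * (R.T @ U)

# Hypothetical test data: a random matrix and random positive weights.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 15))
W = rng.uniform(0.1, 1.0, size=A.shape)
U0, V0 = truncated_svd_factors(A, 3)

# With uniform weights the truncated SVD is already a stationary point.
U1, V1 = gradient_step(A, np.ones_like(A), U0, V0, lr=1e-3)
uniform_is_stationary = np.allclose(U1, U0) and np.allclose(V1, V0)

# With non-uniform weights, descent from the SVD start lowers the weighted loss.
loss_svd = weighted_loss(A, W, U0, V0)
U, V = U0, V0
for _ in range(500):
    U, V = gradient_step(A, W, U, V, lr=1e-3)
loss_gd = weighted_loss(A, W, U, V)
```

With uniform weights this reduces to the ordinary SVD, so only the non-uniform case needs an iterative solver, and that is where a second-order method should pay off.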
Also adding the related GPTQ method for quantization, which uses the data Hessian, $XX'$. In effect it computes the same empirical Fisher-based weight matrix, but for the least-squares loss instead. It uses an ad hoc modified Cholesky factorization, which could probably be improved with a more robust implementation.
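To make the connection concrete, the sketch below builds the data Hessian $XX'$, checks that the least-squares layer loss $\|\Delta W X\|_F^2$ equals $\mathrm{tr}(\Delta W\, XX'\, \Delta W')$, and factors the Hessian with a diagonally damped Cholesky that retries with a larger shift on failure. This is not GPTQ's actual code. The helper names, the `rel_damp` default, and the tenfold retry schedule are assumptions for illustration only.

```python
import numpy as np

def data_hessian(X):
    """X has shape (d_in, n_samples); returns the d_in x d_in matrix X X^T."""
    return X @ X.T

def layer_loss(dW, H):
    """tr(dW H dW^T), which equals ||dW X||_F^2 when H = X X^T."""
    return float(np.trace(dW @ H @ dW.T))

def damped_cholesky(H, rel_damp=0.01, max_tries=10):
    """Cholesky factor of H + lam*I, raising lam tenfold until it succeeds.

    The initial shift is rel_damp times the mean diagonal of H (an assumed
    heuristic). Returns the lower-triangular factor and the shift used.
    """
    d = H.shape[0]
    lam = rel_damp * float(np.mean(np.diag(H)))
    if lam <= 0.0:
        lam = rel_damp  # fallback for an all-zero diagonal
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(H + lam * np.eye(d)), lam
        except np.linalg.LinAlgError:
            lam *= 10.0
    raise np.linalg.LinAlgError("damped Cholesky failed to converge")

# Hypothetical data: fewer samples than input features, so X X^T is singular.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
dW = rng.standard_normal((3, 8))  # perturbation of a 3 x 8 weight matrix

H = data_hessian(X)
loss_direct = float(np.sum((dW @ X) ** 2))
loss_hessian = layer_loss(dW, H)
L, lam = damped_cholesky(H)
```

The damping shift is what makes the factorization usable when the calibration set is small. A pivoted or otherwise rank-revealing factorization would be another option to consider.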
Paper references (Thanks @evanmiller)
An explicit formula for the Fisher weight matrix for the cross-entropy loss (which is used in llama) is given in §A.2 of