Open pudepiedj opened 7 months ago
Hi, I tried to implement the method. However, the key operation (SVD) is still not supported by MLX. Do you have any ideas?
Should be ready soon https://github.com/ml-explore/mlx/pull/809, although that will only run on the CPU so it may be too slow depending on how often you use it.
Hi, I have already implemented GaLore on Apple Silicon, and it does reduce memory usage. However, because SVD cannot run on the GPU, it's slow. Are there any documents on how to accelerate SVD on the GPU?
@LiuChaoXD issue 809, referred to in the response above your comment, was merged into mlx main. So mlx now has an SVD and you can implement the GaLore optimizer in mlx.
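For anyone picking this up, here is a minimal sketch of the call pattern. The MLX names in the comments are my assumption based on the merged PR (and the op is CPU-only), so the decomposition itself is demonstrated with NumPy, whose `svd` MLX mirrors:

```python
import numpy as np

# Assumed MLX call pattern (names from mlx.core.linalg; the SVD must be
# dispatched to the CPU stream since there is no GPU kernel yet):
#     import mlx.core as mx
#     U, S, Vt = mx.linalg.svd(A, stream=mx.cpu)
#
# The same thin SVD with NumPy, so this sketch runs anywhere:
A = np.arange(12, dtype=np.float64).reshape(4, 3)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# U (4x3), S (3,), Vt (3x3); U @ diag(S) @ Vt reconstructs A
reconstructed = U @ np.diag(S) @ Vt
```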
> Are there any documents on how to accelerate SVD on the GPU?
I would start by learning about parallel implementations of SVD in general (maybe search for how one would do this in CUDA, for example).
> Are there any documents on how to accelerate SVD on the GPU?
I found this chapter on parallelising SVD: https://www.irisa.fr/sage/bernard/publis/SVD-Chapter06.pdf
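One practical route, since the bottleneck is a dense SVD: randomized SVD (as in the Halko–Martinsson–Tropp line of work that the chapter also touches on) reduces the cost to a few large matrix multiplies plus a thin QR and a small SVD, and the matmuls are exactly what runs well on the GPU. A NumPy sketch (the function name and parameters are mine, not from MLX):

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, n_iter=2, seed=0):
    """Approximate top-`rank` SVD of A using only matmuls, QR, and a small SVD."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + n_oversample, n)
    # Sketch the column space of A with a random test matrix
    Omega = rng.standard_normal((n, k))
    Y = A @ Omega                                  # (m, k)
    # Power iterations sharpen the spectrum of the sketch
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis, (m, k)
    B = Q.T @ A                                    # small (k, n) matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], S[:rank], Vt[:rank]

# toy check: a matrix of exact rank 4 is recovered to machine precision
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 30))
U, S, Vt = randomized_svd(M, rank=4)
approx = U @ np.diag(S) @ Vt
```

Only the small `(k, n)` SVD needs a dense solver; everything else is GPU-friendly, which is why this is a common workaround when the exact SVD is CPU-bound.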
I have just read the very recent paper [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](arXiv:2403.03507), which allows Llama 7B training to run on an RTX 4090 in 24 GB of VRAM. On the face of it this suggests that the same technique could be applied on Apple Silicon, but I don't know enough about the relative parameters of the 4090 and the M1, M2, and M3 chips to know how tractable that is, or whether the MLX project yet supports sufficient Torch-like ops. Can anyone comment? Should a 32 GB (38-core) M2 Max be able to do the same thing? How long would it take?
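To make the technique concrete, here is a minimal NumPy sketch of GaLore's core idea (NumPy stands in for MLX; `galore_project` and the dimensions are illustrative, not the paper's code): the gradient is projected onto a low-rank subspace from its SVD, the optimizer state lives in that small space, and the update is projected back to full size for the weight step.

```python
import numpy as np

def galore_project(grad, rank):
    """Return the projector P: the top-`rank` left singular vectors of the gradient."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                      # shape (m, rank)

# toy step: optimizer state is kept at rank*n floats instead of m*n,
# which is where the memory saving comes from
m, n, r = 64, 32, 4
rng = np.random.default_rng(0)
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # low-rank gradient

P = galore_project(G, r)
low_rank_grad = P.T @ G                     # (r, n): what the optimizer sees
update = P @ low_rank_grad                  # projected back to (m, n)
```

Since the projector is recomputed only every few hundred steps in the paper, a CPU-only SVD may be tolerable even before a GPU kernel exists; the per-step cost is just the two small matmuls.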