ml-explore / mlx-examples

Examples in the MLX framework
MIT License

GaLore process on Apple Silicon? #556

Open pudepiedj opened 7 months ago

pudepiedj commented 7 months ago

I have just read the very recent paper [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507) (arXiv:2403.03507), which reports that Llama 7B training can be run on an RTX 4090 with 24GB of VRAM. On the face of it, this suggests that the same technique could be applied on Apple Silicon, but I don't know enough about the relative specifications of the 4090 and the M1, M2, and M3 chips to know how tractable that is, or whether the MLX project yet supports enough Torch-like ops.

Can anyone comment? Should a 32GB (38-core) M2 Max be able to do the same thing? How long would it take?

LiuChaoXD commented 7 months ago

Hi, I tried to implement the method. However, the key operation (SVD) is not yet supported by MLX. Do you have any ideas?

awni commented 7 months ago

Should be ready soon: https://github.com/ml-explore/mlx/pull/809. It will only run on the CPU, though, so it may be too slow depending on how often you use it.

LiuChaoXD commented 7 months ago

Hi, I have already implemented GaLore on Apple Silicon. It does reduce memory usage. However, because the SVD cannot be run on the GPU, it's slow. I am wondering whether there are any documents about how to accelerate SVD by GPU?

MichelNivard commented 7 months ago

@LiuChaoXD PR #809, referred to in the response above your comment, has been merged into MLX main. So MLX now has an SVD, and you can implement the GaLore optimizer in MLX.
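
For anyone following along, here is a rough sketch of where the SVD sits in a GaLore-style optimizer step. It is not LiuChaoXD's implementation or the paper's reference code. The assumptions: NumPy stands in for MLX so the sketch runs anywhere, only the left projection is shown, and the class name `GaLoreAdamSketch` and the hyperparameters `rank`, `update_gap`, and `scale` are illustrative names chosen here. In MLX the `np.linalg.svd` call would be swapped for MLX's own SVD (exposed as `mx.linalg.svd`, as far as I know), which per the comment above runs on the CPU.

```python
import numpy as np


class GaLoreAdamSketch:
    """Adam whose moment buffers live in a rank-r projection of the gradient.

    Illustrative only: left projection, one 2-D weight matrix, no weight decay.
    """

    def __init__(self, shape, rank=4, update_gap=200, lr=1e-3, scale=0.25,
                 betas=(0.9, 0.999), eps=1e-8):
        m, n = shape
        assert rank <= min(m, n), "rank must not exceed the matrix dimensions"
        self.rank, self.update_gap = rank, update_gap
        self.lr, self.scale, self.betas, self.eps = lr, scale, betas, eps
        self.P = None                    # (m, rank) projection matrix
        self.m1 = np.zeros((rank, n))    # first moment, stored at low rank
        self.m2 = np.zeros((rank, n))    # second moment, stored at low rank
        self.t = 0

    def step(self, W, G):
        # The expensive part: refresh the projection from an SVD of the
        # gradient, but only once every `update_gap` steps.
        if self.t % self.update_gap == 0:
            U, _, _ = np.linalg.svd(G, full_matrices=False)
            self.P = U[:, :self.rank]
        self.t += 1

        R = self.P.T @ G                 # project the gradient: (rank, n)
        b1, b2 = self.betas
        self.m1 = b1 * self.m1 + (1 - b1) * R
        self.m2 = b2 * self.m2 + (1 - b2) * R ** 2
        m_hat = self.m1 / (1 - b1 ** self.t)
        v_hat = self.m2 / (1 - b2 ** self.t)
        N = m_hat / (np.sqrt(v_hat) + self.eps)

        # Project the normalized update back to full size and apply it.
        return W - self.lr * self.scale * (self.P @ N)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
G = rng.standard_normal((8, 16))
opt = GaLoreAdamSketch(W.shape, rank=2, update_gap=5)
W_new = opt.step(W, G)
```

The memory saving comes from `m1` and `m2` having shape `(rank, n)` instead of `(m, n)`. Because the SVD only runs every `update_gap` steps, a CPU-only SVD hurts less the larger that gap is, which matches the "depending on how often you use it" caveat above.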

awni commented 7 months ago

> I am wondering whether there are any documents about how to accelerate SVD by GPU?

I would start by learning about parallel implementations of SVD in general (for example, try searching for how this is done in CUDA).

MichelNivard commented 7 months ago

> are any documents about how to accelerate SVD by GPU?

I found this chapter on parallelising SVD: https://www.irisa.fr/sage/bernard/publis/SVD-Chapter06.pdf
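
To give a flavour of what a parallel-friendly SVD can look like, below is a minimal pure-Python sketch of the one-sided Jacobi (Hestenes) method, a commonly cited starting point for parallel SVD. It is a swapped-in illustration, not what MLX does internally and not a summary of the linked chapter. The function name `jacobi_singular_values` and the parameters `tol` and `max_sweeps` are made up for this example, and it returns only the singular values.

```python
import math


def jacobi_singular_values(A, tol=1e-12, max_sweeps=60):
    """Singular values of A (a list of rows) via one-sided Jacobi rotations."""
    m, n = len(A), len(A[0])
    # Store columns, since each rotation mixes exactly two columns.
    cols = [[A[i][j] for i in range(m)] for j in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(y * y for y in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # this pair is already orthogonal enough
                rotated = True
                # Plane rotation that makes columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                cols[p], cols[q] = (
                    [c * x - s * y for x, y in zip(cols[p], cols[q])],
                    [s * x + c * y for x, y in zip(cols[p], cols[q])],
                )
        if not rotated:
            break  # a full sweep with no rotations means convergence

    # Once all columns are mutually orthogonal, their norms are the singular values.
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)


singular_values = jacobi_singular_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The reason this family suits a GPU is that a rotation on columns `(p, q)` touches only those two columns, so rotations on disjoint pairs can be applied at the same time. The sketch above runs them sequentially for clarity.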