gligen / GLIGEN

Open-Set Grounded Text-to-Image Generation

CPU and MPS (Apple Silicon) backends #14

Closed · stared closed this 1 year ago

stared commented 1 year ago

I wanted to add the MPS backend (M1/M2 Apple Silicon GPUs) and regular CPUs. Performance-wise, these implementations won't be anywhere near CUDA + xformers. Still, I think it is worth adding them for the sake of compatibility, so the model can run on any device, even if slowly.

haotian-liu commented 1 year ago

Thanks for the contribution, this looks great! Let's work towards making the demo run efficiently on NVIDIA GPUs, CPUs, and macOS.

stared commented 1 year ago

@haotian-liu Thank you for the kind words! And, in the first place, thank you for this wonderful model.

So I added device autodetection, with priority CUDA > MPS > CPU; the CPU fallback is used only when no GPU is available.
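The selection boils down to something like the sketch below (a minimal illustration of the priority order, with names of my choosing, not the exact code in the diff):

```python
import torch

def autodetect_device() -> torch.device:
    """Pick the best available backend: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon GPU (PyTorch >= 1.12)
        return torch.device("mps")
    return torch.device("cpu")

print(autodetect_device())
```

The model and all input tensors are then moved to the selected device with `.to(device)`.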

A rough estimate on my MacBook Pro M1 (2021) (benchmark stats here):

For comparison, on an NVIDIA T4 (with xformers, without Triton), I got 27 seconds.
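(A note for anyone reproducing these numbers: CUDA and MPS execute asynchronously, so a fair wall-clock measurement should drain the device queue before and after the pass. A minimal sketch, where `run_one_pass` is a hypothetical stand-in for one full sampling run:)

```python
import time
import torch

def time_one_pass(run_one_pass, device: torch.device) -> float:
    """Wall-clock one inference pass, flushing async device queues."""
    def sync():
        if device.type == "cuda":
            torch.cuda.synchronize()
        elif device.type == "mps":
            torch.mps.synchronize()  # available in recent PyTorch releases

    sync()                            # finish any previously queued work
    start = time.perf_counter()
    run_one_pass()
    sync()                            # wait until this pass actually completes
    return time.perf_counter() - start
```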

haotian-liu commented 1 year ago

Hi @stared, thanks for the great work!

Several things:

  1. I tried running it on my M2 MacBook Air (16 GB). It takes at least 300 seconds and I did not see it finish. I believe the code is working, it is just slow. Which MacBook do you have, and how much RAM?
  2. What is mamba, and how is it different from miniconda? I was able to configure the environment successfully with miniconda (arm64).
  3. I bumped the transformers version to the latest, as 4.19.2 does not work on my M2: it tries to compile some version of the tokenizers package and fails. Is that also happening on your side?

stared commented 1 year ago

@haotian-liu I made sure it works on my laptop. Sadly, I have no other Apple Silicon machines to test on. Normally I would use GitHub Actions to test this, but in this case I am not sure it is doable.

In any case, I have a MacBook Pro with 32 GB of RAM. As you know, this RAM is shared with the GPU, so it may matter a lot for model performance. I have heard that performance drops drastically once memory usage reaches about 80%.

I use mamba because it is a drop-in replacement for conda and is much faster at solving dependencies. When I started using mamba, it also had better arm64 support. In fact, regular conda still carries this disclaimer:

Apple silicon builds are experimental and haven't had testing like the other platforms

See more at https://github.com/conda-forge/miniforge. That said, I guess it should also work with regular miniforge.

When it comes to versions, I bumped them up. As a side remark, I had to upgrade PyTorch, as 1.12.1 did not support aten::index.Tensor on MPS. I was happy to see that progress is fast and that the current version supports this operation.
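For context, that op is hit by plain fancy indexing, so a minimal repro looks roughly like this (assuming an Apple Silicon machine with a recent PyTorch):

```python
import torch

assert torch.backends.mps.is_available()
t = torch.arange(10, device="mps")
idx = torch.tensor([1, 3, 5], device="mps")
# Indexing with a tensor dispatches to aten::index.Tensor; on PyTorch 1.12.x
# this raised NotImplementedError on MPS, while recent releases support it.
# As a stopgap, PYTORCH_ENABLE_MPS_FALLBACK=1 routes unsupported ops to CPU.
print(t[idx])
```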

haotian-liu commented 1 year ago

Thank you for the explanation! I am merging this pull request, and thank you again for the contribution!