madebyollin / maple-diffusion

Stable Diffusion inference on iOS / macOS using MPSGraph
https://madebyoll.in/posts/maple_diffusion/
MIT License
799 stars, 51 forks

Terminated due to memory issue on iPhone 14 Pro Max #5

Closed saiedg closed 1 year ago

saiedg commented 2 years ago

I am running Xcode on an Intel Mac running macOS 12.6 and trying to install the app on my iPhone 14 Pro Max. After downloading a Stable Diffusion model checkpoint, downloading maple-diffusion.git, and running the code to convert the weights to fp16 binary blobs, I'm getting this memory-terminated error on my iPhone 14 Pro Max running iOS 16.0.3. Any ideas?

(screenshots attached)
madebyollin commented 2 years ago

Thanks for testing this! It looks like the Increased Memory Limit capability is missing (the error message says limit=2867MB, which is ~3GB - it should be ~4GB). I uploaded a project with the capability turned on, but I guess it doesn't transfer (meaning, my instructions were missing a step)

Here are the instructions for adding the capability manually: https://developer.apple.com/documentation/xcode/adding-capabilities-to-your-app#Add-a-capability.

The specific capability to add is Increased Memory Limit.

Here's what Xcode should show after the Increased Memory Limit capability is added:

(screenshot attached)

After that capability is added, Maple Diffusion should no longer hit any memory limit. If you see any error like Entitlements file "maple_diffusion.entitlements" was modified during the build, run Product > Clean Build Folder and then build it again.

Please let me know if it works - I'm curious to know how fast Maple Diffusion runs on the new phones :)

saiedg commented 2 years ago

Thank you for your fast reply. I'm still getting the memory error, but now a different one. I cleaned the build folder and restarted Xcode. Please let me know if you have any ideas. Looking forward to testing!

(screenshots attached)
madebyollin commented 2 years ago

Hmm, your limit is 2867MB, even after adding the "Increased Memory Limit" entitlement πŸ˜΅β€πŸ’«

(screenshot attached)

This is mysterious; either:

  1. The entitlement is displayed but not actually getting applied, or
  2. 2867MB is actually the "increased" limit, and the base memory limit is somehow lower on iPhone 14 Pro than on iPhone 13 Pro (???)

Option 2 seems wildly implausible, so I think it has to be option 1: the capability isn't being applied in your case, for some reason.

I see some slight differences in Xcode screenshots that make me worried about differences in the Signing section. Your screenshot shows "Signing (Debug)" and "Signing (Release)" sections separately, but mine doesn't. I'm using Xcode Version 14.0.1 (14A400) and my signing tab looks like this (email redacted):

(screenshot attached)

So, things to check:

saiedg commented 2 years ago

I just ran it on my M1 iPad with no issues! So cool. iPadOS 16 is not released yet, so I lowered the deployment target to iPadOS 15.6. Works no problem. About 1.56 steps/sec. So we know it's working. Unfortunately, it looks like the iPhone 14 Pro Max is hitting a memory limit. Any ideas at all to make this work using less memory?

(screenshots attached)

madebyollin commented 2 years ago

Cool, great to see it works on iPad!

I don't know of any easy way to get SD to run in < 3GB of memory with MPSGraph, unfortunately - I exhausted all of my tricks getting it below 4GB πŸ˜…... but if I can find a way to lower it further I'll definitely update the repo

saiedg commented 2 years ago

Interestingly enough, the M1 iPad does not need the "Increased Memory Limit" capability. I hope this is just a bug with the iPhone 14 Pro Max that will be fixed in iOS 16.1, or that you're able to find one last bit of magic to lower the memory somehow. What you've done is incredible!

madebyollin commented 2 years ago

Thanks! It looks like the iPad just has a higher base memory limit (5GB instead of 3GB). If you are able to get the "increase memory" entitlement working on iPad, you may even be able to turn off the saveMemoryButBeSlower option in ContentView.swift to get faster performance... but since generation already seems pretty fast, maybe don't risk it πŸ˜†

saiedg commented 2 years ago

Just tested on the M1 iPad with saveMemoryButBeSlower set to false. I had to turn on Increased Memory Limit... Peaked at 1.83/s! Pretty good performance increase. I'm here anytime you want to test any ideas you have on the iPhone 14 Pro Max!

(screenshot attached)

madebyollin commented 2 years ago

Gotcha! Though it looks like the performance is actually not better with the flag changed (the progress bar is confusingly printing seconds / step, not steps / second, so lower is better!)... maybe leave saveMemoryButBeSlower on for now πŸ˜†

(FWIW, repeated generations can get slower and slower if the GPU just starts getting too hot - it's possible that the saveMemoryButBeSlower option would still be faster from a cold start)

I'll be sure to let you know if I have ideas for getting this working on the 14 Pro Max - thanks again for your help testing this out!

saiedg commented 2 years ago

Hope to hear from you soon!

simonerlic commented 2 years ago

Hey! Thought I would chime in and confirm that I'm also running into the same issue (the RESOURCE_TYPE_MEMORY (limit=2867 MB...) error) with the iPhone 14 Pro running iOS 16.0.3, building on an Apple Silicon Macbook Pro. I'm more than happy to help test any troubleshooting ideas if we come up with anything!

Slightly curious, I gave os_proc_available_memory a go to see how much memory we had to work with, and it returned 2989554560 (~2989 MB). From what I can tell, this more or less confirms that Increased Memory Limit isn't working with either the iPhone 14 Pro or iOS 16.0.3.

Anyone have any suggestions?

saiedg commented 2 years ago

Hi! This guy on Twitter has also gotten Stable Diffusion working on iOS, but it's slower than yours. He says he got most of the app "running on the neural engine." Unfortunately, he does not detail how. I hope maybe that will help spring up an idea for you! https://twitter.com/wattmaller1/status/1582047120327991296

liuliu commented 2 years ago

I don't know of any easy way to get SD to run in < 3GB of memory with MPSGraph, unfortunately - I exhausted all of my tricks getting it below 4GB πŸ˜…... but if I can find a way to lower it further I'll definitely update the repo

There is a blog post about transformer optimizations Apple applied: https://machinelearning.apple.com/research/neural-engine-transformers These are mostly about speed, but it also shows a way to reduce intermediate tensor usage by using explicit multi-head attention. At FP16, the q * k^{T} result can use up to 500MiB, and splitting it across the 8 heads would reduce that peak memory usage. It is something you probably want to try.

(This optimization is pretty low on my list, since I am looking at a broader optimization, much like xformers + bitsandbytes, for the multi-head attention.)
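Some rough arithmetic illustrates the point. The shapes below are assumptions for SD v1 at 512x512 (a 64x64 latent flattened to 4096 tokens, 8 heads, a classifier-free-guidance batch of 2), not numbers measured from maple-diffusion itself:

```python
# Back-of-envelope peak memory for the q @ k^T attention scores.
# All shapes are assumptions for SD v1 at 512x512: 4096 latent tokens,
# 8 heads, CFG batch of 2, fp16 storage.
BYTES_FP16 = 2
batch, heads, seq = 2, 8, 4096

# All heads materialized at once: scores tensor [batch * heads, seq, seq]
all_heads = batch * heads * seq * seq * BYTES_FP16

# Explicit multi-head attention: only one head's scores live at a time
one_head = batch * seq * seq * BYTES_FP16

print(f"all heads at once: {all_heads / 2**20:.0f} MiB")  # 512 MiB
print(f"one head at a time: {one_head / 2**20:.0f} MiB")  # 64 MiB
```

Under these assumptions, the monolithic scores tensor is ~512 MiB, roughly matching the ~500MiB figure above, while computing one head at a time caps it at 64 MiB.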

madebyollin commented 2 years ago

@saiedg Matt is using CoreML (see this other thread) - his CoreML-based implementation seems to be moderately slower, but able to run on the neural engine, and more amenable to swapping parts of the UNet out to storage without paying a huge recompilation cost (so he can run a UNet step in under 3GB and ~5 seconds wall clock).

MPSGraph recompilation was unusably slow when I tried swapping portions of the UNet to storage iirc, and the level1 optimization flag (which seems to unlock the neural engine) gave me segfaults 🀷

Anyway, possible solutions would be:

  1. Find some way to get the 4GB limit unlocked on the iPhone 14s
  2. Find some tricks to make this MPSGraph version use <3GB without being substantially slower
  3. Re-implement the UNet with some non-MPSGraph API so it uses <3GB without being substantially slower. Possible APIs:
     3.1 CoreML
     3.2 MPS + Metal

...but none of those seem easy πŸ˜…

@liuliu Yup! I believe I already implemented the split-across-heads-to-save-memory trick (though my implementation might have bugs). The other big missing optimization I'm aware of is Flash Attention, but I don't see any easy way to bring that to MPSGraph.

liuliu commented 2 years ago

@liuliu Yup! I believe I already implemented the split-across-heads-to-save-memory trick (though my implementation might have bugs). The other big missing optimization I'm aware of is Flash Attention, but I don't see any easy way to bring that to MPSGraph.

Yeah, I don't know how to print a memory allocation graph from MPSGraph to know what's going on there; otherwise we could dig into where the extra 3+GiB of memory comes from (the model itself (unet) in fp16 is about 1.65G).
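The 1.65G figure is consistent with the commonly cited (approximate) SD v1 UNet parameter count of ~860M, at 2 bytes per fp16 weight:

```python
# Sanity check of the UNet fp16 size claim.
# 860M is an assumed, commonly cited approximate parameter count for
# the SD v1 UNet, not a number taken from maple-diffusion's weights.
params = 860_000_000
fp16_bytes = params * 2  # 2 bytes per fp16 weight

print(f"{fp16_bytes / 2**30:.2f} GiB")  # ~1.60 GiB
```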

ParityError commented 2 years ago

Maybe try the following boolean in addition to the com.apple.developer.kernel.increased-memory-limit entitlement:

com.apple.developer.kernel.extended-virtual-addressing 

You need to enable "Extended Virtual Address Space" manually in the App ID configuration in https://developer.apple.com/account/resources/identifiers/.
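For reference, a maple_diffusion.entitlements file with both keys enabled would look something like this (a sketch of the standard plist entitlements format; both entitlements are booleans):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.developer.kernel.increased-memory-limit</key>
    <true/>
    <key>com.apple.developer.kernel.extended-virtual-addressing</key>
    <true/>
</dict>
</plist>
```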

simonerlic commented 2 years ago

Maybe try the following boolean in addition to com.apple.developer.kernel.increased-memory-limit entitlement

I believe that @saiedg had that in their entitlements and still ran into the same issue (or at least that is what I've gathered from screenshots.) I'll give it a shot myself tonight though, since adding more virtual address space shouldn't hurt. I'll let you know how it goes!

saiedg commented 2 years ago

This is what I tested with. I will test again on Monday with iOS 16.1 and an updated Xcode.

(screenshot attached)
liuliu commented 2 years ago

Anyway, possible solutions would be:

1. Find some way to get the 4GB limit unlocked on the iPhone 14s

2. Find some tricks to make this MPSGraph version use <3GB without being substantially slower

3. Re-implement the UNet with some non-MPSGraph API so it uses <3GB without being substantially slower. Possible APIs:
   3.1 CoreML
   3.2 MPS + Metal

...but none of those seem easy πŸ˜…

Just to give you some updates on my end: I switched softmax from MPSGraph to MPSMatrixSoftMax, and some GEMMs from MPSGraph to MPSMatrixMultiplication. This helps because MPSGraph doesn't do in-place softmax (0.5G), and it seems that when I copy data out of MPSGraph, there is extra scratch space for GEMM (another 0.5G for the dot product of q and k). Combining these two, I was able to run the model in around 2GiB without a perf penalty (thus, 1.6 it/s on M1 and ~2 it/s on iPhone 14 Pro).
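One plausible reading of the 0.5G softmax figure, under assumed SD v1 shapes (4096 latent tokens, 8 heads, a classifier-free-guidance batch of 2, fp16): an out-of-place softmax holds the input and output scores tensors simultaneously, while an in-place softmax reuses a single buffer.

```python
# Hypothetical arithmetic for in-place vs. out-of-place softmax peaks.
# Shapes are assumptions (SD v1 at 512x512: CFG batch 2, 8 heads,
# 4096 tokens), not measurements from either implementation.
BYTES_FP16 = 2
scores = 2 * 8 * 4096 * 4096 * BYTES_FP16  # one scores tensor: 512 MiB

out_of_place_peak = 2 * scores  # input + output buffers live together
in_place_peak = scores          # single buffer reused

print(f"saving: {(out_of_place_peak - in_place_peak) / 2**30:.1f} GiB")  # 0.5 GiB
```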

saiedg commented 2 years ago

@liuliu that's great! Well done! Can you upload it??

liuliu commented 2 years ago

Hi, these changes were not made to maple-diffusion but to my own implementation, which is meaningfully different, making similar changes in maple-diffusion difficult. (maple-diffusion uses MPSGraph as a complete solution and generates the full graph, while I use MPSGraph more like how PyTorch does it, as individual ops.) The comment here is more a potential direction for @madebyollin, to see whether some of the learnings are applicable here.

simonerlic commented 2 years ago

Update: Looks like it's working now on iOS 16.1 stable!

I think that once someone else can confirm this we can close this issue!

hubin858130 commented 1 year ago

Great. I upgraded my iPhone 14 Pro from 16.0.2 to 16.1.1. It can run without prompting memory errors.

HelixNGC7293 commented 1 year ago

I can confirm that it's fixed in 16.1. I had a user with exactly the same issue on an iPhone 14 Pro on 16.0, and it was solved after upgrading to 16.1!

simonerlic commented 1 year ago

Perfect, thanks for confirming @HelixNGC7293 and @hubin858130!

@madebyollin I think this case is more or less resolved, seeing as an iOS update solved it.

madebyollin commented 1 year ago

Cool - thanks to everyone for testing and verifying this (and to whoever at 🍎 fixed the low limit)! I'll mark it closed, I guess :)