Thanks for testing this! It looks like the Increased Memory Limit capability is missing (the error message says limit=2867MB, which is ~3GB - it should be ~4GB). I uploaded a project with the capability turned on, but I guess it doesn't transfer (meaning my instructions were missing a step).
Here are the instructions for adding the capability manually: https://developer.apple.com/documentation/xcode/adding-capabilities-to-your-app#Add-a-capability. The specific capability to add is Increased Memory Limit.
Here's what Xcode should show after the Increased Memory Limit capability is added: [screenshot]
After that capability is added, Maple Diffusion should no longer hit any memory limit. If you see any error like Entitlements file "maple_diffusion.entitlements" was modified during the build, run Product > Clean Build Folder and then build it again.
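For reference, here's roughly what maple_diffusion.entitlements should end up containing once the capability is added (a minimal sketch - Xcode writes this for you when you add the capability, and the real file may include additional keys):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Added by the "Increased Memory Limit" capability; raises the app's memory ceiling -->
    <key>com.apple.developer.kernel.increased-memory-limit</key>
    <true/>
</dict>
</plist>
```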
Please let me know if it works - I'm curious to know how fast Maple Diffusion runs on the new phones :)
Thank you for your fast reply. I'm still getting the memory error, though it shows up a bit differently now. I cleaned the build folder and restarted Xcode. Please let me know if you have any ideas. Looking forward to testing!
Hmm, your limit is 2867MB, even after adding the "Increased Memory Limit" entitlement.
This is mysterious; either:
1. the capability isn't being applied in your case, or
2. the increased limit on this phone really is only ~3GB.
2 seems wildly implausible. So I think it still has to be 1; the capability isn't being applied in your case, for some reason.
I see some slight differences in the Xcode screenshots that make me worried about differences in the Signing section. Your screenshot shows "Signing (Debug)" and "Signing (Release)" sections separately, but mine doesn't. I'm using Xcode Version 14.0.1 (14A400) and my signing tab looks like this (email redacted): [screenshot]
So, the things to check are the Signing configuration for your target and your Xcode version.
I just ran it on my M1 iPad with no issues! So cool. iPadOS 16 is not released yet, so I lowered the deployment target to iPadOS 15.6. Works no problem. About 1.56 steps/sec. So we know it's working. Unfortunately, it looks like the iPhone 14 Pro Max is hitting a memory limit. Any ideas at all to make this work using less memory?
Cool, great to see it works on iPad!
I don't know of any easy way to get SD to run in < 3GB of memory with MPSGraph, unfortunately - I exhausted all of my tricks getting it below 4GB... but if I can find a way to lower it further I'll definitely update the repo
Interestingly enough, the iPad M1 does not need the 'increase memory' capability. Hopefully this is just a bug with the iPhone 14 Pro Max that will be fixed in iOS 16.1, or you're able to find one last bit of magic to split up the memory or lower it somehow! What you've done is incredible!
Thanks! It looks like the iPad just has a higher base memory limit (5GB instead of 3GB). If you are able to get the "increase memory" entitlement working on iPad, you may even be able to turn off the saveMemoryButBeSlower option in ContentView.swift to get faster performance... but since generation already seems pretty fast, maybe don't risk it.
Just tested on the iPad M1 with saveMemoryButBeSlower as false. I had to turn on increase memory... Peaked at 1.83/s! Pretty good performance increase. I am here anytime you want to test on the iPhone 14 Pro Max with any ideas you have!
Gotcha! Though it looks like the performance is actually not better with the flag changed (the progress bar is confusingly printing seconds / step, not steps / second - lower is better!)... maybe leave saveMemoryButBeSlower on for now.
(FWIW, repeated generations can get slower and slower if the GPU starts getting too hot - it's possible that turning saveMemoryButBeSlower off would still be faster from a cold start.)
I'll be sure to let you know if I have ideas for getting this working on the 14 Pro Max - thanks again for your help testing this out!
Hope to hear from you soon!
Hey! Thought I would chime in and confirm that I'm also running into the same issue (the RESOURCE_TYPE_MEMORY (limit=2867 MB...) error) with the iPhone 14 Pro running iOS 16.0.3, building on an Apple Silicon MacBook Pro. I'm more than happy to help test any troubleshooting ideas if we come up with anything!
Slightly curious, I gave running os_proc_available_memory a go to see how much memory we had to work with, and it returned 2989554560 (~2989 MB). From what I can tell, this more-or-less confirms that Increased Memory Limit isn't working with either the iPhone 14 Pro or iOS 16.0.3.
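For anyone else who wants to run the same check, here's a minimal sketch (not part of Maple Diffusion itself) that logs the current limit via os_proc_available_memory; with the Increased Memory Limit entitlement actually applied, we'd expect something closer to 4GB than ~3GB on an iPhone 14 Pro:

```swift
import os  // os_proc_available_memory() is declared in <os/proc.h>

// Log how much memory the process is currently allowed to allocate.
let availableBytes = os_proc_available_memory()
print("Available memory: \(availableBytes) bytes (~\(availableBytes / 1_000_000) MB)")
```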
Anyone have any suggestions?
Hi! This guy on twitter has also gotten Stable Diffusion working on iOS, but it's slower than yours. He says he got most of the app "running on the neural engine." Unfortunately he does not detail how. I hope maybe that will help spring up an idea for you! https://twitter.com/wattmaller1/status/1582047120327991296
I don't know of any easy way to get SD to run in < 3GB of memory with MPSGraph, unfortunately - I exhausted all of my tricks getting it below 4GB... but if I can find a way to lower it further I'll definitely update the repo
There is a blog post about the transformer optimizations Apple applied: https://machinelearning.apple.com/research/neural-engine-transformers These are mostly about speed, but it also shows a way to reduce intermediate tensor usage by using explicit multi-head attention. At FP16, the q * k^T result can use up to 500MiB, and splitting it into 8 chunks (one per head) would reduce that peak memory usage. It is something you probably want to try.
(This optimization is pretty low on my list, since I am looking at a broader optimization, much like xformers + bitsandbytes, for the multi-head attention.)
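For anyone following along, here's a rough sketch of that split-across-heads idea expressed with MPSGraph (the shapes, function name, and head handling are illustrative assumptions, not Maple Diffusion's actual code); only one head-sized score tensor is built per loop iteration instead of one big [heads x tokens x tokens] block, though whether peak runtime memory actually drops depends on how MPSGraph schedules and frees the per-head intermediates:

```swift
import MetalPerformanceShadersGraph

// Attention computed one head at a time to keep the q*k^T intermediate small.
func attentionSplitAcrossHeads(graph: MPSGraph,
                               q: MPSGraphTensor,   // [batch, tokens, heads * headDim]
                               k: MPSGraphTensor,   // [batch, tokens, heads * headDim]
                               v: MPSGraphTensor,   // [batch, tokens, heads * headDim]
                               heads: Int,
                               headDim: Int) -> MPSGraphTensor {
    let scale = graph.constant(1.0 / Double(headDim).squareRoot(), dataType: .float16)
    var headOutputs: [MPSGraphTensor] = []
    for h in 0..<heads {
        // Slice one head's channels out of q, k, v.
        let qh = graph.sliceTensor(q, dimension: 2, start: h * headDim, length: headDim, name: nil)
        let kh = graph.sliceTensor(k, dimension: 2, start: h * headDim, length: headDim, name: nil)
        let vh = graph.sliceTensor(v, dimension: 2, start: h * headDim, length: headDim, name: nil)
        // scores = (qh @ kh^T) / sqrt(headDim)  ->  [batch, tokens, tokens] for this head only
        let khT = graph.transposeTensor(kh, dimension: 1, withDimension: 2, name: nil)
        let scores = graph.multiplication(
            graph.matrixMultiplication(primary: qh, secondary: khT, name: nil),
            scale, name: nil)
        let weights = graph.softMax(with: scores, axis: 2, name: nil)
        headOutputs.append(graph.matrixMultiplication(primary: weights, secondary: vh, name: nil))
    }
    // Stitch the per-head outputs back into [batch, tokens, heads * headDim].
    return graph.concatTensors(headOutputs, dimension: 2, name: nil)
}
```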
@saiedg Matt is using CoreML (see this other thread) - his CoreML-based implementation seems to be moderately slower, but able to run on the neural engine, and more amenable to swapping parts of the UNet out to storage without paying a huge recompilation cost (so he can run a UNet step in under 3GB and ~5 seconds wall clock).
MPSGraph recompilation was unusably slow when I tried swapping portions of the UNet to storage iirc, and the level1 optimization flag (which seems to unlock the neural engine) gave me segfaults.
Anyway, possible solutions would be:
1. Find some way to get the 4GB limit unlocked on the iPhone 14s
2. Find some tricks to make this MPSGraph version use <3GB without being substantially slower
3. Re-implement the UNet with some non-MPSGraph API so it uses <3GB without being substantially slower. Possible APIs:
   3.1 CoreML
   3.2 MPS + Metal
...but none of those seem easy.
@liuliu Yup! I believe I already implemented the split-across-heads-to-save-memory trick (though my implementation might have bugs). The other big missing optimization I'm aware of is Flash Attention, but I don't see any easy way to bring that to MPSGraph.
Yeah, I don't know how to print the memory allocation graph from MPSGraph to see what's going on there; otherwise we could dig in to find where the extra 3+GiB of memory comes from (the model itself (the UNet) in fp16 is about 1.65G).
Maybe try the following boolean in addition to the com.apple.developer.kernel.increased-memory-limit entitlement: com.apple.developer.kernel.extended-virtual-addressing
You need to enable "Extended Virtual Address Space" manually in the App ID configuration at https://developer.apple.com/account/resources/identifiers/.
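To make that concrete, here's a sketch of how the relevant entries in maple_diffusion.entitlements might look with both booleans set (the surrounding plist boilerplate is omitted; this is just an illustration, not a file from the repo):

```xml
<!-- Increased Memory Limit capability -->
<key>com.apple.developer.kernel.increased-memory-limit</key>
<true/>
<!-- Extended Virtual Addressing (also needs "Extended Virtual Address Space" enabled on the App ID) -->
<key>com.apple.developer.kernel.extended-virtual-addressing</key>
<true/>
```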
Maybe try the following boolean in addition to com.apple.developer.kernel.increased-memory-limit entitlement
I believe that @saiedg had that in their entitlements and still ran into the same issue (or at least that is what I've gathered from screenshots.) I'll give it a shot myself tonight though, since adding more virtual address space shouldn't hurt. I'll let you know how it goes!
This is what I tested with. I will test again on Monday with iOS 16.1 and an updated Xcode.
Anyway, possible solutions would be:
1. Find some way to get the 4GB limit unlocked on the iPhone 14s
2. Find some tricks to make this MPSGraph version use <3GB without being substantially slower
3. Re-implement the UNet with some non-MPSGraph API so it uses <3GB without being substantially slower. Possible APIs:
   3.1 CoreML
   3.2 MPS + Metal
...but none of those seem easy.
Just to give you some updates on my end: I switched softmax from MPSGraph to MPSMatrixSoftMax and some GEMMs from MPSGraph to MPSMatrixMultiplication. This helps because MPSGraph doesn't do an in-place softmax (0.5G), and it seems that when I copy data out of MPSGraph, there is extra scratch space for the GEMM (another 0.5G for the dot product of q, k). Combining these two, I was able to run the model in around 2GiB without a perf penalty (thus, 1.6 it / s on M1 and ~2 it / s on iPhone 14 Pro).
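A minimal sketch of the softmax part of that idea (assumed shapes, buffer handling, and names - not the actual implementation): run the attention-score softmax through MPSMatrixSoftMax and write the result back over the scores buffer, so no second full-size tensor has to be allocated. This assumes the kernel tolerates the same MPSMatrix as input and result, as the comment above suggests; if not, the result matrix could at least reuse a pooled buffer instead of growing the MPSGraph working set.

```swift
import Metal
import MetalPerformanceShaders

// Row-wise softmax over an attention-score matrix, reusing the input buffer for the output.
func softmaxScoresInPlace(device: MTLDevice,
                          commandBuffer: MTLCommandBuffer,
                          scoresBuffer: MTLBuffer,   // rows x columns of float16 scores
                          rows: Int, columns: Int) {
    let descriptor = MPSMatrixDescriptor(rows: rows,
                                         columns: columns,
                                         rowBytes: columns * MemoryLayout<Float16>.stride,
                                         dataType: .float16)
    let scores = MPSMatrix(buffer: scoresBuffer, descriptor: descriptor)
    let softmax = MPSMatrixSoftMax(device: device)
    // Same matrix as input and result: the softmax output lands in the existing buffer.
    softmax.encode(commandBuffer: commandBuffer, inputMatrix: scores, resultMatrix: scores)
}
```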
@liuliu that's great! Well done! Can you upload it??
Hi, these were not done with maple-diffusion but against my own implementation, which is different enough that making similar changes in maple-diffusion would be difficult. (maple-diffusion uses MPSGraph as a complete solution and generates the full graph, while I use MPSGraph more like how PyTorch does it, as individual ops.) The comment here is more a potential direction for @madebyollin, to see whether some of the learnings are applicable here.
Update: Looks like it's working now on iOS 16.1 stable!
I think that once someone else can confirm this we can close this issue!
Great. I upgraded my iPhone 14 Pro from 16.0.2 to 16.1.1. It can run without hitting any memory errors.
I can confirm that it's fixed in 16.1. I had a user with exactly the same issue on an iPhone 14 Pro on 16.0, and it was solved after upgrading to 16.1!
Perfect, thanks for confirming @HelixNGC7293 and @hubin858130!
@madebyollin I think this case is more or less resolved, seeing as an iOS update solved it.
Cool - thanks to everyone for testing and verifying this (and to whoever at Apple fixed the low limit)! I'll mark it closed, I guess :)
I am running Xcode on an Intel Mac running macOS 12.6 and trying to install the app on my iPhone 14 Pro Max. After downloading a Stable Diffusion model checkpoint, downloading maple-diffusion.git, and running the code to convert it to fp16 binary blobs, I'm getting this memory terminated error on my iPhone 14 Pro Max running iOS 16.0.3. Any ideas?