Closed bghira closed 4 months ago
Can this be related to the device "mps"? I am not able to reproduce on "cuda"
happens on CPU too, and other people on the Community page of the 34b model have reported the same issue
CPU Inputs: tensor([[ 6, 1328, 144, 47329, 567, 3275, 98, 7, 6, 2942,
144, 64000, 59568, 144, 5697, 620, 2709, 594, 719, 2728,
100, 7, 6, 14135, 144]])
GPU Inputs: tensor([[ 6, 1328, 144, 47329, 567, 3275, 98, 7, 6, 2942,
144, 64000, 59568, 144, 5697, 620, 2709, 594, 719, 2728,
100, 7, 6, 14135, 144]], device='mps:0')
@bghira I'm unable to reproduce on CPU. Can you show the MRE you have for CPU? Ie. a script/gist that I can run top to bottom that reproduces what you're seeing.
On MPS I'm able to reproduce, and I've verified there's a bug (in PyTorch/MPS). I have a way to mitigate the issue, but it's suboptimal to merge this approach in transformers since that specific bug is only affecting MPS. Please let me know if you want the transformers workaround while I work on getting this fixed in PyTorch.
@hvaara Thanks for looking into this!
If you have a workaround, it would be great if you could share. As you say, it doesn't make sense to merge it on the transformers side but potentially very useful for the community whilst the fix is still to be merged into PyTorch.
hmm, i can't reproduce it on CPU, but it also never really returns anything now π might have seen this error on a different version when I saw it there. maybe now it's simply working, computing the result.
is there a link to upstream bug?
Sorry, I completely forgot about this. I'll get both the patch to upstream and a workaround for transformers out ASAP (aiming for within a few hours).
The code changes for a workaround in transformers can be seen in https://github.com/hvaara/transformers/commit/ed2f0df5992fbe11521a725efa65c99262633913.
I've created a gist to show how to run the model with these changes patched in. Note that I'm using llava-hf/llava-v1.6-mistral-7b-hf
in the gist as opposed to llava-hf/llava-v1.6-34b-hf
from https://github.com/huggingface/transformers/issues/30294#issue-2248408888.
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
is needed because the aten::isin
operator is not natively implemented for MPS yet. This snippets lets PyTorch fall back to the CPU for that specific operator. Once https://github.com/pytorch/pytorch/pull/124896 is merged it should be available in the nightlies of PyTorch. I think falling back to the CPU for our use-case in transformers will not have a significant impact on performance.
To answer your original question regarding a link to a bug in upstream, there isn't one yet (AFAIK). I think I'll skip the ceremony of creating a bug, and jump straight to a pull request since I have the code fix and regression test ready for it already. I'll provide an update when I've done that with a link to the PR. Please feel free to nudge me again if there isn't an update in a timely manner ;)
@pcuenca will love that! you're the best @hvaara
Thanks for the ping @bghira :) Very nice work @hvaara π Just fyi, note that an mps
-friendly path for isin()
was recently merged in #30376 :)
@pcuenca That's great! Thanks for letting us know π
https://github.com/pytorch/pytorch/issues/96614 describes the problem in MPS/PyTorch seen in this issue.
PyTorch PR: https://github.com/pytorch/pytorch/pull/125318
https://github.com/pytorch/pytorch/pull/125318 has been merged. I don't know exactly when a new release will be cut, but I'd expect it to be available in the nightlies in <24h. Assuming it's not reverted of course ;)
pytorch/pytorch#125318 has been merged
Nice! π
I guess this issue can be closed? :)
yes, this error is fixed but the model still just doesn't work on MPS... big sad
For the 34B version? Does the 7b version work as expected? What's the failure mode? How do you provoke it? What do you expect to see?
yes, the 34b on 128G M3 Max.
it just gets stuck and never returns. i left it for >12 hours overnight. how long is it expected to take?
i do not have the network bandwidth to download the 7B as well yet, i never focused on it because it performed very poorly in my group's initial tests. it is very hamstrung by the 7B LLM.
What's the code/prompt you used? Can't guarantee I know how to fix it, but I want to try to replicate what you're seeing.
I don't know about speed, but I'd expect it to be measured in iterations per second, not iterations per day π
used the same demo code as initially posted:
[:)] % python llava.py
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 15/15 [00:00<00:00, 22.16it/s]
I tested on an M3 Max 128 GB.
The problem you're seeing seems to be due to a resource utilization issue (see https://gist.github.com/hvaara/f8f911c6e271cae8aa0e4811b24d4eb5). I suspect another issue with MPS in PyTorch (memory leak in the MPS cache).
I'll try to give another update with a workaround in a couple of hours. Nudge me if I forget.
oh π€ can we disable the mps cache? i assume leaked objects means we can't even sync and force gc to clear them
Yup. The proposed workaround was doing torch.mps.empty_cache()
in the loop:
for chunk in tqdm(streamer, total=max_new_tokens+2):
text.append(chunk)
print(chunk)
+ torch.mps.empty_cache()
print(f"{torch.mps.driver_allocated_memory()/1024**3 = }")
print(f"{torch.mps.current_allocated_memory()/1024**3 = }")
print(f"{len(text)}")
I'll know if this worked very soon π
It worked https://gist.github.com/hvaara/9a7c361478ad8aa429d8172b3119a657.
Are you able to reproduce on your end?
i can't really run the current pytorch nightly, so i'm unable to test this anymore. the error I receive is now inside pytorch, some hardcoded cuda codepath.
current pytorch 2.3 is back to the:
ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
current pytorch 2.4:
Backend not supported for None
Original traceback:
None
when looking through the pytorch codebase for this error, it's in some rng function that seems to not be called from anywhere :D
ok, by updating that method and adding a helper:
@run_and_save_rng_state.py_impl(DispatchKey.MPS)
def impl_mps(op, *args, **kwargs):
return torch.mps.get_rng_state(), op(*args, **kwargs)
@run_and_save_rng_state.py_impl(DispatchKey.BackendSelect)
def impl_backend_select(op, *args, **kwargs):
impl_map = {"cuda": impl_cuda, "cpu": impl_cpu, "mps": impl_mps}
device = get_device(args, kwargs)
assert device in impl_map, f"Backend not supported for {device}"
impl = impl_map[device]
return impl(op, *args, **kwargs)
i can get past the errors on pytorch nightly:
Class: <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
Loading checkpoint shards: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 15/15 [01:51<00:00, 7.45s/it]
Processing inputs.
Generating completion for prompt: "<|im_start|>system
Answer the questions.<|im_end|><|im_start|>user
<image>
What is shown in this image?<|im_end|><|im_start|>assistant
"
Emptying MPS cache. Memory allocated: 75535089664 bytes.
Emptying MPS cache. Memory allocated: 78829928448 bytes.
Emptying MPS cache. Memory allocated: 80082075648 bytes.
Emptying MPS cache. Memory allocated: 80195469312 bytes.
Emptying MPS cache. Memory allocated: 80881369088 bytes.
Emptying MPS cache. Memory allocated: 80881500160 bytes.
seems marginal but it's 'overing around that value
The reason it works on PyTorch nightly is probably because of https://github.com/pytorch/pytorch/pull/125318. If you can't use the nightlies, does it work if you apply https://github.com/hvaara/transformers/commit/ed2f0df5992fbe11521a725efa65c99262633913 as a workaround in transformers (with PyTorch 2.3)?
From your output it looks like it's working?
it appears equivalent to using pytorch 2.4 with my additional fix for rng impl
Great! Is everything working as you'd expect on your end now? (Although it's currently held together with string and duct tape π )
it appears this is the best we can do on mps for now though it's so slow i've never seen the example finish :( i'll have to try downloading the smaller model at some point.
Unfortunately the 34B model is quite slow on this hardware. Quantization and efficient alternative attention mechanisms with MPS support are still limited. This will eventually change. The 7B model is much faster.
FWIW I've allocated resources to find and hopefully mitigate the second batch of issues we saw (MPS cache).
yes, MPS unfortunately seems to receive less love than MLX does?
Short update: @skotapati (et al?) has found a solution to a memory leak in pooling (see https://github.com/pytorch/pytorch/issues/125217) π
The fix will be in an upcoming macOS release. I'm hopeful this will resolve the bulk of the memory congestion we saw in LLaVA-NeXT 34B.
I had nothing to do with the fix, just wanted to share the good news π
/cc @bghira
System Info
transformers
version: 4.40.0.dev0Who can help?
@amyeroberts @arth
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-34b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-34b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True) model.to("mps:0")
prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true" image = Image.open(requests.get(url, stream=True).raw) prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
inputs = processor(prompt, image, return_tensors="pt").to("mps:0")
autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
There is apparently some confusion on the other open issue about this error message, but it seems to be a distinctly different problem.
I can't use the Git version of Transformers or the latest release of Tokenizers, because this library (for SOME reason) needs an older release of Tokenizers.
Expected behavior
The demo should work.