
# Ran out of memory in memory space vmem / register allocator spill slots call depth 2 #7962

Open · radna0 opened this issue 1 month ago

radna0 commented 1 month ago

## 🐛 Bug

```
python caption.py
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.56it/s]
xla:0
Traceback (most recent call last):
  File "/home/kojoe/EasyAnimate/easyanimate/image_caption/caption.py", line 83, in <module>
    generated_ids = model.generate(**inputs, max_new_tokens=512)
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/kojoe/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2015, in generate
    result = self._sample(
  File "/home/kojoe/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2948, in _sample
    model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)
  File "/home/kojoe/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1387, in _get_initial_cache_position
    cache_position = torch.ones_like(input_ids[0, :], dtype=torch.int64).cumsum(0) - 1
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 29.96M of 16.00M vmem. Exceeded vmem capacity by 13.96M.

Program vmem requirement 29.96M:
    scoped           29.96M

  Largest program allocations in vmem:

  1. Size: 29.67M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

  2. Size: 64.0K
     Shape: f32[128,8]{1,0}
     Unpadded size: 4.0K
     Extra memory due to padding: 60.0K (16.0x expansion)
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  3. Size: 64.0K
     Shape: f32[128,8]{1,0}
     Unpadded size: 4.0K
     Extra memory due to padding: 60.0K (16.0x expansion)
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  4. Size: 64.0K
     Shape: f32[128,8]{1,0}
     Unpadded size: 4.0K
     Extra memory due to padding: 60.0K (16.0x expansion)
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  5. Size: 64.0K
     Shape: f32[128,8]{1,0}
     Unpadded size: 4.0K
     Extra memory due to padding: 60.0K (16.0x expansion)
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  6. Size: 8.0K
     Shape: u8[8192]{0}
     Unpadded size: 8.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  7. Size: 8.0K
     Shape: u8[8192]{0}
     Unpadded size: 8.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  8. Size: 8.0K
     Shape: u8[8192]{0}
     Unpadded size: 8.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  9. Size: 8.0K
     Shape: u8[8192]{0}
     Unpadded size: 8.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  10. Size: 4.0K
     Shape: f32[8,128]{1,0}
     Unpadded size: 4.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================

  11. Size: 4.0K
     Shape: f32[8,128]{1,0}
     Unpadded size: 4.0K
     XLA label: reduce-window.1 = reduce-window(fusion.5, fusion.4, constant.2, constant.2), window={size=1x128 pad=0_0x127_0}, to_apply=AddComputation.3.clone
     Allocation type: scoped
     ==========================
```
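
The traceback points at the `cumsum` inside `_get_initial_cache_position`, and the largest scoped allocations above are all `reduce-window` ops, which is how XLA typically lowers a cumulative sum. A minimal sketch that tries to isolate just that lowering (assuming the cumsum is indeed the trigger; the sequence length below is a placeholder, not the real prompt length):

```python
# Hypothetical minimal repro: run only the cumsum that generate() performs,
# to check whether the reduce-window lowering alone exhausts scoped vmem.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Stand-in for the prompt token ids; the real length comes from the processor output.
input_ids = torch.ones((1, 4096), dtype=torch.int64, device=device)

# Same expression transformers uses in _get_initial_cache_position.
cache_position = torch.ones_like(input_ids[0, :], dtype=torch.int64).cumsum(0) - 1

xm.mark_step()  # force compilation/execution of the pending graph
print(cache_position[-1].item())
```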

## To Reproduce

Steps to reproduce the behavior:

1. Run the following script (`caption.py`):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import numpy as np
import torch
import torch_xla as xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.spmd as xs

from torch.distributed._tensor import DeviceMesh, distribute_module
from torch_xla.distributed.spmd import auto_policy

from torch_xla import runtime as xr
from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    _prepare_spmd_partition_spec,
    SpmdFullyShardedDataParallel as FSDPv2,
)

import time

xla.experimental.eager_mode(True)
start = time.time()

device = xla.device()

# default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
).to(device)

print(model.device)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4")

message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://w0.peakpx.com/wallpaper/607/308/HD-wallpaper-anime-girl-black-hair-guitar-instrument-red-eyes-school-uniform-skirt.jpg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

all_messages = [[message] for _ in range(1)]
for messages in all_messages:
    # Preparation for inference
    texts = [
        processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
        for msg in messages
    ]

    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=texts,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    for i, text in enumerate(output_text):
        print(f"Output {i}: {text}")

print(f"Time taken: {time.time() - start}")
```




## Expected behavior


The script should run the Qwen2-VL-2B-Instruct model as in the quickstart at https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct#quickstart and print the generated caption, rather than failing during XLA compilation.

## Environment

 - Reproducible on XLA backend: TPU
 - torch_xla version: nightly 2.5

## Additional context

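A debugging sketch that could make the failing program easier to inspect: the environment variables are the standard torch_xla debug switches from TROUBLESHOOTING.md, while the `LIBTPU_INIT_ARGS` flag is an assumption carried over from JAX/Pallas usage of libtpu and may not be honored through PyTorch/XLA:

```python
# Set debug knobs before torch_xla (and libtpu) are imported/initialized.
import os

os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/qwen2vl_ir.txt"  # dump the lazy-tensor graphs
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"                   # save them as HLO text
os.environ["XLA_IR_DEBUG"] = "1"                             # annotate IR with Python frames
os.environ["XLA_HLO_DEBUG"] = "1"                            # propagate those annotations to HLO

# Assumption: libtpu accepts this scoped-vmem flag (seen in JAX/Pallas setups);
# raising the limit is only a guess at a workaround for the 16M vmem budget.
os.environ["LIBTPU_INIT_ARGS"] = "--xla_tpu_scoped_vmem_limit_kib=65536"

import torch_xla  # noqa: E402  (imported only after the environment is configured)
```
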
radna0 commented 1 month ago

Might be related to #7942