🐛 Bug

The error seems to be related to pixel_values being padded

WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
config.json: 100%|█████████████████████████████████████████████████████████| 3.95k/3.95k [00:00<00:00, 23.8MB/s]
configuration_internvl_chat.py: 100%|██████████████████████████████████████| 3.85k/3.85k [00:00<00:00, 26.1MB/s]
configuration_intern_vit.py: 100%|█████████████████████████████████████████| 5.55k/5.55k [00:00<00:00, 29.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
configuration_internlm2.py: 100%|██████████████████████████████████████████| 7.00k/7.00k [00:00<00:00, 40.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internvl_chat.py
- configuration_intern_vit.py
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_internvl_chat.py: 100%|███████████████████████████████████████████| 16.3k/16.3k [00:00<00:00, 70.4MB/s]
modeling_internlm2.py: 100%|███████████████████████████████████████████████| 61.2k/61.2k [00:00<00:00, 77.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
conversation.py: 100%|█████████████████████████████████████████████████████| 15.0k/15.0k [00:00<00:00, 82.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_intern_vit.py: 100%|██████████████████████████████████████████████| 18.1k/18.1k [00:00<00:00, 75.5MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internvl_chat.py
- modeling_internlm2.py
- conversation.py
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
FlashAttention2 is not installed.
model.safetensors.index.json: 100%|█████████████████████████████████████████| 51.2k/51.2k [00:00<00:00, 637kB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████| 4.94G/4.94G [08:10<00:00, 10.1MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:28<00:00, 55.8MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:25<00:00, 57.2MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████| 1.38G/1.38G [00:34<00:00, 39.8MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████| 4/4 [11:40<00:00, 175.20s/it]
Warning: Flash attention is not available, using eager attention instead.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.86it/s]
generation_config.json: 100%|███████████████████████████████████████████████████| 115/115 [00:00<00:00, 679kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████| 4.00k/4.00k [00:00<00:00, 26.3MB/s]
tokenization_internlm2.py: 100%|███████████████████████████████████████████| 8.79k/8.79k [00:00<00:00, 54.9MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100%|█████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 13.1MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████| 179/179 [00:00<00:00, 2.05MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████| 844/844 [00:00<00:00, 8.83MB/s]
Traceback (most recent call last):
  File "/home/kojoe/EasyAnimate/easyanimate/image_caption/template.py", line 132, in <module>
    response = model.chat(tokenizer, pixel_values, question, generation_config)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 356, in chat
    generation_output = self.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 410, in generate
    outputs = self.language_model.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3038, in _sample
    unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 511, in __call__
    is_done = is_done | criteria(input_ids, scores, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 502, in __call__
    is_done = torch.isin(input_ids[:, -1], self.eos_token_id)
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 29.95M of 16.00M vmem. Exceeded vmem capacity by 13.95M.

Program vmem requirement 29.95M:
    scoped           29.95M

  Largest program allocations in vmem:

  1. Size: 29.66M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

  2. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  3. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  4. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  5. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  6. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  7. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  8. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  9. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  10. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  11. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  12. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  13. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

To Reproduce

Steps to reproduce the behavior:

template.py


import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
import os
import torch_xla
import torch_xla.distributed.spmd as xs
import torch_xla.core.xla_model as xm
from torch_xla import runtime as xr

xr.use_spmd(auto=False)

from torch_xla.experimental.spmd_fully_sharded_data_parallel import ( _prepare_spmd_partition_spec, SpmdFullyShardedDataParallel as FSDPv2, )

IMAGENET_MEAN = (0.485, 0.456, 0.406) IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size): MEAN, STD = IMAGENET_MEAN, IMAGENET_STD transform = T.Compose([ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=MEAN, std=STD) ]) return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): best_ratio_diff = float('inf') best_ratio = (1, 1) area = width height for ratio in target_ratios: target_aspect_ratio = ratio[0] / ratio[1] ratio_diff = abs(aspect_ratio - target_aspect_ratio) if ratio_diff < best_ratio_diff: best_ratio_diff = ratio_diff best_ratio = ratio elif ratio_diff == best_ratio_diff: if area > 0.5 image_size image_size ratio[0] * ratio[1]: best_ratio = ratio return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False): orig_width, orig_height = image.size aspect_ratio = orig_width / orig_height

# calculate the existing image aspect ratio
target_ratios = set(
    (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
    i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
    aspect_ratio, target_ratios, orig_width, orig_height, image_size)

# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
    box = (
        (i % (target_width // image_size)) * image_size,
        (i // (target_width // image_size)) * image_size,
        ((i % (target_width // image_size)) + 1) * image_size,
        ((i // (target_width // image_size)) + 1) * image_size
    )
    # split the image
    split_img = resized_img.crop(box)
    processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
    thumbnail_img = image.resize((image_size, image_size))
    processed_images.append(thumbnail_img)
return processed_images

def load_image(image_file, input_size=448, max_num=12): image = Image.open(image_file).convert('RGB') transform = build_transform(input_size=input_size) images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) pixel_values = [transform(image) for image in images] pixel_values = torch.stack(pixel_values) return pixel_values

path = 'radna/XLA-InternVL2-8B' model = AutoModel.from_pretrained( path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, use_flash_attn=True, trust_remote_code=True, ).eval()

Define the mesh and partition_spec

num_devices = xr.global_runtime_device_count() mesh_shape = (num_devices, 1) device_ids = np.array(range(num_devices))

To be noted, the mesh must have an axis named 'fsdp', which the weights and activations will be sharded on.

mesh = xs.Mesh(device_ids, mesh_shape, ("fsdp", "model")) xs.set_global_mesh(mesh)

model = FSDPv2(model) tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

set the max number of tiles in `max_num`

pixel_values = load_image('./image1.jpg', max_num=1).to(torch.bfloat16).to(xm.xla_device()) generation_config = dict(max_new_tokens=1024, do_sample=True)

xs.mark_sharding(pixel_values, xs.get_global_mesh(), _prepare_spmd_partition_spec(pixel_values, shard_maximal=True))

single-image single-round conversation (单图单轮对话)

question = '\nPlease describe the image shortly.' response = model.chat(tokenizer, pixel_values, question, generation_config) print(f'User: {question}\nAssistant: {response}')


2. python template.py

<!-- If you have a code sample, error messages, stack traces, please provide it here as well. Or better use the Colab template: https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb -->

## Expected behavior

<!-- A clear and concise description of what you expected to happen. -->

Should run the Modified XLA version of InternVL2-8B model at https://huggingface.co/radna/XLA-InternVL2-8B

## Environment

 - Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
 - torch_xla version: NIghtly 2.5

## Additional context

<!-- Add any other context about the problem here. -->

Reproducable on TPU V2 && V3

pytorch / xla

Ran out of memory in memory space vmem / Extra memory due to padding #7942

🐛 Bug

To Reproduce

Define the mesh and partition_spec

To be noted, the mesh must have an axis named 'fsdp', which the weights and activations will be sharded on.

set the max number of tiles in `max_num`

single-image single-round conversation (单图单轮对话)

pytorch / xla

Ran out of memory in memory space vmem / Extra memory due to padding #7942

🐛 Bug

To Reproduce

Define the mesh and partition_spec

To be noted, the mesh must have an axis named 'fsdp', which the weights and activations will be sharded on.

set the max number of tiles in max_num

single-image single-round conversation (单图单轮对话)

set the max number of tiles in `max_num`