rhymes-ai / Aria

Codebase for Aria - an Open Multimodal Native MoE
Apache License 2.0

V100 run video understanding #29

Open gehong-coder opened 1 month ago

gehong-coder commented 1 month ago

The V100 cannot use flash attention, so I switched to eager attention by changing the line to `self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES["eager"]`,

but the following error occurred:

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward encoder_outputs = self.encoder( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward layer_outputs = encoder_layer( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward hidden_states, attn_weights = self.self_attn( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward raise ValueError( ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])

aria-hacker commented 1 month ago

We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder

```python
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Corrected 'true' to 'True'
    attn_implementation="eager",
)
```
gehong-coder commented 1 month ago

> We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder
>
> ```python
> model = AutoModelForCausalLM.from_pretrained(
>     "rhymes-ai/Aria",
>     device_map="auto",
>     torch_dtype=torch.bfloat16,
>     trust_remote_code=True,  # Corrected 'true' to 'True'
>     attn_implementation="eager",
> )
> ```

Hello, this problem still occurs after I use the above settings. It seems that passing attn_implementation="eager" here does not actually make the vision tower use eager attention internally.

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward hidden_states, attn_weights = self.self_attn( return super().apply(*args, **kwargs) # type: ignore[misc] File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 619, in forward out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd( RuntimeError: FlashAttention only supports Ampere GPUs or newer.

So I went into modeling_idefics2 and, at line 442 (`self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES[config._attn_implementation]`), changed `config._attn_implementation` to "eager". Then the following appears:

```
File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
    image_outputs, image_attn_mask = self.vision_tower(
  ...
  File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
    vit_oup = self.vision_model(
  ...
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
    encoder_outputs = self.encoder(
  ...
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
    layer_outputs = encoder_layer(
  ...
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
    hidden_states, attn_weights = self.self_attn(
  ...
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
    raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])
```
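One quick way to see which attention implementation each part of the loaded model actually ended up with is to inspect the config values directly. This is just a sketch; `vision_config` as the name of the vision tower's sub-config is an assumption about Aria's composite config.

```python
# Sketch: the mask-building code keys off these config values rather than off whichever
# class you swap in by hand, so they are the first thing to check when eager attention
# seems to be ignored.
print(model.config._attn_implementation)                # e.g. "flash_attention_2" or "eager"
print(model.config.vision_config._attn_implementation)  # vision tower sub-config (assumed attribute name)
```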

aria-hacker commented 1 month ago

@gehong-coder Is your local model updated to the latest rhymes-ai/Aria repo? We updated it yesterday
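If the local snapshot might be stale, one way to rule that out is to force a fresh download. This is only a sketch; `force_download` is a standard `from_pretrained` argument.

```python
import torch
from transformers import AutoModelForCausalLM

# Re-fetch the remote code and weights instead of reusing a possibly stale cached snapshot.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    force_download=True,
)
```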

gehong-coder commented 1 month ago

I have updated the model, but the error still appears. Is it because grouped_gemm is not installed?

```
`grouped_gemm` is not installed, using sequential GEMM, which is slower.
`AriaMoELMForCausalLM` has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
```

saeedkhaki92 commented 1 month ago

Eager attention is not working, so we are not able to run the model on V100s. Could you please help with this feature?

aria-hacker commented 1 month ago

@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? And what is the version of your transformers?

gehong-coder commented 1 month ago

> @gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? And what is the version of your transformers?

```
python        3.10
tokenizers    0.20.1
torch         2.4.0
torchvision   0.19.0
tqdm          4.66.5
transformers  4.45.0
triton        3.0.0
```

This is my code:

```python
import os
from typing import List

import requests
import torch
from decord import VideoReader
from PIL import Image
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoProcessor


def load_model():
    model_id_or_path = "/home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec"
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Corrected 'true' to 'True'
        attn_implementation="eager",
    )

    # NOTE: this second call reloads the model *without* attn_implementation="eager",
    # overwriting the eager model created above.
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
    )

    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
    return model, processor


model, processor = load_model()


def load_video(video_file, num_frames=128, cache_dir="cached_video_frames", verbosity="DEBUG"):
    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)

    video_basename = os.path.basename(video_file)
    cache_subdir = os.path.join(cache_dir, f"{video_basename}_{num_frames}")
    os.makedirs(cache_subdir, exist_ok=True)

    cached_frames = []
    missing_frames = []
    frame_indices = []

    for i in range(num_frames):
        frame_path = os.path.join(cache_subdir, f"frame_{i}.jpg")
        if os.path.exists(frame_path):
            cached_frames.append(frame_path)
        else:
            missing_frames.append(i)
            frame_indices.append(i)

    vr = VideoReader(video_file)
    duration = len(vr)
    fps = vr.get_avg_fps()

    frame_timestamps = [int(duration / num_frames * (i + 0.5)) / fps for i in range(num_frames)]

    if verbosity == "DEBUG":
        print("Already cached {}/{} frames for video {}, enjoy speed!".format(len(cached_frames), num_frames, video_file))

    # If all frames are cached, load them directly
    if not missing_frames:
        return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

    actual_frame_indices = [int(duration / num_frames * (i + 0.5)) for i in missing_frames]

    missing_frames_data = vr.get_batch(actual_frame_indices).asnumpy()

    for idx, frame_index in enumerate(tqdm(missing_frames, desc="Caching rest frames")):
        img = Image.fromarray(missing_frames_data[idx]).convert("RGB")
        frame_path = os.path.join(cache_subdir, f"frame_{frame_index}.jpg")
        img.save(frame_path)
        cached_frames.append(frame_path)

    cached_frames.sort(key=lambda x: int(os.path.basename(x).split('_')[1].split('.')[0]))
    return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps


def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
    """
    Combine multiple images into a single larger image in a grid format.

    Parameters:
        images (list of PIL.Image): Images to display.
        columns (int): Number of columns in the gallery.
        spacing (int): Space (in pixels) between the images in the gallery.
        bg_color (tuple): Background color of the gallery (R, G, B).

    Returns:
        PIL.Image: A single combined image.
    """
    # Open all images and get their sizes
    img_width, img_height = images[0].size  # Assuming all images are of the same size

    # Calculate rows needed for the gallery
    rows = (len(images) + columns - 1) // columns

    # Calculate the size of the final gallery image
    gallery_width = columns * img_width + (columns - 1) * spacing
    gallery_height = rows * img_height + (rows - 1) * spacing

    # Create a new image with the calculated size and background color
    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

    # Paste each image into the gallery
    for index, img in enumerate(images):
        row = index // columns
        col = index % columns

        x = col * (img_width + spacing)
        y = row * (img_height + spacing)

        gallery_image.paste(img, (x, y))

    return gallery_image


def get_placeholders_for_videos(frames: List, timestamps=[]):
    contents = []
    if not timestamps:
        for i, _ in enumerate(frames):
            contents.append({"text": None, "type": "image"})
            contents.append({"text": "\n", "type": "text"})
    else:
        for i, (_, ts) in enumerate(zip(frames, timestamps)):
            contents.extend(
                [
                    {"text": f"[{int(ts)//60:02d}:{int(ts)%60:02d}]", "type": "text"},
                    {"text": None, "type": "image"},
                    {"text": "\n", "type": "text"},
                ]
            )
    return contents


def infer(contents):
    torch.cuda.empty_cache()

    messages = [
        {
            "role": "user",
            "content": [
                *contents,
                {"text": "Please list the burgers that appear in this video, and how they are made.", "type": "text"},
            ],
        }
    ]

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=frames, return_tensors="pt", max_image_size=490)
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        output = model.generate(
            **inputs,
            max_new_tokens=2048,
            stop_strings=["<|im_end|>"],
            tokenizer=processor.tokenizer,
            do_sample=False,
            temperature=0.,
        )
        output_ids = output[0][inputs["input_ids"].shape[1]:]
        result = processor.decode(output_ids, skip_special_tokens=True)

    print(result)


frames, frame_timestamps = load_video(
    "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4",
    num_frames=128,
)
contents = get_placeholders_for_videos(frames, frame_timestamps)
infer(contents)
```

aria-hacker commented 1 month ago

@gehong-coder It seems that you are using the code and model weights from the Hugging Face cache dir /home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec. Please make sure all the .py and .json files there are aligned with the latest configuration.

The recommended way to load the latest Aria is to load it from the official Hub repo: `model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)`. It will automatically check whether those files are up to date.
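For reference, one way to compare the cached snapshot against the latest commit on the Hub is sketched below, using `huggingface_hub` (already a transformers dependency); the `.sha` field is assumed here to hold the latest revision hash.

```python
from huggingface_hub import HfApi

# The directory name under .../snapshots/ in the local cache is the commit hash in use.
local_revision = "5cc2703b3afd585f232ec5027e9c039a2001bcec"
remote_revision = HfApi().model_info("rhymes-ai/Aria").sha

print("cached :", local_revision)
print("latest :", remote_revision)
print("up to date" if local_revision == remote_revision else "stale snapshot, re-download needed")
```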

gehong-coder commented 1 month ago

@aria-hacker I have downloaded the latest version of the model and am using the script Aria/inference/notebooks/04_video_understanding.ipynb on a V100 machine. Then the following comes up (screenshot). So I changed this again (screenshot), but it's still giving me this problem (screenshot). Are you guys sure it will work on a V100?

aria-hacker commented 1 month ago

@gehong-coder In most cases, you should not edit the code inside transformers unless you understand its whole context. I looked into it, and the modification was made in the wrong place, which caused the error. The attention mask is built based on the name of the attention implementation in the config. You directly swapped in the eager attention class, but the configuration still says flash_attention_2, so the mask is built in the flash-attention format (the 2D mask from the traceback) while eager attention expects a 4D mask, which is what causes the error.


You should only modify the config for the vision encoder and the language model; that is how attn_implementation is passed through in the latest code.
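A rough sketch of what that looks like in practice is below. The `vision_config` attribute name and the exact way the setting propagates are assumptions, not necessarily Aria's own code.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)

# Set the implementation on the top-level config and on the vision tower's sub-config,
# so the attention classes and the mask construction both agree on "eager",
# instead of hand-editing modeling_idefics2.py.
config._attn_implementation = "eager"
config.vision_config._attn_implementation = "eager"

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```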