Open gehong-coder opened 1 month ago
We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder
model = AutoModelForCausalLM.from_pretrained(
"rhymes-ai/Aria",
device_map="auto",
torch_dtype=torch.bfloat16,
trust_remote_code=True, # Corrected 'true' to 'True'
attn_implementation="eager",
)
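For V100 users, here is a minimal sketch (not from the Aria repo) that picks the attention backend from the GPU's compute capability — FlashAttention 2 needs Ampere (sm_80) or newer, while a V100 is sm_70:

import torch
from transformers import AutoModelForCausalLM

# Choose the attention implementation based on the detected GPU.
major, _ = torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)
attn_impl = "flash_attention_2" if major >= 8 else "eager"

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation=attn_impl,
)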
Hello, this problem occurs after I use the settings above. It seems that setting attn_implementation="eager" here does not make the model use eager attention internally.
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward hidden_states, attn_weights = self.self_attn( return super().apply(*args, **kwargs) # type: ignore[misc] File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 619, in forward out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd( RuntimeError: FlashAttention only supports Ampere GPUs or newer.
So I went into modeling_idefics2.py and changed line 442, self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES[config._attn_implementation], replacing config._attn_implementation with "eager". Then the following appears:

  File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
    image_outputs, image_attn_mask = self.vision_tower(
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
    vit_oup = self.vision_model(
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
    encoder_outputs = self.encoder(
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
    layer_outputs = encoder_layer(
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
    raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])
@gehong-coder Is your local model updated to the latest rhymes-ai/Aria repo? We updated it yesterday
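One way to make sure the local cache matches the latest revision is to re-sync it with the standard Hugging Face Hub API; this is a generic sketch, not Aria-specific tooling:

from huggingface_hub import snapshot_download

# Re-download any files that changed on the hub; unchanged cached files are reused.
local_dir = snapshot_download("rhymes-ai/Aria", revision="main")
print(local_dir)  # up-to-date snapshot path that can be passed to from_pretrained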
I have updated the model, but the error still appears. Is it because grouped_gemm is not installed?
grouped_gemm is not installed, using sequential GEMM, which is slower.
AriaMoELMForCausalLM has generative capabilities, as prepare_inputs_for_generation is explicitly overwritten. However, it doesn't directly inherit from GenerationMixin. From v4.50 onwards, PreTrainedModel will NOT inherit from GenerationMixin, and this model will lose the ability to call generate and other related functions.
- If you're using trust_remote_code=True, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from GenerationMixin (after PreTrainedModel, otherwise you'll get an exception).
torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:601: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
The seen_tokens attribute is deprecated and will be removed in v4.41. Use the cache_position model input instead.
Traceback (most recent call last):
  File "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py", line 166, in

Eager attention is not working, and I am not able to run the model on V100s. Could you please help with this feature?
@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? And what is the version of your transformers?
python        3.10
tokenizers    0.20.1
torch         2.4.0
torchvision   0.19.0
tqdm          4.66.5
transformers  4.45.0
triton        3.0.0
this is my code:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from decord import VideoReader
from tqdm import tqdm
from typing import List
import os
def load_model():
    model_id_or_path = "/home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec"
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Corrected 'true' to 'True'
        attn_implementation="eager",
    )
    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
    return model, processor
model, processor = load_model()

def load_video(video_file, num_frames=128, cache_dir="cached_video_frames", verbosity="DEBUG"):
    os.makedirs(cache_dir, exist_ok=True)
    video_basename = os.path.basename(video_file)
    cache_subdir = os.path.join(cache_dir, f"{video_basename}_{num_frames}")
    os.makedirs(cache_subdir, exist_ok=True)

    cached_frames = []
    missing_frames = []
    frame_indices = []
    for i in range(num_frames):
        frame_path = os.path.join(cache_subdir, f"frame_{i}.jpg")
        if os.path.exists(frame_path):
            cached_frames.append(frame_path)
        else:
            missing_frames.append(i)
            frame_indices.append(i)

    vr = VideoReader(video_file)
    duration = len(vr)
    fps = vr.get_avg_fps()
    frame_timestamps = [int(duration / num_frames * (i + 0.5)) / fps for i in range(num_frames)]

    if verbosity == "DEBUG":
        print("Already cached {}/{} frames for video {}, enjoy speed!".format(len(cached_frames), num_frames, video_file))

    # If all frames are cached, load them directly
    if not missing_frames:
        return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

    actual_frame_indices = [int(duration / num_frames * (i + 0.5)) for i in missing_frames]
    missing_frames_data = vr.get_batch(actual_frame_indices).asnumpy()

    for idx, frame_index in enumerate(tqdm(missing_frames, desc="Caching rest frames")):
        img = Image.fromarray(missing_frames_data[idx]).convert("RGB")
        frame_path = os.path.join(cache_subdir, f"frame_{frame_index}.jpg")
        img.save(frame_path)
        cached_frames.append(frame_path)

    cached_frames.sort(key=lambda x: int(os.path.basename(x).split('_')[1].split('.')[0]))
    return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps
def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
    """
    Combine multiple images into a single larger image in a grid format.

    Parameters:
        images (list of PIL.Image): Images to display in the gallery.
        columns (int): Number of columns in the gallery.
        spacing (int): Space (in pixels) between the images in the gallery.
        bg_color (tuple): Background color of the gallery (R, G, B).

    Returns:
        PIL.Image: A single combined image.
    """
    # Open all images and get their sizes
    img_width, img_height = images[0].size  # Assuming all images are of the same size

    # Calculate rows needed for the gallery
    rows = (len(images) + columns - 1) // columns

    # Calculate the size of the final gallery image
    gallery_width = columns * img_width + (columns - 1) * spacing
    gallery_height = rows * img_height + (rows - 1) * spacing

    # Create a new image with the calculated size and background color
    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

    # Paste each image into the gallery
    for index, img in enumerate(images):
        row = index // columns
        col = index % columns
        x = col * (img_width + spacing)
        y = row * (img_height + spacing)
        gallery_image.paste(img, (x, y))

    return gallery_image
def get_placeholders_for_videos(frames: List, timestamps=[]):
    contents = []
    if not timestamps:
        for _ in frames:
            contents.append({"text": None, "type": "image"})
            contents.append({"text": "\n", "type": "text"})
    else:
        for _, ts in zip(frames, timestamps):
            contents.extend(
                [
                    {"text": f"[{int(ts)//60:02d}:{int(ts)%60:02d}]", "type": "text"},
                    {"text": None, "type": "image"},
                    {"text": "\n", "type": "text"},
                ]
            )
    return contents
def infer(contents):
    torch.cuda.empty_cache()

    messages = [
        {
            "role": "user",
            "content": [
                *contents,
                {"text": "Please list the burgers that appear in this video, and how they are made.", "type": "text"},
            ],
        }
    ]

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=frames, return_tensors="pt", max_image_size=490)
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        output = model.generate(
            **inputs,
            max_new_tokens=2048,
            stop_strings=["<|im_end|>"],
            tokenizer=processor.tokenizer,
            do_sample=False,
            temperature=0.,
        )

    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)
    print(result)
frames, frame_timestamps = load_video("/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4", num_frames=128)
contents = get_placeholders_for_videos(frames, frame_timestamps)
infer(contents)
@gehong-coder
It seems that you are using the code and model weights from the Hugging Face cache dir /home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec.
Please make sure all the .py and .json files there are aligned with the latest configuration.
The recommended way to load the latest Aria is to load it from the official online repo: model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True). This will automatically check whether those files are up to date.
@aria-hacker I have downloaded the latest version of the model and am using the script Aria/inference/notebooks/04_video_understanding.ipynb on a V100 machine. Then the following comes up, so I changed this again, but it's still giving me the same problem... Are you sure it will work on a V100?
@gehong-coder In most cases you should not edit the code inside transformers unless you understand its whole context. I looked into it: the modification was made in the wrong way, and that is what caused the error. The attention mask is built based on the configured attention implementation name, but you only swapped the attention class directly, so the configuration still says flash_attention_2 and the mask is built the FA2 way (a 2D padding mask) while eager attention expects a 4D mask. That mismatch is exactly the ValueError you see.
You should only modify the config for the vision encoder and the model; that's how we pass attn_implementation in the latest code.
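A sketch of that config-level approach; the sub-config attribute names (vision_config, _attn_implementation) are assumptions based on how similar multimodal models are configured, not verified against the Aria repo:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)
# Propagate the implementation to the sub-configs so that the attention classes
# and the attention-mask construction agree (attribute names are assumed).
config._attn_implementation = "eager"
if getattr(config, "vision_config", None) is not None:
    config.vision_config._attn_implementation = "eager"

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",  # keep the top-level kwarg consistent with the config
)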
The V100 cannot use flash attention, so I switched to eager to compute attention, self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES["eager"], but the following error occurred:
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward encoder_outputs = self.encoder( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward layer_outputs = encoder_layer( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward hidden_states, attn_weights = self.self_attn( File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward output = module._old_forward(args, kwargs) File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward raise ValueError( ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])