Thanks! Mistral should be fairly easy to implement, just follow the updates done to Llama! 🤗 Generate will be refactored to support compile soon!
FYI @gante and @zucchini-nlp
I've checked that the following code runs:
import os
from functools import partial
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor, StaticCache
os.environ["TOKENIZERS_PARALLELISM"] = "true"
with torch.inference_mode():
    processor = LlavaNextProcessor.from_pretrained("llava-mistral")
    model = LlavaNextForConditionalGeneration.from_pretrained("llava-mistral", low_cpu_mem_usage=True, torch_dtype=torch.float16).cuda()

    static_cache = partial(StaticCache, dtype=torch.float16)
    model.language_model._setup_cache(static_cache, max_batch_size=1, max_cache_len=4096)
    model.language_model.compile(fullgraph=True)
    model.vision_tower.compile(fullgraph=True)

    url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
    image = Image.open(requests.get(url, stream=True).raw)
    prompt = "[INST] <image>"
    inputs = processor(prompt, image, return_tensors="pt").to(model.device)

    output = model(**inputs)
    print(output)
Hey @sheepymeh 👋
Given that the individual models are compilable (according to your script above), the next step would be to rewrite the logic in between the language model and the vision tower to be compilable as well, such that model.forward can be compiled. It shouldn't require API changes, so we're welcoming contributions 💪
The process is very model-dependent, often requiring padding or clever manipulations to enable it. My suggestion would be to iteratively call the compiled forward pass, see where it crashes, rewrite, and repeat until it runs. Then, after the whole thing compiles, do a second pass to optimize performance (with the aid of a profiler) -- the compiled forward should be significantly faster than the uncompiled one after 2-3 warmup forward passes.
Looking forward to the PR! If you get stuck, let us know :)
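A minimal sketch of that loop, assuming the model and inputs from the script above have already been built (the warmup count and the timing code are only illustrative, not the library's recommended procedure):

import time
import torch

# fullgraph=True makes Dynamo raise on the first graph break instead of
# silently falling back to eager, which is what you want while hunting for
# non-compilable code in the merging logic.
compiled_forward = torch.compile(model.forward, fullgraph=True)

with torch.inference_mode():
    for _ in range(3):  # 2-3 warmup calls trigger compilation / autotuning
        compiled_forward(**inputs)

    torch.cuda.synchronize()
    start = time.perf_counter()
    compiled_forward(**inputs)
    torch.cuda.synchronize()
    print(f"compiled forward: {time.perf_counter() - start:.3f}s")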
Thank you for your suggestions. I'm currently working on a PR, and my current progress is on creating fixed-length tensors for everything. For example, final_embedding would be initialized to a shape of (batch, max_position_embeddings, embed_dim). However, this would create a lot of unnecessary and wasteful padding, which would be passed into the LLM. How could I mitigate this, or is this an acceptable compromise? For reference, max_position_embeddings for llava-1.6-mistral is 32768, so the sequence would be heavily padded most of the time.
For reference, here is the (very unoptimized) version I'm working on:
When you generate, you mostly want the decoding part, where the input ids are only 1 token, to be fast. That part should not require many changes, since the final embeddings are the same size as the input ids.
One thing that could help with this is actually to pre-process the strings, in order to make sure that the input of the model is the only thing that changes in terms of shapes. So instead of creating the final embedding in the model, you replace "Hey <image> is this nice?" with "Hey <image><image>...<image> is this nice?" in the processor. That way the embedding created is already of the correct shape.
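A rough sketch of that idea; expand_image_tokens and num_image_tokens are made-up names for illustration, not the actual processor API, and for LLaVA-NeXT the per-image token count would have to come from the image-size / grid computation:

IMAGE_TOKEN = "<image>"

def expand_image_tokens(text: str, num_image_tokens: int) -> str:
    # Replace the single placeholder with one placeholder per visual token so
    # that the tokenized prompt already has its final, static length and no
    # expansion has to happen inside the compiled forward pass.
    return text.replace(IMAGE_TOKEN, IMAGE_TOKEN * num_image_tokens)

prompt = "[INST] Hey <image> is this nice? [/INST]"
expanded = expand_image_tokens(prompt, num_image_tokens=576)  # e.g. a 24x24 patch grid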
This would completely break compatibility with previous code and the original LLaVA codebase. Is it advisable to maintain two code paths in the processor (one with a single <image> placeholder as before, and one with the expanded placeholders)?
Thank you for the help so far. I've managed to implement @ArthurZucker's suggestion successfully in the preprocessor. However, the unpad_image function:
def unpad_image(tensor, original_size):
    """
    Unpads a PyTorch tensor of a padded and resized image.

    Args:
        tensor (`torch.Tensor`):
            The image tensor, assumed to be of shape (num_channels, height, width).
        original_size (`tuple`):
            The original size of the image (height, width).

    Returns:
        `torch.Tensor`: The unpadded image tensor.
    """
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]

    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    if original_aspect_ratio > current_aspect_ratio:
        scale_factor = current_width / original_width
        new_height = int(original_height * scale_factor)
        padding = (current_height - new_height) // 2
        unpadded_tensor = tensor[:, padding : current_height - padding, :]
    else:
        scale_factor = current_height / original_height
        new_width = int(original_width * scale_factor)
        padding = (current_width - new_width) // 2
        unpadded_tensor = tensor[:, :, padding : current_width - padding]

    return unpadded_tensor
requires original_size to be passed into the model through the preprocessor, and it cannot be in the form of a PyTorch tensor, as slicing by tensor values is unsupported:
torch._dynamo.exc.Unsupported: Dynamic slicing on data-dependent value is not supported
from user code:
File "/home/sheepymeh/transformers/src/transformers/models/llava_next/modeling_llava_next.py", line 427, in unpad_image
unpadded_tensor = tensor[:, padding : current_height - padding, :]
Is it possible to somehow pass these values in as a list of ints? Trying that causes this error when using the .to() function:
Traceback (most recent call last):
File "/home/sheepymeh/transformers/test.py", line 14, in <module>
inputs = inputs.to("cuda:0")
File "/home/sheepymeh/transformers/src/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
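Not sure this is the cleanest route, but one sketch of the "list of ints" idea, assuming the (modified) processor returns image_sizes as a plain nested list rather than a tensor, is to skip BatchFeature.to() and move only the tensor entries:

inputs = processor(prompt, image, return_tensors="pt")
image_sizes = inputs.pop("image_sizes")  # kept as a plain list of (height, width) ints
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}  # only tensors remain here
output = model(**inputs, image_sizes=image_sizes)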
Not super sure, but Llava does not require this; only Llava-next does.
Yes, it's only Llava-next. Btw, not sure how helpful this info is, but in the linked PR we had to convert "image sizes" to a list inside "modeling.py", because there were mismatches in the way resolutions are calculated in the image processor vs on the modeling side.
Thanks for the inputs. Unfortunately, it's not possible to convert tensors to lists in a compiled PyTorch model, so we would need to implement vectorized image_size handling first.
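One possible direction for that, sketched only (the function name and the mask-based formulation below are not from the library): compute the unpad_image geometry for the whole batch as tensor math and return a boolean mask of the valid region instead of a sliced tensor, so every shape stays static.

import torch

def unpad_valid_mask(current_size, image_sizes):
    # current_size: (height, width) of the padded feature map, as Python ints.
    # image_sizes: (batch, 2) tensor of original (height, width) pairs.
    # Mirrors the geometry of unpad_image, vectorized over the batch, but
    # marks the non-padded region with a mask rather than slicing it out.
    cur_h, cur_w = current_size
    orig_h = image_sizes[:, 0].float()
    orig_w = image_sizes[:, 1].float()

    # The smaller scale factor is the one unpad_image picks in its two branches.
    scale = torch.minimum(cur_h / orig_h, cur_w / orig_w)
    new_h = (orig_h * scale).long()
    new_w = (orig_w * scale).long()
    pad_h = (cur_h - new_h) // 2  # rows of padding above/below
    pad_w = (cur_w - new_w) // 2  # columns of padding left/right

    rows = torch.arange(cur_h, device=image_sizes.device).view(1, cur_h, 1)
    cols = torch.arange(cur_w, device=image_sizes.device).view(1, 1, cur_w)
    row_ok = (rows >= pad_h.view(-1, 1, 1)) & (rows < cur_h - pad_h.view(-1, 1, 1))
    col_ok = (cols >= pad_w.view(-1, 1, 1)) & (cols < cur_w - pad_w.view(-1, 1, 1))
    return row_ok & col_ok  # (batch, cur_h, cur_w) boolean mask of valid positions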
Feature request
As per #28981, LLaVA is planned to receive torch.compile support. Given that LLaVA is composed of a vision tower and an LLM, both of which can be compiled separately with fullgraph=True (after support has been added, which is not yet the case for Mistral), it seems much easier to compile both parts separately as well.

Motivation

The _merge_input_ids_with_image_features function that connects the two parts is difficult to compile, as PyTorch has yet to add support for many of the functions used, which require dynamic input sizes; these are necessary here because the number of input image tokens is subject to change.

Your contribution
I'd love to try submitting a PR if possible but I'm not sure what the best way to do so is given the current circumstances.
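Returning to the Motivation above, a toy illustration (not the real _merge_input_ids_with_image_features, and the token id is just an example) of the data-dependent indexing that makes the merge step hard to capture in a single graph:

import torch

def toy_merge(input_ids, text_embeds, image_embeds, image_token_id=32000):
    # The number of <image> positions depends on the *values* in input_ids, so
    # the shapes produced by nonzero()/boolean indexing change from input to
    # input; this is exactly the kind of data-dependent shape that fullgraph
    # compilation cannot handle today.
    image_positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    merged = text_embeds.clone()
    merged[image_positions] = image_embeds[: image_positions.numel()]
    return merged

# Example with 1D sequences: two text tokens around three image positions.
input_ids = torch.tensor([1, 32000, 32000, 32000, 2])
text_embeds = torch.zeros(5, 8)
image_embeds = torch.ones(3, 8)
print(toy_merge(input_ids, text_embeds, image_embeds).sum())  # tensor(24.)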