Open agadetsky opened 1 week ago
@agadetsky, it seems there are differences in how we compute the number of image tokens in the processing code and in the modeling code. This might be related to previous bugs with numerical issues when the image resolution is an edge case among the possible grid resolutions (like 337 here). I'll take a look and see where the precision error is coming from.
Hi @zucchini-nlp, have you managed to identify the issue? I'm encountering the same error while using llava-hf/llava-v1.6-mistral-7b-hf. I haven't pinpointed the specific data causing the error, as it occurs midway through training. Could you also take a look at the modeling file of LLaVA-NeXT? Maybe some calculation in the anyres logic is mismatched?
@chenweize1998 yes, that is most probably the anyres calculations. Unfortunately I didn't have time to look in more detail, will try to have a look today
EDIT: found the place where the precision error occurred and opened a PR to fix it
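For readers following along, here is a toy illustration of how this kind of precision error can make two code paths disagree on a token count. It is a sketch only, not the transformers code: the function names and the "scale the height, then truncate" formula are assumptions chosen for illustration. One version goes through a float ratio, the other uses integer arithmetic only.

```python
def rows_after_unpad_float(orig_h: int, orig_w: int, cur_w: int) -> int:
    """Hypothetical helper: scale the height through a float ratio, then truncate."""
    return int(orig_h * (cur_w / orig_w))


def rows_after_unpad_exact(orig_h: int, orig_w: int, cur_w: int) -> int:
    """Hypothetical helper: same quantity with integer arithmetic only, so nothing is rounded."""
    return (orig_h * cur_w) // orig_w


def find_disagreements(max_side: int, cur_w: int):
    """List the (orig_h, orig_w) pairs on which the two helpers disagree."""
    return [
        (h, w)
        for h in range(1, max_side + 1)
        for w in range(1, max_side + 1)
        if rows_after_unpad_float(h, w, cur_w) != rows_after_unpad_exact(h, w, cur_w)
    ]


if __name__ == "__main__":
    # Deliberately tiny case: 49 * (1 / 49) is slightly below 1.0 in double precision,
    # so truncation drops a whole row that the integer version keeps.
    print(rows_after_unpad_float(49, 49, 1), rows_after_unpad_exact(49, 49, 1))
    print(find_disagreements(max_side=49, cur_w=1))
```

If one side of a pipeline uses the first form and the other side uses the second (or rounds before truncating), the two can differ by one row of features for particular image sizes, which is enough to make the placeholder count and the feature count mismatch.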
@zucchini-nlp Thanks for looking into this! I've pinpointed the batch of data causing the issue and uploaded it here. The problem specifically originates from the first data point in the batch. Hope it helps with debugging.
Additionally, here's a minimal script to reproduce the error (assuming the data point is downloaded as ./tmp.bin):
from transformers import AutoModelForVision2Seq
import torch

# Load the model
model = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.bfloat16
).to("cuda:0")

# Load the problematic input
inputs = torch.load("tmp.bin")

# Note: inputs['input_ids'][0] triggers the error
for k, v in inputs.items():
    inputs[k] = v.to("cuda:0")

# Generate outputs
outputs = model(**inputs)
I'm using torch==2.4.0 and transformers==4.46.2. Let me know if you need more details.
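In case it helps others narrow down a failing sample in their own batch, here is a rough sketch of the check I would run: count the image placeholder tokens in each row and compare against the number of image features expected for that row. Everything here is an assumption for illustration; `image_token_id` and `expected_features` are hypothetical inputs you would need to supply from your own processor and model configuration.

```python
from typing import Dict, List, Sequence, Tuple


def count_placeholders(input_ids: Sequence[Sequence[int]], image_token_id: int) -> List[int]:
    """Count how many times the (hypothetical) image token id appears in each row."""
    return [sum(1 for token in row if token == image_token_id) for row in input_ids]


def find_mismatched_rows(
    input_ids: Sequence[Sequence[int]],
    image_token_id: int,
    expected_features: Sequence[int],
) -> Dict[int, Tuple[int, int]]:
    """Return {row index: (placeholder count, expected feature count)} for rows that differ."""
    counts = count_placeholders(input_ids, image_token_id)
    return {
        i: (count, expected)
        for i, (count, expected) in enumerate(zip(counts, expected_features))
        if count != expected
    }


if __name__ == "__main__":
    # Toy batch: token id 9 stands in for the image placeholder.
    batch = [[1, 9, 9, 2], [9, 3]]
    print(find_mismatched_rows(batch, image_token_id=9, expected_features=[3, 1]))
```

With a tensor batch, `inputs["input_ids"].tolist()` gives the nested lists these helpers expect.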
System Info
transformers version: 4.46.2
Who can help?
@amyeroberts @qubvel @ArthurZucker @itaz
Reproduction
Error is the following
Expected behavior
Given that LLaVA-OneVision can work with any resolution, the model is expected to generate the output successfully.