vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Overflow with CPU Option #120

Open hypernovas opened 1 month ago

hypernovas commented 1 month ago

Hi Vik,

Thanks for all the help! It works perfectly with the CUDA option. Have you seen this before when using CPU?

The model is loaded by:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = "cpu"
DTYPE = torch.float32 if DEVICE == "cpu" else torch.float16 # CPU doesn't support float16
MD_REVISION = "2024-07-23"
local_model_folder = './checkpoints/moondream-ft'
model_name = "vikhyatk/moondream2"

tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2", revision=MD_REVISION)
moondream = AutoModelForCausalLM.from_pretrained(
    model_name, revision=MD_REVISION, trust_remote_code=True,
    attn_implementation="flash_attention_2" if DEVICE == "cuda" else None,
    torch_dtype=DTYPE, device_map={"": DEVICE}
)
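
As a quick aside (not in the original post, just a hedged sanity check), this confirms what precision the weights actually ended up in on CPU; `moondream` is the model object loaded above:

# On CPU with DTYPE=torch.float32, every parameter should report torch.float32.
print(next(moondream.parameters()).dtype)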

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 25
     23 img_path = "./test.jpg"
     24 prompt = "Describe any defects"
---> 25 answer = run_model(img_path, prompt)
     26 display(answer)

Cell In[28], line 17, in run_model(img_path, prompt, scale)
     14     resized_img = resize_image(rgb_img)
     16 # Encode and get answer from the model
---> 17 answer = moondream.answer_question(
     18     moondream.encode_image(resized_img), prompt, tokenizer
     19 )
     20 return answer

File ~/.cache/huggingface/modules/transformers_modules/vikhyatk/moondream2/79671eae7b5340017e91065d09c1ce1a352c0e8d/moondream.py:99, in Moondream.answer_question(self, image_embeds, question, tokenizer, chat_history, result_queue, **kwargs)
     89 def answer_question(
     90     self,
     91     image_embeds,
   (...)
     96     **kwargs,
     97 ):
     98     prompt = f"<image>\n\n{chat_history}Question: {question}\n\nAnswer:"
---> 99     answer = self.generate(
    100         image_embeds,
    101         prompt,
    102         tokenizer=tokenizer,
    103         max_new_tokens=512,
    104         **kwargs,
    105     )[0]
    106     cleaned_answer = answer.strip()
    108     # Use the result_queue to pass the result if it is provided

File ~/.cache/huggingface/modules/transformers_modules/vikhyatk/moondream2/79671eae7b5340017e91065d09c1ce1a352c0e8d/moondream.py:83, in Moondream.generate(self, image_embeds, prompt, tokenizer, max_new_tokens, **kwargs)
     81 with torch.no_grad():
     82     inputs_embeds = self.input_embeds(prompt, image_embeds, tokenizer)
---> 83     output_ids = self.text_model.generate(
     84         inputs_embeds=inputs_embeds, **generate_config
     85     )
     87 return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/transformers/generation/utils.py:1914, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1906     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1907         input_ids=input_ids,
   1908         expand_size=generation_config.num_return_sequences,
   1909         is_encoder_decoder=self.config.is_encoder_decoder,
   1910         **model_kwargs,
   1911     )
   1913     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 1914     result = self._sample(
   1915         input_ids,
   1916         logits_processor=prepared_logits_processor,
   1917         logits_warper=prepared_logits_warper,
   1918         stopping_criteria=prepared_stopping_criteria,
   1919         generation_config=generation_config,
   1920         synced_gpus=synced_gpus,
   1921         streamer=streamer,
   1922         **model_kwargs,
   1923     )
   1925 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   1926     # 11. prepare logits warper
   1927     prepared_logits_warper = (
   1928         self._get_logits_warper(generation_config, device=input_ids.device)
   1929         if generation_config.do_sample
   1930         else None
   1931     )

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/transformers/generation/utils.py:2651, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
   2648 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2650 # forward pass to get next token
-> 2651 outputs = self(
   2652     **model_inputs,
   2653     return_dict=True,
   2654     output_attentions=output_attentions,
   2655     output_hidden_states=output_hidden_states,
   2656 )
   2658 if synced_gpus and this_peer_finished:
   2659     continue  # don't waste resources running the code we don't need

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.cache/huggingface/modules/transformers_modules/vikhyatk/moondream2/79671eae7b5340017e91065d09c1ce1a352c0e8d/modeling_phi.py:1051, in PhiForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1046 return_dict = (
   1047     return_dict if return_dict is not None else self.config.use_return_dict
   1048 )
   1050 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1051 outputs = self.transformer(
   1052     input_ids=input_ids,
   1053     attention_mask=attention_mask,
   1054     position_ids=position_ids,
   1055     past_key_values=past_key_values,
   1056     inputs_embeds=inputs_embeds,
   1057     use_cache=use_cache,
   1058     output_attentions=output_attentions,
   1059     output_hidden_states=output_hidden_states,
   1060     return_dict=return_dict,
   1061 )
   1063 hidden_states = outputs[0]
   1064 logits = self.lm_head(hidden_states)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.cache/huggingface/modules/transformers_modules/vikhyatk/moondream2/79671eae7b5340017e91065d09c1ce1a352c0e8d/modeling_phi.py:878, in PhiModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    871     attention_mask = (
    872         attention_mask
    873         if (attention_mask is not None and 0 in attention_mask)
    874         else None
    875     )
    876 else:
    877     # 4d mask is passed through the layers
--> 878     attention_mask = _prepare_4d_causal_attention_mask(
    879         attention_mask,
    880         (batch_size, seq_length),
    881         inputs_embeds,
    882         past_key_values_length,
    883     )
    885 hidden_states = inputs_embeds
    887 # decoder layers

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:321, in _prepare_4d_causal_attention_mask(attention_mask, input_shape, inputs_embeds, past_key_values_length, sliding_window)
    319 # 4d mask is passed through the layers
    320 if attention_mask is not None and len(attention_mask.shape) == 2:
--> 321     attention_mask = attn_mask_converter.to_4d(
    322         attention_mask, input_shape[-1], key_value_length=key_value_length, dtype=inputs_embeds.dtype
    323     )
    324 elif attention_mask is not None and len(attention_mask.shape) == 4:
    325     expected_shape = (input_shape[0], 1, input_shape[1], key_value_length)

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:121, in AttentionMaskConverter.to_4d(self, attention_mask_2d, query_length, dtype, key_value_length)
    116         raise ValueError(
    117             "This attention mask converter is causal. Make sure to pass `key_value_length` to correctly create a causal mask."
    118         )
    120     past_key_values_length = key_value_length - query_length
--> 121     causal_4d_mask = self._make_causal_mask(
    122         input_shape,
    123         dtype,
    124         device=attention_mask_2d.device,
    125         past_key_values_length=past_key_values_length,
    126         sliding_window=self.sliding_window,
    127     )
    128 elif self.sliding_window is not None:
    129     raise NotImplementedError("Sliding window is currently only implemented for causal masking")

File ~/anaconda3/envs/ray_py310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:156, in AttentionMaskConverter._make_causal_mask(input_ids_shape, dtype, device, past_key_values_length, sliding_window)
    152 """
    153 Make causal mask used for bi-directional self-attention.
    154 """
    155 bsz, tgt_len = input_ids_shape
--> 156 mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
    157 mask_cond = torch.arange(mask.size(-1), device=device)
    158 mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)

RuntimeError: value cannot be converted to type at::Half without overflow
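
For context, a hedged minimal sketch of the mechanism behind this message (an assumption, not a confirmed diagnosis of this environment): the failing line in _make_causal_mask fills the mask with torch.finfo(dtype).min but creates the tensor without an explicit dtype=, so it falls back to the process default dtype. If that default is half while dtype is float32, the float32 minimum cannot be represented and exactly this RuntimeError is raised:

import torch

# The float32 minimum (about -3.4e38) does not fit in float16, so the fill overflows.
torch.full((4, 4), torch.finfo(torch.float32).min, dtype=torch.float16)
# RuntimeError: value cannot be converted to type at::Half without overflow

# Checking the default dtype in the failing session is one way to narrow down
# whether half precision is sneaking in despite DEVICE == "cpu".
print(torch.get_default_dtype())
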
vikhyat commented 1 month ago

Haven't seen it before, looks like it's coming from the transformers library. Can you share the image/prompt so I can try to reproduce?

vikhyat commented 1 month ago

FYI we're also very close to shipping llama.cpp based inference code that will run a lot faster on CPU than the PyTorch implementation. Development on that is going on in the moondream-ggml branch here: https://github.com/vikhyat/moondream/tree/moondream-ggml

hypernovas commented 1 month ago

Sure, thanks!

(attached image: 1000012261)


from PIL import Image

def resize_image(img, max_dimension=300):
    # Calculate the ratio to resize by
    ratio = max_dimension / max(img.size)
    new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
    # Resize the image using LANCZOS resampling, recommended for downsampling
    return img.resize(new_size, Image.Resampling.LANCZOS)

def run_model(img_path, prompt, scale=4.2):
    # Open, convert to RGB, and resize the image
    with Image.open(img_path) as img:
        rgb_img = img.convert('RGB')  # Convert to RGB
        resized_img = resize_image(rgb_img)

    # Encode and get answer from the model
    answer = moondream.answer_question(
        moondream.encode_image(resized_img), prompt, tokenizer
    )
    return answer

# Usage example
img_path = "./test.jpg"
prompt = "Describe any defects"
answer = run_model(img_path, prompt)
display(answer)

hypernovas commented 1 month ago

The library versions:

!pip install accelerate==0.32.1 huggingface-hub==0.24.0 Pillow==10.4.0 torch==2.3.1 torchvision==0.18.1 transformers==4.42.2 einops==0.8.0 gradio==4.38.1
!pip install flash-attn==2.6.2 datasets==2.20.0