matatonic / openedai-vision

An OpenAI API compatible API for chat with image input and questions about the images, aka Multimodal.
GNU Affero General Public License v3.0

InternVL2 requires FlashAttention, only supports Ampere GPUs or newer #10

Closed: dimitribellini closed this issue 5 days ago

dimitribellini commented 1 month ago

Dear DevTeam, thanks so much for this great tool! During my testing I found a big showstopper: the "FlashAttention" requirement. In my setup I have two Nvidia RTX 8000 boards; these boards are from the Turing family (TU102GL) and do not support FlashAttention. Would it be possible to run the vision models with this library anyway?

I will add some more details. Command used: "python vision.py -m OpenGVLab/InternVL2-1B --device-map cuda:0"

Logs:

  File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 290, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 86, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I did not use the "FlashAttention" flag, but I still receive the error.
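
For reference, a minimal check (assuming only that PyTorch is installed; this is not part of vision.py) of what PyTorch reports for each board. FlashAttention requires compute capability 8.0 or newer, and Turing cards like the RTX 8000 report 7.5, so the hardware check fails regardless of any flag:

# Illustrative check only: print each visible GPU's compute capability.
# FlashAttention refuses anything below 8.0; Turing (RTX 8000) reports 7.5.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} {name}: compute capability {major}.{minor}")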

Thanks so much

matatonic commented 1 month ago

Oh strange, they must enable it by default... I'll take a look soon and see if it can be disabled.

Thanks for the report!

dimitribellini commented 1 month ago

> Oh strange, they must enable it by default... I'll take a look soon and see if it can be disabled.
>
> Thanks for the report!

Thanks so much!!! I'm very happy to be useful :-)

dimitribellini commented 1 month ago

@matatonic Hi, I see a new release; have you found a solution for FlashAttention? Thanks so much

matatonic commented 1 month ago

Not yet, sorry, I've been busy.

dimitribellini commented 1 month ago

Yeah, I can understand! Don't worry :-) Keep in mind, I would like to make a video on my YT channel to present your solution, because I think it's great :-)

Thanks so much

matatonic commented 2 weeks ago

I don't have a good way to test this, but based on the config.json file for OpenGVLab/InternVL2-1B you may be able to disable flash_attn there.

{
...
  "vision_config": {
    ...
    "use_flash_attn": false
  }
}

Now, this is not really advisable, and it essentially corrupts the cached Hugging Face data, but this may work:

edit hf_home/hub/models--OpenGVLab--InternVL2-1B/snapshots/b631bf72a9a7aaf1329d3c523ea00df2854e2163/config.json

(or the latest snapshot folder)
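
If it helps, below is a small sketch that automates the same edit (assumptions: huggingface_hub is installed and the model is already in the local cache; like the manual edit, it rewrites the cached file in place):

# Sketch only: flip "use_flash_attn" off in the cached config.json for InternVL2-1B.
# This modifies the Hugging Face cache in place, as discussed above.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("OpenGVLab/InternVL2-1B", "config.json")

with open(config_path) as f:
    cfg = json.load(f)

cfg.setdefault("vision_config", {})["use_flash_attn"] = False

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)

print("patched", config_path)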

matatonic commented 5 days ago

Just to add an update: I've tried changing the config.json to disable flash_attn and also to disable bfloat16, but it didn't work. Without changing their code, it looks like InternVL2 requires Ampere (CUDA compute capability 8.0) or newer.