matatonic / openedai-vision

An OpenAI API compatible API for chat with image input and questions about the images, aka Multimodal.
GNU Affero General Public License v3.0

InternVL2 requires FlashAttention, only supports Ampere GPUs or newer #10

Closed: dimitribellini closed this issue 5 days ago

dimitribellini commented 1 month ago

Dear DevTeam, thanks so much for this great tool! During my testing I found a big showstopper: the "FlashAttention" requirement. In my setup I have two Nvidia RTX 8000 boards; these boards are from the Turing family (TU102GL) and do not support FlashAttention. Would it be possible to run the vision models with this library anyway?

I will add some more details. Command used: "python vision.py -m OpenGVLab/InternVL2-1B --device-map cuda:0"

Logs:

  File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 290, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 86, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I did not use the "FlashAttention" flag, but I still receive the error.
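
For reference, a minimal check (assuming only that PyTorch is installed; this is not part of vision.py) of what PyTorch reports for each board. FlashAttention requires compute capability 8.0 or newer, and Turing cards like the RTX 8000 report 7.5, so the hardware check fails regardless of any flag:

# Illustrative check only: print each visible GPU's compute capability.
# FlashAttention refuses anything below 8.0; Turing (RTX 8000) reports 7.5.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} {name}: compute capability {major}.{minor}")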

Thanks so much

matatonic commented 1 month ago

Oh strange, they must enable it by default... I'll take a look soon and see if it can be disabled.

Thanks for the report!

dimitribellini commented 1 month ago

> Oh strange, they must enable it by default... I'll take a look soon and see if it can be disabled.
>
> Thanks for the report!

Thanks so much!!! I'm very happy to be useful :-)

dimitribellini commented 1 month ago

@matatonic Hi, I see a new release; have you found a solution for FlashAttention? Thanks so much

matatonic commented 1 month ago

Not yet, sorry, I've been busy.

dimitribellini commented 1 month ago

Yeah, I can understand! Don't worry :-) Keep in mind, I would like to make a video on my YT channel to present your solution, because I think it's great :-)

Thanks so much

matatonic commented 2 weeks ago

I don't have a good way to test this, but based on the config.json file for OpenGVLab/InternVL2-1B you may be able to disable flash_attn there.

{
...
  "vision_config": {
    ...
    "use_flash_attn": false
  }
}

Now, this is not really advisable, and it essentially corrupts the cached Hugging Face data, but this may work:

edit hf_home/hub/models--OpenGVLab--InternVL2-1B/snapshots/b631bf72a9a7aaf1329d3c523ea00df2854e2163/config.json

(or the latest snapshot folder)
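
If it helps, below is a small sketch that automates the same edit (assumptions: huggingface_hub is installed and the model is already in the local cache; like the manual edit, it rewrites the cached file in place):

# Sketch only: flip "use_flash_attn" off in the cached config.json for InternVL2-1B.
# This modifies the Hugging Face cache in place, as discussed above.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("OpenGVLab/InternVL2-1B", "config.json")

with open(config_path) as f:
    cfg = json.load(f)

cfg.setdefault("vision_config", {})["use_flash_attn"] = False

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)

print("patched", config_path)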

matatonic commented 5 days ago

Just to add an update: I've tried changing the config.json to disable flash_attn and also to disable bfloat16, but it didn't work. Without changing their code, it looks like InternVL2 requires Ampere (CUDA compute capability 8.0) or newer.