mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

On Jetson Orin with xFormers: "Memory-efficient attention, SwiGLU, sparse and more won't be available" #80

Open cj401 opened 9 months ago

cj401 commented 9 months ago

Hi Mistral AI team,

Thanks for sharing the great work.

I was trying to run Mistral-7B on a Jetson Orin with JetPack (R35 release, REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64, DATE: Tue Aug 1 19:57:35 UTC 2023).

I built Triton (OpenAI) and xformers from source without problems.

However, when I tried to run

python -m main demo /path/to/mistral-7B-v0.1/

I got the following errors:

python -m main demo mistral-7B-v0.1/
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0a0+41361538.nv23.06 with CUDA 1104 (you have 2.1.0a0+41361538.nv23.06)
    Python  3.8.10 (you have 3.8.10)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details

raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 27, 32, 128) (torch.float16)
     key         : shape=(1, 27, 32, 128) (torch.float16)
     value       : shape=(1, 27, 32, 128) (torch.float16)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalLocalAttentionMask'>
     p           : 0.0
`decoderF` is not supported because:
    xFormers wasn't build with CUDA support
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalLocalAttentionMask'>
    operator wasn't built - see `python -m xformers.info` for more info
`flshattF@0.0.0` is not supported because:
    xFormers wasn't build with CUDA support
`tritonflashattF` is not supported because:
    xFormers wasn't build with CUDA support
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalLocalAttentionMask'>
    operator wasn't built - see `python -m xformers.info` for more info
    triton is not available
    Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    xFormers wasn't build with CUDA support
    dtype=torch.float16 (supported: {torch.float32})
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalLocalAttentionMask'>
    operator wasn't built - see `python -m xformers.info` for more info
    unsupported embed per head: 128
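
To narrow it down, the failing op can be exercised directly with a small xformers call outside of mistral-inference. This is only a sketch: the tensor shapes are taken from the trace above, and I am assuming the `BlockDiagonalCausalLocalAttentionMask` in the trace is produced by a causal block-diagonal mask restricted to a sliding window via `make_local_attention` (4096 being Mistral-7B's sliding window).

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

# Shapes taken from the error message: (batch=1, seq=27, heads=32, head_dim=128), fp16 on GPU.
q = torch.randn(1, 27, 32, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Assumption: the sliding-window causal mask from the trace is built like this
# (seqlens must sum to the packed sequence length, here 27).
mask = BlockDiagonalCausalMask.from_seqlens([27]).make_local_attention(4096)

# If no fused attention kernel was built for this GPU, this raises the same
# NotImplementedError as above.
out = memory_efficient_attention(q, k, v, attn_bias=mask, p=0.0)
print(out.shape)
```

If this minimal call fails the same way, the problem is in the xformers build rather than in mistral-inference itself.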

Then I tried

python3 -m xformers.info

I got:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0a0+41361538.nv23.06 with CUDA 1104 (you have 2.1.0a0+41361538.nv23.06)
    Python  3.8.10 (you have 3.8.10)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
xFormers 0.0.24+40d3967.d20231209
memory_efficient_attention.cutlassF:               unavailable
memory_efficient_attention.cutlassB:               unavailable
memory_efficient_attention.decoderF:               unavailable
memory_efficient_attention.flshattF@0.0.0:         available
memory_efficient_attention.flshattB@0.0.0:         available
memory_efficient_attention.smallkF:                unavailable
memory_efficient_attention.smallkB:                unavailable
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             unavailable
swiglu.gemm_fused_operand_sum:                     unavailable
swiglu.fused.p.cpp:                                not built
is_triton_available:                               True
pytorch.version:                                   2.1.0a0+41361538.nv23.06
pytorch.cuda:                                      available
gpu.compute_capability:                            8.7
gpu.name:                                          Orin
build.info:                                        available
build.cuda_version:                                1104
build.python_version:                              3.8.10
build.torch_version:                               2.1.0a0+41361538.nv23.06
build.env.TORCH_CUDA_ARCH_LIST:                    None
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   None
source.privacy:                                    open source

It seems to be an issue with xformers. I have also submitted an issue to the xformers repo here.
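
One thing that stands out in the info dump is `build.env.TORCH_CUDA_ARCH_LIST: None` together with `gpu.compute_capability: 8.7`, so my guess is that the C++/CUDA extensions were not compiled for the Orin's sm_87 and therefore fail to load. A quick check of the runtime side (a sketch using only standard PyTorch calls):

```python
import torch

# Confirm what the xformers build needs to target on this board.
print(torch.cuda.is_available())            # should match "pytorch.cuda: available" above
print(torch.cuda.get_device_capability(0))  # expected (8, 7) on Jetson Orin
print(torch.version.cuda)                   # runtime CUDA; build.cuda_version 1104 should correspond to 11.4
```

If that confirms 8.7, rebuilding xformers from source with `TORCH_CUDA_ARCH_LIST="8.7"` set in the environment, and then re-running `python -m xformers.info` to see whether `cutlassF` flips to available, might be worth a try. I have not verified this on Jetson, so treat it as a guess rather than a fix.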