Closed — HighSec-org closed this issue 2 weeks ago.
@HighSec-org Can we move the discussion to https://huggingface.co/dunzhang/stella_en_400M_v5/discussions/23? Thanks!
RE: xformers:
It's an xformers limitation (https://github.com/facebookresearch/xformers). @HighSec-org, in case you don't want to follow the link above: I think if capability 7.5 is not supported, it's for reasons similar to flash-attn (missing shared-memory control / hardware support for bf16, etc.).
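For context, a minimal sketch of the kind of fallback a caller could apply on pre-Ampere cards (the helper name is hypothetical, not an existing infinity_emb or xformers API): use fp16 instead of bf16 when the device capability is below (8, 0).

```python
import torch

def pick_attention_dtype() -> torch.dtype:
    # bf16 memory-efficient attention kernels need compute capability >= (8, 0)
    # (Ampere/A100 and newer); a Quadro RTX 8000 reports (7, 5), so fall back to fp16.
    major, minor = torch.cuda.get_device_capability()
    return torch.bfloat16 if (major, minor) >= (8, 0) else torch.float16
```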
System Info
OS version: linux
Model being used: dunzhang/stella_en_400M_v5
Hardware used: Quadro RTX 8000
PyTorch version: 2.5.1

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
```
Information
Tasks
Reproduction
```
infinity_emb v2 --device cuda --model-id dunzhang/stella_en_400M_v5
```

Produces error:

```
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 3, 16, 64) (torch.bfloat16)
     key         : shape=(1, 3, 16, 64) (torch.bfloat16)
     value       : shape=(1, 3, 16, 64) (torch.bfloat16)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
     p           : 0.0
`fa2F@2.6.3` is not supported because:
    requires device with capability > (8, 0) but your GPU has capability (7, 5) (too old)
    bf16 is only supported on A100+ GPUs
`cutlassF-pt` is not supported because:
    bf16 is only supported on A100+ GPUs
```

However, the package says these operators are available. `python -m xformers.info`:
```
xFormers 0.0.29.dev939
memory_efficient_attention.ckF:                    unavailable
memory_efficient_attention.ckB:                    unavailable
memory_efficient_attention.ck_decoderF:            unavailable
memory_efficient_attention.ck_splitKF:             unavailable
memory_efficient_attention.cutlassF-pt:            available
memory_efficient_attention.cutlassB-pt:            available
memory_efficient_attention.fa2F@2.6.3:             available
memory_efficient_attention.fa2B@2.6.3:             available
memory_efficient_attention.fa3F@0.0.0:             unavailable
memory_efficient_attention.fa3B@0.0.0:             unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        unavailable
indexing.scaled_index_addB:                        unavailable
indexing.index_select:                             unavailable
sequence_parallel_fused.write_values:              available
sequence_parallel_fused.wait_values:               available
sequence_parallel_fused.cuda_memset_32b_async:     available
sp24.sparse24_sparsify_both_ways:                  available
sp24.sparse24_apply:                               available
sp24.sparse24_apply_dense_output:                  available
sp24._sparse24_gemm:                               available
sp24._cslt_sparse_mm_search@0.6.2:                 available
sp24._cslt_sparse_mm@0.6.2:                        available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
pytorch.version:                                   2.5.1+cu124
pytorch.cuda:                                      available
gpu.compute_capability:                            7.5
gpu.name:                                          Quadro RTX 8000
dcgm_profiler:                                     unavailable
build.info:                                        available
build.cuda_version:                                1201
build.hip_version:                                 None
build.python_version:                              3.10.15
build.torch_version:                               2.5.1+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    6.0+PTX 7.0 7.5 8.0+PTX 9.0a
build.env.PYTORCH_ROCM_ARCH:                       None
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              -allow-unsupported-compiler
build.env.XFORMERS_PACKAGE_FROM:                   wheel-main
build.nvcc_version:                                12.1.66
source.privacy:                                    open source
```
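So the operators are reported as available, yet calling them with bf16 inputs still fails at dispatch time. A minimal sketch to reproduce this outside infinity_emb (shapes taken from the error above; the real call also passes a BlockDiagonalMask attn_bias, omitted here):

```python
import torch
import xformers.ops as xops

device = torch.device("cuda")
shape = (1, 3, 16, 64)  # (batch, seq_len, heads, head_dim), as in the error above

# bf16 inputs: raises NotImplementedError ("No operator found ...") on capability (7, 5)
q, k, v = (torch.randn(shape, device=device, dtype=torch.bfloat16) for _ in range(3))
try:
    xops.memory_efficient_attention(q, k, v)
except NotImplementedError as e:
    print(e)

# The same call with fp16 inputs should dispatch to a supported kernel (e.g. cutlassF)
q, k, v = (torch.randn(shape, device=device, dtype=torch.float16) for _ in range(3))
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)
```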
So it seems that 8.0+PTX needs to be added somewhere: xformers.info reports fa2F@2.6.3 and cutlassF-pt as available, but the check only looks for TORCH_CUDA_ARCH 8.0 and fails on 8.0+PTX.
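To illustrate what I mean (purely hypothetical sketch, not how xformers actually performs the check): a PTX-aware match would strip the +PTX suffix before comparing, so 8.0+PTX in TORCH_CUDA_ARCH_LIST would satisfy an 8.0 requirement.

```python
import os

def arch_list_satisfies(required: str = "8.0") -> bool:
    # Hypothetical illustration: treat "8.0+PTX" in TORCH_CUDA_ARCH_LIST as matching "8.0".
    # Default mirrors the build.env.TORCH_CUDA_ARCH_LIST shown in xformers.info above.
    archs = os.environ.get("TORCH_CUDA_ARCH_LIST", "6.0+PTX 7.0 7.5 8.0+PTX 9.0a").split()
    return required in {arch.removesuffix("+PTX") for arch in archs}
```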