microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] EvoFormer test case fails on H100s #5052

Open asaiacai opened 5 months ago

asaiacai commented 5 months ago

Describe the bug: The EvoFormer attention kernel test case fails nondeterministically on H100s.

To Reproduce: Run pytest -s tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py

Expected behavior: All test cases pass. They passed for me on A100s.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
evoformer_attn ......... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/paperspace/miniconda3/envs/diffuse/lib/python3.10/site-packages/torch']
torch version .................... 2.2.0+cu121
deepspeed install path ........... ['/home/paperspace/miniconda3/envs/diffuse/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 806.13 GB

System info (please complete the following information):

Environment

torch==2.2.0
deepspeed==0.13.1

Additional context: The test cases pass on 4x A100-80GB.

loadams commented 5 months ago

Can you share some of the errors that you are seeing?

loadams commented 5 months ago

@asaiacai - could you share the error output?

asaiacai commented 5 months ago

this was my output

$ pytest -s tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py
============================= test session starts ==============================
platform linux -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0 -- /usr/local/bin/python3
cachedir: .pytest_cache
rootdir: /home/paperspace/DeepSpeed/tests
configfile: pytest.ini
plugins: anyio-4.0.0
collecting ... [2024-02-14 06:30:47,448] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
collected 4 items                                                              

tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape0-dtype0] Using /home/paperspace/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Creating extension directory /home/paperspace/.cache/torch_extensions/py311_cu121/evoformer_attn...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/paperspace/.cache/torch_extensions/py311_cu121/evoformer_attn/build.ninja...
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] c++ -MMD -MF attention.o.d -DTORCH_EXTENSION_NAME=evoformer_attn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/paperspace/cutlass/include -I/home/paperspace/cutlass/tools/util/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/TH -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -c /home/paperspace/.local/lib/python3.11/site-packages/deepspeed/ops/csrc/deepspeed4science/evoformer_attn/attention.cpp -o attention.o 
[2/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=evoformer_attn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/paperspace/cutlass/include -I/home/paperspace/cutlass/tools/util/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/TH -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90,code=compute_90 -DGPU_ARCH=90 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/paperspace/.local/lib/python3.11/site-packages/deepspeed/ops/csrc/deepspeed4science/evoformer_attn/attention_cu.cu -o attention_cu.cuda.o 
[3/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=evoformer_attn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/paperspace/cutlass/include -I/home/paperspace/cutlass/tools/util/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/TH -isystem /home/paperspace/.local/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90,code=compute_90 -DGPU_ARCH=90 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/paperspace/.local/lib/python3.11/site-packages/deepspeed/ops/csrc/deepspeed4science/evoformer_attn/attention_back.cu -o attention_back.cuda.o 
[4/4] c++ attention.o attention_back.cuda.o attention_cu.cuda.o -shared -lcurand -L/home/paperspace/.local/lib/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o evoformer_attn.so
Loading extension module evoformer_attn...
Time to load evoformer_attn op: 308.550683259964 seconds
PASSED
tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape0-dtype1] PASSED
tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape1-dtype0] PASSED
tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1] FAILED

=================================== FAILURES ===================================
_____________ test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1] _____________

dtype = torch.bfloat16, tensor_shape = (1, 512, 256, 8, 8)

    @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("tensor_shape", [(1, 256, 256, 4, 32), (1, 512, 256, 8, 8)])
    def test_DS4Sci_EvoformerAttention(dtype, tensor_shape):
        skip_on_arch(8 if dtype == torch.bfloat16 else 7)
        batch, n, seq_len, heads, dim = tensor_shape
        Q = torch.randn(batch,
                        n,
                        seq_len,
                        heads,
                        dim,
                        dtype=dtype,
                        device=get_accelerator().device_name(),
                        requires_grad=True)
        K = torch.randn(batch,
                        n,
                        seq_len,
                        heads,
                        dim,
                        dtype=dtype,
                        device=get_accelerator().device_name(),
                        requires_grad=True)
        V = torch.randn(batch,
                        n,
                        seq_len,
                        heads,
                        dim,
                        dtype=dtype,
                        device=get_accelerator().device_name(),
                        requires_grad=True)
        mask = torch.randint(0, 2, (batch, n, 1, 1, seq_len), dtype=dtype, device=get_accelerator().device_name())
        mask_bias = 1e9 * (mask - 1)
        bias = torch.randn(batch,
                           1,
                           heads,
                           seq_len,
                           seq_len,
                           dtype=dtype,
                           device=get_accelerator().device_name(),
                           requires_grad=True)
        dummy_out = torch.rand_like(Q, dtype=dtype, device=get_accelerator().device_name())
        ref_out = attention_reference(Q, K, V, [mask_bias, bias], 1 / (dim**0.5))
        ref_out.backward(dummy_out)
        ref_dv, V.grad = V.grad.clone(), None
        ref_dk, K.grad = K.grad.clone(), None
        ref_dq, Q.grad = Q.grad.clone(), None
        ref_db, bias.grad = bias.grad.clone(), None

        out = DS4Sci_EvoformerAttention(Q, K, V, [mask_bias, bias])
        out.backward(dummy_out)
        dv, v_grad = V.grad.clone(), None
        dk, k_grad = K.grad.clone(), None
        dq, q_grad = Q.grad.clone(), None
        db, bias.grad = bias.grad.clone(), None

        eps = 1e-2 if dtype == torch.float16 else 5e-2

        assert torch.max(torch.abs(ref_out - out)).item() < eps, f"out eps: {torch.max(torch.abs(ref_out - out))}"
        assert torch.max(torch.abs(ref_dv - dv)) < eps, f"dv eps: {torch.max(torch.abs(ref_dv - dv))}"
>       assert torch.max(torch.abs(ref_dk - dk)) < eps, f"dk eps: {torch.max(torch.abs(ref_dk - dk))}"
E       AssertionError: dk eps: 0.0625
E       assert tensor(0.0625, device='cuda:0', dtype=torch.bfloat16) < 0.05
E        +  where tensor(0.0625, device='cuda:0', dtype=torch.bfloat16) = <built-in method max of type object at 0x7f450847aaa0>(tensor([[[[[0.0000e+00, 0.0000e+00, 1.0681e-04,  ..., 0.0000e+00,\n            2.4414e-04, 1.0376e-03],\n           [9.7656e-04, 0.0000e+00, 4.8828e-04,  ..., 0.0000e+00,\n            0.0000e+00, 6.1798e-04],\n           [4.8828e-04, 4.8828e-04, 3.0518e-04,  ..., 8.5449e-04,\n            9.7656e-04, 9.7656e-04],\n           ...,\n           [0.0000e+00, 4.8828e-04, 6.1035e-04,  ..., 0.0000e+00,\n            0.0000e+00, 7.9346e-04],\n           [4.8828e-04, 9.7656e-04, 9.7656e-04,  ..., 0.0000e+00,\n            0.0000e+00, 4.8828e-04],\n           [0.0000e+00, 4.8828e-04, 0.0000e+00,  ..., 9.7656e-04,\n            0.0000e+00, 1.4648e-03]],\n\n          [[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           ...,\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e... 
[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           ...,\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00]],\n\n          [[3.6621e-04, 0.0000e+00, 2.4414e-04,  ..., 2.4414e-04,\n            2.4414e-04, 0.0000e+00],\n           [4.8828e-04, 7.3242e-04, 1.2207e-04,  ..., 1.8311e-04,\n            0.0000e+00, 4.8828e-04],\n           [0.0000e+00, 7.3242e-04, 4.8828e-04,  ..., 4.8828e-04,\n            9.7656e-04, 2.2888e-04],\n           ...,\n           [0.0000e+00, 0.0000e+00, 1.9531e-03,  ..., 0.0000e+00,\n            9.7656e-04, 0.0000e+00],\n           [2.4414e-04, 1.2207e-04, 4.8828e-04,  ..., 4.8828e-04,\n            0.0000e+00, 0.0000e+00],\n           [1.9531e-03, 9.7656e-04, 0.0000e+00,  ..., 4.8828e-04,\n            9.7656e-04, 0.0000e+00]]]]], device='cuda:0', dtype=torch.bfloat16))
E        +    where <built-in method max of type object at 0x7f450847aaa0> = torch.max
E        +    and   tensor([[[[[0.0000e+00, 0.0000e+00, 1.0681e-04,  ..., 0.0000e+00,\n            2.4414e-04, 1.0376e-03],\n           [9.7656e-04, 0.0000e+00, 4.8828e-04,  ..., 0.0000e+00,\n            0.0000e+00, 6.1798e-04],\n           [4.8828e-04, 4.8828e-04, 3.0518e-04,  ..., 8.5449e-04,\n            9.7656e-04, 9.7656e-04],\n           ...,\n           [0.0000e+00, 4.8828e-04, 6.1035e-04,  ..., 0.0000e+00,\n            0.0000e+00, 7.9346e-04],\n           [4.8828e-04, 9.7656e-04, 9.7656e-04,  ..., 0.0000e+00,\n            0.0000e+00, 4.8828e-04],\n           [0.0000e+00, 4.8828e-04, 0.0000e+00,  ..., 9.7656e-04,\n            0.0000e+00, 1.4648e-03]],\n\n          [[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           ...,\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e... 
[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           ...,\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00],\n           [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,\n            0.0000e+00, 0.0000e+00]],\n\n          [[3.6621e-04, 0.0000e+00, 2.4414e-04,  ..., 2.4414e-04,\n            2.4414e-04, 0.0000e+00],\n           [4.8828e-04, 7.3242e-04, 1.2207e-04,  ..., 1.8311e-04,\n            0.0000e+00, 4.8828e-04],\n           [0.0000e+00, 7.3242e-04, 4.8828e-04,  ..., 4.8828e-04,\n            9.7656e-04, 2.2888e-04],\n           ...,\n           [0.0000e+00, 0.0000e+00, 1.9531e-03,  ..., 0.0000e+00,\n            9.7656e-04, 0.0000e+00],\n           [2.4414e-04, 1.2207e-04, 4.8828e-04,  ..., 4.8828e-04,\n            0.0000e+00, 0.0000e+00],\n           [1.9531e-03, 9.7656e-04, 0.0000e+00,  ..., 4.8828e-04,\n            9.7656e-04, 0.0000e+00]]]]], device='cuda:0', dtype=torch.bfloat16) = <built-in method abs of type object at 0x7f450847aaa0>((tensor([[[[[-2.7344e-01,  1.5723e-01, -2.1667e-03,  ..., -1.2305e-01,\n            -5.5176e-02,  1.4954e-02],\n           [ 1.5332e-01,  1.3086e-01,  6.7383e-02,  ..., -1.0449e-01,\n            -6.6895e-02,  7.6294e-04],\n           [ 9.6191e-02, -4.5898e-02, -5.3406e-03,  ...,  2.6733e-02,\n             2.3828e-01,  1.6211e-01],\n           ...,\n           [ 1.8359e-01,  9.4727e-02, -2.0508e-02,  ...,  2.5635e-02,\n             1.5723e-01, -5.8899e-03],\n           [-1.2207e-01,  1.5527e-01, -5.1025e-02,  ...,  3.2617e-01,\n             3.4766e-01, -1.2305e-01],\n           [ 3.0078e-01,  3.6865e-02, -3.4570e-01,  ...,  2.2827e-02,\n            -3.7500e-01,  1.9409e-02]],\n\n          [[ 0.0000e+00,  
0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           ...,\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n    ...0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           ...,\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00]],\n\n          [[-1.2756e-02, -8.0078e-02, -3.9551e-02,  ..., -1.8677e-02,\n            -3.2715e-02, -9.5703e-02],\n           [-8.9355e-02,  1.9775e-02, -2.9663e-02,  ...,  1.0010e-02,\n             1.3281e-01,  4.3213e-02],\n           [ 2.0020e-01,  4.8340e-02, -5.3223e-02,  ...,  8.3984e-02,\n             7.6172e-02, -3.1738e-03],\n           ...,\n           [ 5.3516e-01,  3.2812e-01,  3.7109e-01,  ...,  8.9844e-01,\n             1.7871e-01,  6.0547e-01],\n           [ 1.7334e-02,  2.6489e-02,  9.8145e-02,  ...,  3.4668e-02,\n            -3.2812e-01,  3.0273e-01],\n           [-3.8281e-01, -2.1582e-01, -2.1094e-01,  ..., -7.8613e-02,\n             2.1484e-01,  1.1621e-01]]]]], device='cuda:0',\n       dtype=torch.bfloat16) - tensor([[[[[-2.7344e-01,  1.5723e-01, -2.0599e-03,  ..., -1.2305e-01,\n            -5.5420e-02,  1.3916e-02],\n           [ 1.5234e-01,  1.3086e-01,  6.6895e-02,  ..., -1.0449e-01,\n            -6.6895e-02,  1.3809e-03],\n           [ 9.5703e-02, -4.5410e-02, 
-5.6458e-03,  ...,  2.5879e-02,\n             2.3926e-01,  1.6113e-01],\n           ...,\n           [ 1.8359e-01,  9.4238e-02, -2.1118e-02,  ...,  2.5635e-02,\n             1.5723e-01, -6.6833e-03],\n           [-1.2256e-01,  1.5430e-01, -5.0049e-02,  ...,  3.2617e-01,\n             3.4766e-01, -1.2256e-01],\n           [ 3.0078e-01,  3.6377e-02, -3.4570e-01,  ...,  2.3804e-02,\n            -3.7500e-01,  1.7944e-02]],\n\n          [[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           ...,\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n    ...0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           ...,\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00],\n           [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,\n             0.0000e+00,  0.0000e+00]],\n\n          [[-1.3123e-02, -8.0078e-02, -3.9307e-02,  ..., -1.8433e-02,\n            -3.2471e-02, -9.5703e-02],\n           [-8.9844e-02,  1.9043e-02, -2.9785e-02,  ...,  9.8267e-03,\n             1.3281e-01,  4.3701e-02],\n           [ 2.0020e-01,  4.9072e-02, -5.3711e-02,  ...,  8.4473e-02,\n             7.7148e-02, -2.9449e-03],\n           ...,\n           [ 5.3516e-01,  3.2812e-01,  3.7305e-01,  ...,  8.9844e-01,\n             1.7773e-01,  6.0547e-01],\n           [ 1.7578e-02,  2.6611e-02,  9.7656e-02,  ...,  3.5156e-02,\n            
-3.2812e-01,  3.0273e-01],\n           [-3.8086e-01, -2.1680e-01, -2.1094e-01,  ..., -7.9102e-02,\n             2.1387e-01,  1.1621e-01]]]]], device='cuda:0',\n       dtype=torch.bfloat16)))
E        +      where <built-in method abs of type object at 0x7f450847aaa0> = torch.abs

tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py:101: AssertionError
=============================== warnings summary ===============================
../../../usr/lib/python3/dist-packages/pkg_resources/_vendor/pyparsing.py:87
  /usr/lib/python3/dist-packages/pkg_resources/_vendor/pyparsing.py:87: DeprecationWarning: module 'sre_constants' is deprecated
    import sre_constants

../../../usr/lib/python3/dist-packages/pytz/__init__.py:31
  /usr/lib/python3/dist-packages/pytz/__init__.py:31: DeprecationWarning: invalid escape sequence '\s'
    match = re.match("^#\s*version\s*([0-9a-z]*)\s*$", line)

unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape0-dtype0]
  /home/paperspace/DeepSpeed/tests/conftest.py:47: UserWarning: Running test without verifying torch version, please provide an expected torch version with --torch_ver
    warnings.warn(

unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape0-dtype0]
  /home/paperspace/DeepSpeed/tests/conftest.py:54: UserWarning: Running test without verifying cuda version, please provide an expected cuda version with --cuda_ver
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================== slowest durations ===============================
309.09s call     unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape0-dtype0]

(11 durations < 1s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py::test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1] - AssertionError: dk eps: 0.0625
============= 1 failed, 3 passed, 4 warnings in 312.62s (0:05:12) ==============
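
A note on the reported value: 0.0625 is 2^-4, which is exactly the spacing between adjacent bfloat16 values for magnitudes in [8, 16). One possible reading, not confirmed anywhere in this log, is that a single large dk element differs from the reference by one rounding step, which is already above the absolute threshold of 5e-2 used for bfloat16. The sketch below computes the bfloat16 spacing in plain Python; `bf16_ulp` is a hypothetical helper name and is not part of DeepSpeed or PyTorch.

```python
import math


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values around a normal, nonzero x.

    Hypothetical helper for illustration only.
    """
    # bfloat16 keeps 8 significant bits (1 implicit + 7 stored mantissa bits).
    # math.frexp returns (m, e) with abs(x) = m * 2**e and 0.5 <= m < 1,
    # so abs(x) lies in [2**(e-1), 2**e) and the spacing there is 2**(e-1-7).
    _, e = math.frexp(abs(x))
    return 2.0 ** (e - 8)


if __name__ == "__main__":
    bf16_eps = 5e-2  # absolute tolerance the test uses for bfloat16
    for magnitude in (0.3, 1.0, 4.0, 8.0, 15.9):
        step = bf16_ulp(magnitude)
        print(f"|x| = {magnitude:5.1f}  bf16 step = {step:.7f}  exceeds eps: {step > bf16_eps}")
```

If this reading is right, a fixed absolute tolerance cannot hold once gradient magnitudes reach 8, regardless of the kernel. Whether the offending element really is that large would need to be checked by printing it alongside the reference value.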

arogozhnikov commented 4 months ago

I'm observing the same issue on an A100 (the second half of the channel components is incorrect) when cross-checking against PyTorch's SDPA (scaled_dot_product_attention).

arogozhnikov commented 4 months ago

Correction:

  1. My version was compiled with the default setting of K <= 64, but I used 128 channels per head.
  2. I tested only the forward pass, which does not complain about this; the backward pass does.
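
Since the forward pass reportedly accepts an oversized head dimension silently, a caller-side guard may help catch the mismatch early. The following is a minimal sketch, assuming the (batch, n, seq_len, heads, dim) layout used in the test above and the K <= 64 default mentioned in the correction; `check_head_dim` and `compiled_max_k` are hypothetical names, and how the limit is actually configured in a DeepSpeed build is not shown in this thread.

```python
from typing import Sequence


def check_head_dim(shape: Sequence[int], compiled_max_k: int = 64) -> None:
    """Raise if the per-head channel count exceeds the compiled kernel limit.

    Hypothetical helper: `compiled_max_k` must be set by the caller to match
    the limit their kernel build was compiled with (64 is assumed here).
    """
    if len(shape) != 5:
        raise ValueError(f"expected a 5-D (batch, n, seq_len, heads, dim) shape, got {tuple(shape)}")
    head_dim = shape[-1]
    if head_dim > compiled_max_k:
        raise ValueError(
            f"head dim {head_dim} exceeds the compiled limit K <= {compiled_max_k}; "
            "results (at least in the backward pass) may be incorrect"
        )


if __name__ == "__main__":
    check_head_dim((1, 512, 256, 8, 8))  # a shape from the test: passes
    try:
        check_head_dim((1, 512, 256, 8, 128))  # 128 channels per head
    except ValueError as err:
        print(err)
```

Calling the guard on Q's shape before invoking the attention op would turn a silent numerical mismatch into an explicit error.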