[Question]: Question about KV-cache storage

Describe the issue

Thank you for the amazing work!

1. Does the model keep the whole KV cache from prefilling and generation on the device? If so, how can the device hold the KV cache for 1M tokens? If not, how did you reduce the overhead of moving KV values from host to device and back?
2. What exactly does "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements" mean? I ran "pip install minference" without FlashAttention-2 and Triton == 2.1.0 installed, and it failed with "ERROR: Failed building wheel for pycuda".

Hi @DerrickYLJ, thanks for your support in MInference.

1) MInference 1.0 focuses on speeding up the pre-filling stage of long-context LLM inference, reducing the pre-filling time for 1M tokens on an A100 from 30 minutes to 3 minutes. It does not address KV cache storage. Future work on MInference will include solutions to reduce KV cache memory overhead.

However, we have made some system optimizations that allow 1M-token pre-filling to run on a single A100; details are in Appendix C.3. In our demo video, to run 1M-token inference on a single A100, we offload the KV cache to the CPU, as shown in this code.

Additionally, several studies focus on KV cache compression (such as H2O and SnapKV) and KV cache quantization (such as KIVI). You might consider using these solutions.
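For a sense of scale on the KV cache memory discussed above, here is a rough back-of-the-envelope estimate. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumption chosen for illustration, roughly a Llama-3-8B-style configuration; it is not taken from the MInference code, and the helper name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of an uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed (illustrative) model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{total / 2**30:.1f} GiB")  # prints "122.1 GiB"
```

Under these assumptions the cache alone is larger than the 80 GB of the largest A100, before counting model weights, which is why the cache has to be offloaded, compressed, or quantized at this context length.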

2) Our pip package depends on flash-attn and triton. It looks like the failure you're seeing comes from pycuda. You can try the following steps:

   1) Check whether pycuda is installed successfully.
   2) Build MInference from source:

      git clone https://github.com/microsoft/MInference
      cd MInference
      pip install -e .

   3) If the issue persists, please provide details including OS, Python version, CUDA version, PyTorch version, and the error log.

Thanks again for your interest and support!

Thank you very much for your reply!

As for 1., I read through the function "minference_kv_cache_cpu_forward" but am still unsure how exactly MInference offloads the KV cache to the CPU, implementation-wise.

As for 2., building pycuda still fails when I run "pip install -e .". Details:

OS:
   Icon name: computer-server
   Chassis: server
   Machine ID: 2305030051f947988b5faecaf45ece43
   Boot ID: 00739920e39a457999c5ae3b99f47675
   Operating System: Springdale Open Enterprise Linux 8.6 (Modena)
   CPE OS Name: cpe:/o:springdale:enterprise_linux:8.6:GA
   Kernel: Linux 4.18.0-372.32.1.el8_6.x86_64
   Architecture: x86-64
CUDA version: 12.4
PyTorch version: 2.3.1
Python version: 3.8.12

Error Log:

from bpl-subset/bpl_subset/boost/python/converter/arg_to_python_base.hpp:7,
                   from bpl-subset/bpl_subset/libs/python/src/converter/arg_to_python_base.cpp:6:
  bpl-subset/bpl_subset/boost/python/detail/wrap_python.hpp:50:11: fatal error: pyconfig.h: No such file or directory
   # include <pyconfig.h>
             ^~~~~~~~~~~~
  compilation terminated.
  /tmp/pip-build-env-wusrfsd3/overlay/lib/python3.8/site-packages/setuptools/command/build_py.py:215: _Warning: Package 'pycuda.cuda' is absent from the `packages` configuration.
  !!

          ********************************************************************************
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'pycuda.cuda' as an importable package[^1],
          but it is absent from setuptools' `packages` configuration.

          This leads to an ambiguous overall configuration. If you want to distribute this
          package, please make sure that 'pycuda.cuda' is explicitly added
          to the `packages` configuration field.

          Alternatively, you can also rely on setuptools' discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).

          You can read more about "package discovery" on setuptools documentation page:

          - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html

          If you don't want 'pycuda.cuda' to be distributed and are
          already explicitly excluding 'pycuda.cuda' via
          `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
          you can try to use `exclude_package_data`, or `include-package-data=False` in
          combination with a more fine grained `package-data` configuration.

          You can read more about "package data files" on setuptools documentation page:

          - https://setuptools.pypa.io/en/latest/userguide/datafiles.html

          [^1]: For Python, any directory (with suitable naming) can be imported,
                even if it does not contain any `.py` files.
                On the other hand, currently there is no concept of package data
                directory, all directories are treated like packages.
          ********************************************************************************

  !!
    check.warn(importable)
  error: command '/usr/bin/gcc' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pycuda
Failed to build pycuda
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pycuda)
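The first fatal error in the log above is "pyconfig.h: No such file or directory", which generally means the CPython development headers are not available to the interpreter running the build. A minimal check, using only the standard library, is sketched below; the function name is hypothetical, and it should be run with the same Python that runs pip.

```python
import os
import sysconfig


def find_pyconfig():
    """Return the path where this interpreter expects pyconfig.h, and whether it exists."""
    path = sysconfig.get_config_h_filename()
    return path, os.path.isfile(path)


path, exists = find_pyconfig()
print("expected location:", path)
print("pyconfig.h present:", exists)
```

If the header is reported missing, installing the distribution's Python development package for that interpreter (the package name varies by distribution and Python version) or using a Python build that ships its headers would be the usual next step before retrying the pycuda build.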

Hi @DerrickYLJ, thank you for the information. It appears that the issue is related to PyCUDA. We will remove the dependency on PyCUDA in the next version.

Could you please answer my first question by just briefly explaining the logic of offloading kv-cache to CPU?

Sure, the logic of "kv_cache_cpu" is very simple. When it is enabled, the KV cache is kept in CPU memory. During the decoding phase, the KV cache that is needed is transferred to GPU memory. This is only a preliminary implementation: because our current solution optimizes only the prefilling stage, and existing KV cache compression methods generally perform poorly, we implemented this offloading for experimental and demonstration purposes. Although it adds latency, it is still faster than recomputation.
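The offloading logic described above can be sketched roughly as follows in PyTorch. This is a toy illustration under stated assumptions, not the code in "minference_kv_cache_cpu_forward": the class and function names are hypothetical, the tensor layout [batch, heads, seq, head_dim] is assumed, and the cache is grown with a plain concatenation for simplicity.

```python
import torch
import torch.nn.functional as F


class OffloadedKVCache:
    """Toy per-layer KV cache that lives in CPU memory between uses."""

    def __init__(self, compute_device):
        self.compute_device = torch.device(compute_device)
        self.keys = {}    # layer index -> CPU tensor [batch, heads, seq, head_dim]
        self.values = {}  # same layout as keys

    def append(self, layer, k, v):
        """Move new K/V to the CPU and concatenate along the sequence axis."""
        k_cpu, v_cpu = k.to("cpu"), v.to("cpu")
        if layer in self.keys:
            k_cpu = torch.cat([self.keys[layer], k_cpu], dim=2)
            v_cpu = torch.cat([self.values[layer], v_cpu], dim=2)
        self.keys[layer], self.values[layer] = k_cpu, v_cpu

    def fetch(self, layer):
        """Copy one layer's K/V to the compute device for a decoding step."""
        return (self.keys[layer].to(self.compute_device),
                self.values[layer].to(self.compute_device))


def decode_step(cache, layer, q, k_new, v_new):
    """Attend one new query token over the offloaded cache of a single layer."""
    cache.append(layer, k_new, v_new)
    k, v = cache.fetch(layer)  # only this layer's K/V is copied to the device
    return F.scaled_dot_product_attention(q, k, v)


# Falls back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
cache = OffloadedKVCache(device)

# "Prefill": 16 tokens for layer 0, with 4 heads of dimension 8 (arbitrary toy sizes).
cache.append(0, torch.randn(1, 4, 16, 8, device=device),
             torch.randn(1, 4, 16, 8, device=device))

# One decoding step: a single new token.
q = torch.randn(1, 4, 1, 8, device=device)
out = decode_step(cache, 0, q,
                  torch.randn(1, 4, 1, 8, device=device),
                  torch.randn(1, 4, 1, 8, device=device))
print(out.shape, cache.keys[0].shape, cache.keys[0].device)
```

The trade-off matches the description above: each decoding step pays a host-to-device copy per layer, which adds latency, but avoids recomputing keys and values for the whole context.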


[Question]: Question about KV-cache storage #20
