Closed microhu closed 6 months ago
Thank you for your support.
We currently have no plans to provide a multi-GPU inference script. The goal of this repository is to offer scripts for reproducing the results of our paper and to provide an efficient implementation of streaming attention.
The simple patch code we are currently using can be found in inf_llm/utils/patch.py.
Attention for different layers can run on different devices, and you can modify the patch code to achieve simple block-level model parallelism.
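To illustrate the block-level model parallelism mentioned above, here is a minimal sketch of one way to split a model's layers into contiguous blocks across GPUs. Note that `build_device_map` is a hypothetical helper, not part of inf_llm; the commented `.to(...)` calls show where a patch similar to inf_llm/utils/patch.py would move each layer and its hidden states.

```python
def build_device_map(num_layers, num_gpus):
    """Assign contiguous blocks of layers to GPUs.

    Hypothetical helper (not part of inf_llm): layer i runs on
    device index device_map[i].
    """
    base, extra = divmod(num_layers, num_gpus)
    device_map = []
    for gpu in range(num_gpus):
        # Earlier GPUs take one extra layer when the split is uneven.
        count = base + (1 if gpu < extra else 0)
        device_map.extend([gpu] * count)
    return device_map

# In a patch, each layer would be moved to its assigned device, and
# hidden states transferred at block boundaries, roughly:
#
#   for i, layer in enumerate(model.layers):
#       layer.to(f"cuda:{device_map[i]}")
#
#   # during forward, before running layer i:
#   if i > 0 and device_map[i] != device_map[i - 1]:
#       hidden_states = hidden_states.to(f"cuda:{device_map[i]}")

print(build_device_map(40, 4))  # e.g. a 40-layer 13B model over 4 GPUs
```

This keeps each GPU's block contiguous, so hidden states only cross devices at block boundaries rather than on every layer.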
We encourage users to utilize the implementation in inf_llm/attention within other frameworks.
Great work! When I try to run a 13B model with a context length of more than 100K on the passkey retrieval task, it throws an OOM error. Will the code itself support multi-GPU inference?