Closed microhu closed 6 months ago
Thank you for your support.
We currently have no plans to provide a multi-GPU inference script. The goal of this repository is to offer scripts for reproducing the results of our paper and to provide an efficient implementation of streaming attention.
The simple patch code we are currently using can be found in inf_llm/utils/patch.py.
Attention for different layers can run on different devices, and you can modify the patch code to achieve simple block-level model parallelism.
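To illustrate the block-level model parallelism mentioned above, here is a minimal sketch of one way to split a model's layers into contiguous blocks across GPUs. Note that `build_device_map` is a hypothetical helper, not part of inf_llm; the commented `.to(...)` calls show where a patch similar to inf_llm/utils/patch.py would move each layer and its hidden states.

```python
def build_device_map(num_layers, num_gpus):
    """Assign contiguous blocks of layers to GPUs.

    Hypothetical helper (not part of inf_llm): layer i runs on
    device index device_map[i].
    """
    base, extra = divmod(num_layers, num_gpus)
    device_map = []
    for gpu in range(num_gpus):
        # Earlier GPUs take one extra layer when the split is uneven.
        count = base + (1 if gpu < extra else 0)
        device_map.extend([gpu] * count)
    return device_map

# In a patch, each layer would be moved to its assigned device, and
# hidden states transferred at block boundaries, roughly:
#
#   for i, layer in enumerate(model.layers):
#       layer.to(f"cuda:{device_map[i]}")
#
#   # during forward, before running layer i:
#   if i > 0 and device_map[i] != device_map[i - 1]:
#       hidden_states = hidden_states.to(f"cuda:{device_map[i]}")

print(build_device_map(40, 4))  # e.g. a 40-layer 13B model over 4 GPUs
```

This keeps each GPU's block contiguous, so hidden states only cross devices at block boundaries rather than on every layer.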
We encourage users to utilize the implementation in inf_llm/attention within other frameworks.
Great work! When I try to run a 13B model with a context length of more than 100K on the passkey retrieval task, it throws an OOM error. Will the code itself support multi-GPU inference?