thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

OOM issue #23

Closed: microhu closed this issue 6 months ago

microhu commented 6 months ago

Great work! When I try to run a 13B model with more than 100K context length on the passkey retrieval task, it throws an OOM error. Does the code itself support multi-GPU inference?

guyan364 commented 6 months ago

Thank you for your support.

We currently have no plans to provide a multi-GPU inference script. The goal of this repository is to offer scripts for reproducing the results of our paper and to provide efficient implementations of streaming attention.

The simple patch code we currently use can be found in inf_llm/utils/patch.py. Attention for different layers can run on different devices, so you can modify the patch code to achieve simple block-level model parallelism (see the sketch below).
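As a rough, untested illustration of that kind of block-level parallelism (this is not the repo's patch code; the module paths `model.model.layers`, `embed_tokens`, `norm`, and `lm_head` assume a Llama-family HuggingFace model and may need adjusting):

```python
# Sketch: spread decoder blocks across GPUs and move each block's inputs to
# that block's device. Assumes a Llama-style transformers model; not tested
# together with the InfLLM patch.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16
)

num_gpus = torch.cuda.device_count()
layers = model.model.layers
per_gpu = (len(layers) + num_gpus - 1) // num_gpus

# Embedding on the first GPU, final norm and LM head on the last,
# decoder blocks spread evenly in between.
model.model.embed_tokens.to("cuda:0")
model.model.norm.to(f"cuda:{num_gpus - 1}")
model.lm_head.to(f"cuda:{num_gpus - 1}")
for i, layer in enumerate(layers):
    layer.to(f"cuda:{i // per_gpu}")

def _move(x, device):
    # Recursively move tensors (and tuples of tensors) to the target device.
    if torch.is_tensor(x):
        return x.to(device)
    if isinstance(x, tuple):
        return tuple(_move(t, device) for t in x)
    return x

def _to_layer_device(module, args, kwargs):
    # Forward pre-hook: activations follow the layer placement. Only one GPU
    # is active at a time, so this trades speed for memory.
    device = next(module.parameters()).device
    return _move(args, device), {k: _move(v, device) for k, v in kwargs.items()}

for layer in layers:
    layer.register_forward_pre_hook(_to_layer_device, with_kwargs=True)
```

Layer-wise sharding of this kind is also roughly what `device_map="auto"` in transformers (via accelerate) does automatically, so that may be worth trying as a workaround for the 13B OOM.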

We encourage users to utilize the implementation in inf_llm/attention within other frameworks.
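For intuition about the streaming pattern itself, here is a toy, self-contained sketch of causal sliding-window attention over raw q/k/v tensors. It omits the retrieved context memory blocks that InfLLM adds, materializes the full score matrix, and is not the optimized implementation in inf_llm/attention:

```python
# Toy causal sliding-window ("streaming") attention, for intuition only.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    # q, k, v: (batch, heads, seq_len, head_dim); each query attends to at
    # most the `window` most recent keys, and never to future positions.
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    idx = torch.arange(seq_len, device=q.device)
    causal = idx[None, :] <= idx[:, None]              # key index <= query index
    in_window = idx[:, None] - idx[None, :] < window   # key within the local window
    scores = scores.masked_fill(~(causal & in_window), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024, 64)
out = sliding_window_attention(q, k, v, window=256)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```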