Open l3utterfly opened 3 months ago
@l3utterfly thanks for reporting this issue! It's likely related to the memory planning algorithm. We are investigating it and will get back to you when we find anything.
> (unlike the Llama3 versions).

How much memory does Llama3 consume with that max_seq_len?
I'm kind of surprised the array is 3 long. I thought by default we only had 1 mem_id, so the list would be [0, N]. Is this a custom plan, @l3utterfly?
Yes, I'm using a custom memory planner, which simply lifts all the mutable buffers to a separate mem_id:
# Import paths may differ slightly across ExecuTorch versions.
from typing import Optional

import torch
from executorch.exir.memory_planning import _is_mutable_buffer
from executorch.exir.passes import MemoryPlanningPass
from torch.export import ExportGraphSignature
from torch.fx.passes.infra.pass_base import PassResult

class KVMemIdMemoryPlanningPass(MemoryPlanningPass):
    def run(self, graph_module: torch.fx.GraphModule, graph_signature: Optional[ExportGraphSignature]) -> PassResult:
        # Move every mutable buffer (the KV cache) into its own arena (mem_id 2),
        # then let the default planner lay everything out.
        for subgm in graph_module.modules():
            if not isinstance(subgm, torch.fx.GraphModule):
                continue
            for node in subgm.graph.nodes:
                if _is_mutable_buffer(node, graph_signature):
                    print(f"Mutable buffer found: {node}")
                    node.meta["spec"].mem_id = 2
        return super().run(graph_module, graph_signature)
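For reference, this is roughly how a pass like this gets wired into the lowering step. A minimal sketch, assuming the `to_edge`/`to_executorch` flow; `model`, `example_inputs`, and the output file name are placeholders:

```python
import torch
from executorch.exir import ExecutorchBackendConfig, to_edge

# Export the eager model, lower it to edge dialect, then emit the ExecuTorch
# program with the custom memory planning pass so mutable (KV-cache) buffers
# are placed in their own arena (mem_id 2).
exported = torch.export.export(model, example_inputs)  # placeholders
edge = to_edge(exported)
et_program = edge.to_executorch(
    ExecutorchBackendConfig(
        memory_planning_pass=KVMemIdMemoryPlanningPass(),
    )
)
with open("llama3_1.pte", "wb") as f:
    f.write(et_program.buffer)
```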
The original Llama3 models show the exact same memory layout: [0, 2052125776, 536870912], and they use the expected amount of memory.
Compiling a PTE file without my custom memory plan results in a [0, 2588996688] activation memory requirement, which is simply the sum of my two memory arenas.
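A quick sanity check of that arithmetic (just illustrative, not part of the export script):

```python
# Planned activation memory with the custom plan: arena 1 (activations)
# plus arena 2 (mutable KV-cache buffers), versus the default single arena.
custom_plan = [0, 2052125776, 536870912]
default_plan = [0, 2588996688]

assert sum(custom_plan) == sum(default_plan)   # 2588996688 bytes either way
print(sum(default_plan) / (1024 ** 3))         # ~2.41 GiB of planned activations
```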
So to recap: with or without my custom memory plan, the planned activation memory is around 2.5GB, yet at runtime the model uses around 10GB of RAM.
Ahh hmm. Are you using any delegates? Can you post the script you use to lower the model? @l3utterfly
Off the top of my head, there are a few areas where that memory could be consumed.
I think our benchmarking suite should help locate where the memory is located and from there we could investigate more. I think this is the link https://pytorch.org/executorch/stable/sdk-overview.html but cc @tarun292 just in case.
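Something along these lines should pull the per-operator data, assuming the SDK's Inspector API and that an ETRecord was saved at export time and an ETDump collected from a run (both file names below are placeholders):

```python
from executorch.sdk import Inspector

# Correlate the runtime profile (ETDump) with the exported program (ETRecord)
# and print the per-event statistics as a table.
inspector = Inspector(
    etdump_path="etdump.etdp",
    etrecord="etrecord.bin",
)
inspector.print_data_tabular()
```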
🐛 Describe the bug
I compiled the Llama3.1 models to executorch: https://huggingface.co/l3utterfly/Meta-Llama-3.1-8B-Instruct-executorch
It seems they use a lot of extra memory during inference (unlike the Llama3 versions). During compilation, I get the output:
Required memory for activation in bytes: [0, 2052125776, 536870912]
for the 2048 seq length compilation. That is around 2.5GB, which is expected. But when running the model, it uses around 10GB of RAM. Is this expected? I'm wondering if I'm doing anything wrong during compilation or inference.
(You can compare with any of the llama3 executorch models in my repo: https://huggingface.co/l3utterfly/Llama-3-8B-Stroganoff-executorch)
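For what it's worth, a rough way to see where the ~10GB shows up is to watch the process RSS while loading and running the .pte through the Python bindings. A sketch, assuming a build with pybindings and psutil installed; the file name and input shapes are placeholders and depend on how the model was exported:

```python
import os
import psutil
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

def rss_gib() -> float:
    # Resident set size of this process, in GiB.
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)

print(f"before load: {rss_gib():.2f} GiB")
module = _load_for_executorch("llama3_1_2048.pte")  # placeholder path
print(f"after load:  {rss_gib():.2f} GiB")

# Placeholder inputs: one token plus the current KV-cache position.
tokens = torch.zeros((1, 1), dtype=torch.long)
pos = torch.tensor([0], dtype=torch.long)
module.forward([tokens, pos])
print(f"after first forward: {rss_gib():.2f} GiB")
```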
Versions
1114539fd14b84450d6993a96c37ae176ff2144e