pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Expected memory requirement for executorch llama3.1 models? #4501

Open l3utterfly opened 3 months ago

l3utterfly commented 3 months ago

🐛 Describe the bug

I compiled the Llama3.1 models to executorch: https://huggingface.co/l3utterfly/Meta-Llama-3.1-8B-Instruct-executorch

They seem to use a lot of extra memory during inference (unlike the Llama3 versions). During compilation I get the output Required memory for activation in bytes: [0, 2052125776, 536870912] for the 2048 seq length compilation, which is around 2.5GB and is what I expected. But when running the model, it uses around 10GB of RAM.
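For context, a quick sanity check of how I read those numbers (plain arithmetic on the logged arena sizes, nothing ExecuTorch-specific):

# The two non-zero entries are the per-arena sizes from the export log above.
arena_sizes = [0, 2052125776, 536870912]
total_bytes = sum(arena_sizes)
print(f"planned activation memory: {total_bytes / 2**30:.2f} GiB")  # ~2.41 GiB
# The observed runtime usage is ~10GB, i.e. roughly 7.5GB more than the planned activations.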

Is this expected? I'm wondering if I'm doing anything wrong during compilation or inference?

(You can compare with any of the llama3 executorch models in my repo: https://huggingface.co/l3utterfly/Llama-3-8B-Stroganoff-executorch)

Versions

1114539fd14b84450d6993a96c37ae176ff2144e

iseeyuan commented 3 months ago

@l3utterfly thanks for reporting this issue! It's likely related to the memory planning algorithm. We are investigating it and will get back to you when we find anything.

JacobSzwejbka commented 3 months ago

(unlike the Llama3 versions).

How much memory does Llama3 consume with that max_seq_len?

JacobSzwejbka commented 3 months ago

I'm kind of surprised the array is 3 long. I thought by default we only had 1 mem_id, so the list would be [0, N]. Is this a custom plan, @l3utterfly?

l3utterfly commented 3 months ago

Yes, I'm using a custom memory planner, which simply lifts all the mutable buffers to a separate mem_id:

# Import paths as of the pinned commit below; they may differ across executorch versions.
from typing import Optional

import torch
from executorch.exir.memory_planning import _is_mutable_buffer
from executorch.exir.passes.memory_planning_pass import MemoryPlanningPass
from torch.export import ExportGraphSignature
from torch.fx.passes.infra.pass_base import PassResult

class KVMemIdMemoryPlanningPass(MemoryPlanningPass):
    def run(self, graph_module: torch.fx.GraphModule, graph_signature: Optional[ExportGraphSignature]) -> PassResult:
        for subgm in graph_module.modules():
            if not isinstance(subgm, torch.fx.GraphModule):
                continue
            for node in subgm.graph.nodes:
                # Route every mutable buffer (the KV cache) into its own arena (mem_id 2).
                if _is_mutable_buffer(node, graph_signature):
                    print(f"Mutable buffer found: {node}")
                    node.meta["spec"].mem_id = 2

        return super().run(graph_module, graph_signature)
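For reference, this is roughly how the pass is plugged into lowering (a sketch; `model` and `example_inputs` are placeholders, and the entry points may vary slightly by executorch version):

from executorch.exir import ExecutorchBackendConfig, to_edge
from torch.export import export

# Export and lower with the custom memory planning pass; placeholder model/inputs.
edge = to_edge(export(model, example_inputs))
et_program = edge.to_executorch(
    ExecutorchBackendConfig(memory_planning_pass=KVMemIdMemoryPlanningPass())
)
with open("llama3_1.pte", "wb") as f:
    f.write(et_program.buffer)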

The original Llama3 models show the exact same memory layout: [0, 2052125776, 536870912], and they use the expected amount of memory.

Compiling a PTE file without my custom memory plan results in an activation memory requirement of [0, 2588996688], which is simply the sum of my two memory arenas.

So to recap: Llama3 and Llama3.1 produce the same planned activation memory (~2.5GB, split across my two arenas), but only the Llama3.1 PTE uses ~10GB of RAM at runtime.

JacobSzwejbka commented 3 months ago

Ahh hmm. Are you using any delegates? Can you post the script you use to lower the model? @l3utterfly

Off the top of my head, the areas where memory could be consumed are:

  1. The model weights. These won't be included in the planned buffers.
  2. The method allocator. I would expect the total allocation from this to be small, but we do have benchmarking APIs that would let you double-check.
  3. Delegate internals: things like CoreML, XNNPACK, Vulkan; any sort of input to exir.to_backend(), really (see the sketch after this list).
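For illustration, the kind of delegate lowering meant in point 3; a minimal sketch assuming an `edge` program produced by `to_edge(export(...))` and the XNNPACK partitioner:

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Subgraphs handed to the delegate allocate their internal buffers at runtime,
# outside the memory-planned arenas reported at export time.
edge = edge.to_backend(XnnpackPartitioner())
et_program = edge.to_executorch()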

I think our benchmarking suite should help locate where the memory is going, and from there we can investigate more. I think this is the link: https://pytorch.org/executorch/stable/sdk-overview.html but cc @tarun292 just in case.
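For what it's worth, a minimal sketch of the SDK inspection flow from that page (the ETDump has to come from a runtime built with ETDump enabled; the file names here are placeholders):

from executorch.sdk import Inspector

# Correlates the on-device ETDump with the ETRecord generated at export time
# and prints a per-event table (operator and delegate events).
inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
inspector.print_data_tabular()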