pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [torch.export][llama2] Accuracy issues with llama model #2964

Closed: peri044 closed this issue 1 month ago

peri044 commented 3 months ago

Bug Description

The outputs of TRT compilation do not match PyTorch outputs for the llama2 model. These are the causes:

1) FP16 precision: layernorm warns that FP16 precision is not enough, so we need to compile in FP32 precision.

2) Rotation (rotary position embedding): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L152-L156 This block leads to an output mismatch (see the sketch after this list).

3) Adding the attention mask: https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/models/llama/modeling_llama.py#L347-L349 These lines also cause an output mismatch (also reproduced in the sketch below).
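For reference, here is a minimal standalone sketch (not from the issue) that isolates the two transformers blocks referenced in points 2 and 3, so they can be exported and compiled on their own to narrow down the mismatch. The module name, tensor names, and shapes are illustrative; the original code lives in modeling_llama.py at the linked lines.

import torch

def rotate_half(x):
    # Rotate half of the hidden dims, as done in modeling_llama.py.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

class RotaryAndMask(torch.nn.Module):
    def forward(self, q, k, cos, sin, attn_weights, attention_mask):
        # (2) rotary position embedding applied to the query/key states
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        # (3) causal attention mask added to the attention weights
        causal_mask = attention_mask[:, :, :, : k.shape[-2]]
        attn_weights = attn_weights + causal_mask
        return q_embed, k_embed, attn_weights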

Compiling with dynamic shapes and FP32 also leads to high memory usage. A compilation sketch is shown below.
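Below is a minimal sketch of the kind of compile-and-compare run that exposes both the accuracy gap and the memory usage. The checkpoint name, shape bounds, prompt, and output handling are assumptions for illustration, not taken from the issue.

import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer("The capital of France is", return_tensors="pt")["input_ids"].cuda()

# Export with a dynamic sequence-length dimension.
seq_len = torch.export.Dim("seq_len", min=2, max=1024)
ep = torch.export.export(model, (input_ids,), dynamic_shapes={"input_ids": {1: seq_len}})

# Compile in FP32; FP16 triggers the layernorm precision warning and a larger error.
trt_model = torch_tensorrt.dynamo.compile(
    ep,
    inputs=[input_ids],
    enabled_precisions={torch.float32},
)

with torch.no_grad():
    ref = model(input_ids).logits
    out = trt_model(input_ids)
# Depending on how torch.export flattens the HF ModelOutput, `out` may be a
# tuple or dict; extract the logits tensor from it before comparing, e.g.
# print(torch.max(torch.abs(ref - trt_logits)))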


peri044 commented 3 months ago

If I enable the strongly typed flag, I hit the following error.

ERROR:torch_tensorrt [TensorRT Conversion Context]:10: setOutputType cannot be called for a strongly typed network.
ERROR:torch_tensorrt [TensorRT Conversion Context]:1: [network.cpp::setOutputType::757] Error Code 1: Internal Error (Invalid use of API - See recorded error for details.)

We use set_output_type for casting throughout our codebase, which will be a problem. A minimal sketch of the conflict is shown below.
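For context, here is a minimal TensorRT Python sketch (not from the Torch-TensorRT codebase) of the conflict: once a network is created with the STRONGLY_TYPED flag, any per-layer output cast via set_output_type is rejected with the error above. The layer used here is arbitrary and only for illustration; it assumes a TensorRT version that exposes the STRONGLY_TYPED creation flag.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Strongly typed network: layer precisions are inferred from input/weight dtypes.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)

x = network.add_input("x", trt.float32, (1, 16))
layer = network.add_elementwise(x, x, trt.ElementWiseOperation.SUM)

# With a strongly typed network this call fails:
# "setOutputType cannot be called for a strongly typed network."
layer.set_output_type(0, trt.float16)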

peri044 commented 1 month ago

This is fixed for FP32 precision in the llm_examples_main PR: https://github.com/pytorch/TensorRT/pull/3002