System Info

transformers version: 4.41.2

Who can help?

@ArthurZucker @younesbelkada @zucchini-nlp

Reproduction
Init model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel

base_model = 'google/flan-t5-xxl'
ckpt = './results/checkpoints_t5_1/checkpoint-4600'
device = 'cuda:0'

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model,
    device_map=device,
    max_memory={0: "60GB"},
    trust_remote_code=True,
    torch_dtype=torch.float16,
    offload_state_dict=True,
)
model = PeftModel.from_pretrained(model, ckpt)
model.eval()
model = torch.compile(model)
```
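Note that `max_memory` is a load-time constraint: it only controls where the checkpoint's weights get placed when the model is loaded, not how much CUDA memory the process may allocate at runtime. A quick sanity check of the allocator's view right after init (my own sketch, not part of the original report):

```python
# Sanity-check sketch (assumption, not in the original repro): compare PyTorch's
# allocator counters against the max_memory budget. max_memory only bounds weight
# placement during loading; activations created by generate() are not capped by it.
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e9:.1f} GB")
```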
Create generation config
```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=224,
    stream_output=False,
    model=model,
    use_cache=False,
)
```
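Two of these settings work against memory: `use_cache=False` makes every decoding step re-run the forward pass over the whole sequence so far, and `num_beams=4` keeps four hypotheses alive, multiplying the effective batch size during decoding. A lower-memory variant might look like this (a sketch under my own assumptions, not the reporter's config):

```python
# Hypothetical lower-memory config (assumption): reuse past key/values between
# decoding steps and drop beam search so only one hypothesis is kept alive.
low_mem_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_p=0.75,
    top_k=40,
    num_beams=1,
    max_new_tokens=224,
    use_cache=True,
)
```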
Generate text
```python
inputs = create_prompt(example)  # create_prompt/example come from the reporter's own dataset code
input_ids = inputs['input_ids'].to(device)
gt = example['output']

with torch.inference_mode():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=False,
        output_scores=False,
    )
```
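With `return_dict_in_generate=False`, `generate()` returns a plain tensor of token ids, so the output can presumably be decoded directly (my addition, not shown in the original report):

```python
# Decoding sketch (assumption): generation_output is a LongTensor of token ids.
text = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
print(text)
```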
And track memory with this function:
```python
import subprocess

def show_gpu(msg):
    """
    ref: https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4
    """
    def query(field):
        # Ask nvidia-smi for a single memory field in plain CSV form.
        return subprocess.check_output(
            ['nvidia-smi', f'--query-gpu={field}', '--format=csv,nounits,noheader'],
            encoding='utf-8')

    def to_int(result):
        # Take the first GPU's value from the query output.
        return int(result.strip().split('\n')[0])

    used = to_int(query('memory.used'))
    total = to_int(query('memory.total'))
    pct = used / total
    print('\n' + msg, f'{100 * pct:2.1f}% ({used} out of {total})')
```
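The numbers below were presumably collected with something like this loop (a reconstruction on my part, not the reporter's exact script):

```python
# Hypothetical measurement loop (assumption): print nvidia-smi usage after
# initialization and again after each of three generations.
show_gpu('After init:')
for i in range(3):
    with torch.inference_mode():
        model.generate(input_ids=input_ids, generation_config=generation_config)
    show_gpu(f'After generation {i + 1}:')
```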
After initialization we've used:

```
GPU 32.1% (26157 out of 81559)
```

Then the first, second, and third generations:

```
GPU 63.0% (51387 out of 81559)
GPU 63.0% (51389 out of 81559)
GPU 93.9% (76613 out of 81559)
```

And we allocated more memory than we specified in max_memory.
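One caveat when reading these numbers: nvidia-smi reports memory reserved by PyTorch's caching allocator, which holds on to freed blocks for reuse, so usage can grow and stay high without being a true leak. A possible mitigation to test between generations (a sketch, not a confirmed fix):

```python
import gc

# Workaround sketch (assumption): drop references to the generation output and
# hand the allocator's cached-but-unused blocks back to the driver.
del generation_output
gc.collect()
torch.cuda.empty_cache()
show_gpu('After empty_cache:')
```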
Expected behavior

The model should either free memory after generation or not try to allocate more memory than we specified in max_memory.
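To tell a genuine leak apart from a transient activation peak that the caching allocator keeps around, one could check PyTorch's peak counters (again my own diagnostic sketch, not part of the report):

```python
# Diagnostic sketch (assumption): if memory_allocated() returns to its post-init
# level after generate(), nothing leaked; the growth nvidia-smi shows is cache.
torch.cuda.reset_peak_memory_stats(0)
with torch.inference_mode():
    model.generate(input_ids=input_ids, generation_config=generation_config)
print(f"peak during generate: {torch.cuda.max_memory_allocated(0) / 1e9:.1f} GB")
print(f"allocated after:      {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
```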
Hey! Was your issue fixed?