🐛 Bug

Information

I am extracting the internal representations of some of the Marian models. There seems to be a memory leak problem. In this issue, you will find code for running the model sentence by sentence (bsz = 1) just to keep it simple. When I use batching, the problem persists and arises earlier.

Model I am using: MarianMT modelnames=[f'Helsinki-NLP/opus-mt-en-{tgt}' for tgt in ['de', 'fr', 'ee', 'sv', 'el', 'fi', 'cs', 'ru' ]]

Language I am using the model on: en-{tgt]

The problem arises when using:

[ ] a mix of official example scripts and my own: on this code, I keep the lines used to see if it is a memory problem. Hence the empty_cache(), keeping track of the memory usage with memory_stats(), and passing things to 'cpu' (but this has not solved the problem for me)

import torch
import transformers
config_overrider={'output_attentions':True, 'output_hidden_states':True}
model = transformers.MarianMTModel.from_pretrained(modelname, **config_overrider)
tokenizer = transformers.MarianTokenizer.from_pretrained(modelname) 
model.eval()
encoded_sentences = []
memdict=[]
for sent in tqdm(sentences):
    tokdsent = self.tokenizer.prepare_translation_batch(src_texts=[' '.join(sent)]) 
    tokdsent = {k:v.to(self.device) for k,v in tokdsent.items()}
    model_outputs = self.model.forward(**tokdsent) 
    encoded_sentences.append( [x.to('cpu') for x in model_outputs[4]+model_outputs[1]]  )
    torch.cuda.empty_cache()

    memdict.append(torch.cuda.memory_stats(self.device))
    print(memdict[-1]['active.all.current'],memdict[-1]['active.all.peak']) # comment out this part

The tasks I am working on is:

[ ] Using a dataset from an official task: semantic textual similarity - STS 2012, 2013, 2014, 2015 and 2016 (can use this one)

To reproduce

Steps to reproduce the behavior:

Load and tokenize the sentences (need this for what I am doing, even when I detok when passing it to the tokenizer)
```
STS_path = "path/to/allSTS.txt"
with open(STS_path, 'r') as f:
samples = f.readlines()
```

sentences = [] for sent in samples: sent = sent.strip() sent = re.findall(r'[\w]+|.|,|\?|!|;|:|\'|(|)|/',sent) sentences.append(sent)

2. Run the code above. The one on _"The problem arises when using:"_
3. For me, around line 3750 I get OOM:

RuntimeError('CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 30.67 GiB already allocated; 17.69 MiB free; 30.67 GiB reserved in total by PyTorch)')

Here I copy some of the linesprinted from `active.all.current ` and `active.all.peak` (it never changes the upwards trend):

533 535 779 811 1025 1057 1271 1303 1517 1549 1763 1795 2009 2041 2255 2287 2501 2533 2747 2779 2993 3025 ... 9635 9667 9881 9913 10127 10159 10373 10405 10619 10651 ... 921311 921343 921557 921589 921803 921835 922049 922081 922295 922327 922541 922573


^-- these are the first 10 lines, somewhere around 40 sentences, and the last lines before running out of mem - close to 3750 sents. 

<!-- If you have code snippets, error messages, stack traces please provide them here as well.
     Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
     Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.-->

## Expected behavior

I would expect the memory on the cuda device to be freed after every iteration since I am overwriting the variable there and what I append to the list I want to keep is sent to cpu.  

## Environment info
<!-- You can run the command `transformers-cli env` and copy-and-paste its output below.
     Don't forget to fill out the missing fields in that output! -->

- `transformers` version:
- Platform: Linux 3.10.0-1062.7.1.el7.x86_64 x86_64, Red Hat Enterprise Linux Server 7.7 (Maipo)
- Python version: 3.7.3
- PyTorch version (GPU?): 1.5.0 for cuda 10.2 (Nvidia Volta V100 GPU with 32 GB of memory)
- Tensorflow version (GPU?):  not using tf
- Using GPU in script?: yes (but I have seen the same problem on CPU)
- Using distributed or parallel set-up in script?: no

huggingface / transformers

[marian] possible memory leak problem while translating & extracting internal representations #4518

🐛 Bug

Information

To reproduce