huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CLIPTokenizerFast causes memory leak #30930

Closed janchen0611 closed 3 months ago

janchen0611 commented 3 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

I found that CLIPTokenizerFast instances are never released. Try the code below:

from transformers import CLIPTokenizerFast
import objgraph
import gc

if __name__ == "__main__":
    objgraph.growth(limit=10)

    print ("create tokenizer...")
    for i in range(10):
        tokenizer = CLIPTokenizerFast.from_pretrained('/workspace/models/stable-diffusion/tokenizer')
        del tokenizer
    print ("GC")
    gc.collect()

    print ("Growth objects:")
    print ("===")
    objgraph.show_growth(limit=10)

And the output will be:

create tokenizer...
GC
Growth objects:
===
dict                          24115       +23
cell                           9345       +20
list                           9103       +11
tuple                         23711       +11
builtin_function_or_method    10499       +10
function                      38346       +10
CLIPTokenizerFast                10       +10
RegexFlag                        57        +4
frozenset                       439        +1
Pattern                         175        +1

You can see that 10 CLIPTokenizerFast instances still exist.
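If you want to see what is keeping the instances alive, here is a minimal sketch using objgraph's back-reference helpers (it assumes graphviz is installed for the rendered image; the filename and depth are arbitrary):

import objgraph

# Collect the CLIPTokenizerFast instances that survived gc.collect()
leaked = objgraph.by_type('CLIPTokenizerFast')
if leaked:
    # Render the chain of referrers that keeps the first leaked instance alive
    objgraph.show_backrefs(leaked[0], max_depth=3, filename='tokenizer_backrefs.png')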

I also checked CLIPTokenizerFast's code, and I guess it is caused by the _wrap_decode_method_backend_tokenizer() hack: https://github.com/huggingface/transformers/blob/v4.41.0/src/transformers/models/clip/tokenization_clip_fast.py#L94

It replaces the backend tokenizer's decode method with a closure over the tokenizer object itself, which creates a circular reference.
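To make the cycle concrete, here is a minimal, self-contained sketch of the pattern (simplified stand-ins, not the actual transformers code). A pure-Python cycle like this one can still be reclaimed by the cycle collector, so the sketch only illustrates the reference pattern; in the real case the closure is attached to the Rust-backed backend tokenizer.

class Backend:
    # Plays the role of the backend tokenizer object held by the fast tokenizer.
    def decode(self, ids):
        return " ".join(str(i) for i in ids)

class Wrapper:
    # Plays the role of CLIPTokenizerFast (hypothetical, simplified).
    def __init__(self):
        self.end_of_word_suffix = "</w>"
        self.backend_tokenizer = Backend()

        orig_decode = self.backend_tokenizer.decode

        def new_decode(*args, **kwargs):
            text = orig_decode(*args, **kwargs)
            # Reading an attribute of `self` makes the closure capture the wrapper...
            return text.replace(self.end_of_word_suffix, " ").strip()

        # ...and storing the closure on an object the wrapper owns closes the loop:
        # Wrapper -> backend_tokenizer -> new_decode (closure cell) -> Wrapper
        self.backend_tokenizer.decode = new_decode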

Expected behavior

The CLIPTokenizerFast objects should be correctly released after garbage collection.

ArthurZucker commented 3 months ago

Hey! Thanks for reporting. Is this only affecting CLIPTokenizerFast? This seems valid, and I can reproduce it:

In [2]: from transformers import CLIPTokenizerFast
   ...: import objgraph
   ...: import gc
   ...: 
   ...: if __name__ == "__main__":
   ...:     objgraph.growth(limit=10)
   ...: 
   ...:     print ("create tokenizer...")
   ...:     for i in range(10):
   ...:         tokenizer = CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')
   ...:         del tokenizer
   ...:     print ("GC")
   ...:     gc.collect()
   ...: 
   ...:     print ("Growth objects:")
   ...:     print ("===")
   ...:     objgraph.show_growth(limit=10)
   ...: 
create tokenizer...
GC
Growth objects:
===
dict                          55027      +656
function                      48480      +445
Base                            268      +268
tuple                         35246      +219
TagInfo                         147      +147
ReferenceType                  7983       +97
builtin_function_or_method    10369       +90
method_descriptor              5675       +76
list                          32860       +66
type                           4460       +50
dhaivat1729 commented 3 months ago

@ArthurZucker

I took a stab at this here: https://github.com/huggingface/transformers/pull/31075

This is my first PR to HF transformers, so please let me know if I need to do anything else or if I missed something. There was a circular reference in the _wrap_decode_method_backend_tokenizer method, so I introduced a local variable and passed that in instead to circumvent the issue.
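For readers following along, the idea is roughly the following: read the only value the closure needs into a local variable before defining it, so the closure no longer captures self. This is a sketch of the pattern, not necessarily the exact diff; see the PR for the actual change.

def _wrap_decode_method_backend_tokenizer(self):
    orig_decode_method = self.backend_tokenizer.decode

    # Read the suffix into a local variable here, instead of looking it up
    # through `self` inside the closure, so the closure does not hold `self`.
    end_of_word_suffix = self.backend_tokenizer.model.end_of_word_suffix

    def new_decode_method(*args, **kwargs):
        text = orig_decode_method(*args, **kwargs)
        return text.replace(end_of_word_suffix, " ").strip()

    self.backend_tokenizer.decode = new_decode_method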

ArthurZucker commented 3 months ago

Alright, I'll have a look! 🤗