huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CLIPTokenizerFast causes memory leak #30930

Closed janchen0611 closed 3 months ago

janchen0611 commented 3 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

I found that CLIPTokenizerFast instances are never released. Try the code below:

from transformers import CLIPTokenizerFast
import objgraph
import gc

if __name__ == "__main__":
    objgraph.growth(limit=10)

    print ("create tokenizer...")
    for i in range(10):
        tokenizer = CLIPTokenizerFast.from_pretrained('/workspace/models/stable-diffusion/tokenizer')
        del tokenizer
    print ("GC")
    gc.collect()

    print ("Growth objects:")
    print ("===")
    objgraph.show_growth(limit=10)

And the output will be:

create tokenizer...
GC
Growth objects:
===
dict                          24115       +23
cell                           9345       +20
list                           9103       +11
tuple                         23711       +11
builtin_function_or_method    10499       +10
function                      38346       +10
CLIPTokenizerFast                10       +10
RegexFlag                        57        +4
frozenset                       439        +1
Pattern                         175        +1

You can see that 10 CLIPTokenizerFast instances still exist.
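If you want to see what is keeping the instances alive, here is a minimal sketch using objgraph's back-reference helpers (it assumes graphviz is installed for the rendered image; the filename and depth are arbitrary):

import objgraph

# Collect the CLIPTokenizerFast instances that survived gc.collect()
leaked = objgraph.by_type('CLIPTokenizerFast')
if leaked:
    # Render the chain of referrers that keeps the first leaked instance alive
    objgraph.show_backrefs(leaked[0], max_depth=3, filename='tokenizer_backrefs.png')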

I also checked CLIPTokenizerFast's code, and I guess it is caused by the _wrap_decode_method_backend_tokenizer() hack: https://github.com/huggingface/transformers/blob/v4.41.0/src/transformers/models/clip/tokenization_clip_fast.py#L94

It replaces the backend tokenizer's decode method with a closure over the tokenizer object itself, which creates a circular reference.
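To make the cycle concrete, here is a minimal, self-contained sketch of the pattern (simplified stand-ins, not the actual transformers code). A pure-Python cycle like this one can still be reclaimed by the cycle collector, so the sketch only illustrates the reference pattern; in the real case the closure is attached to the Rust-backed backend tokenizer.

class Backend:
    # Plays the role of the backend tokenizer object held by the fast tokenizer.
    def decode(self, ids):
        return " ".join(str(i) for i in ids)

class Wrapper:
    # Plays the role of CLIPTokenizerFast (hypothetical, simplified).
    def __init__(self):
        self.end_of_word_suffix = "</w>"
        self.backend_tokenizer = Backend()

        orig_decode = self.backend_tokenizer.decode

        def new_decode(*args, **kwargs):
            text = orig_decode(*args, **kwargs)
            # Reading an attribute of `self` makes the closure capture the wrapper...
            return text.replace(self.end_of_word_suffix, " ").strip()

        # ...and storing the closure on an object the wrapper owns closes the loop:
        # Wrapper -> backend_tokenizer -> new_decode (closure cell) -> Wrapper
        self.backend_tokenizer.decode = new_decode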

Expected behavior

The CLIPTokenizerFast objects should be correctly released after garbage collection.

ArthurZucker commented 3 months ago

Hey! Thanks for reporting. Is this only affecting CLIPTokenizerFast? This seems valid, and I can reproduce it:

In [2]: from transformers import CLIPTokenizerFast
   ...: import objgraph
   ...: import gc
   ...: 
   ...: if __name__ == "__main__":
   ...:     objgraph.growth(limit=10)
   ...: 
   ...:     print ("create tokenizer...")
   ...:     for i in range(10):
   ...:         tokenizer = CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')
   ...:         del tokenizer
   ...:     print ("GC")
   ...:     gc.collect()
   ...: 
   ...:     print ("Growth objects:")
   ...:     print ("===")
   ...:     objgraph.show_growth(limit=10)
   ...: 
create tokenizer...
GC
Growth objects:
===
dict                          55027      +656
function                      48480      +445
Base                            268      +268
tuple                         35246      +219
TagInfo                         147      +147
ReferenceType                  7983       +97
builtin_function_or_method    10369       +90
method_descriptor              5675       +76
list                          32860       +66
type                           4460       +50
dhaivat1729 commented 3 months ago

@ArthurZucker

I took a stab at this here: https://github.com/huggingface/transformers/pull/31075

This is my first PR to HF transformers, so please let me know if I need to do anything else or if I missed something. There was a circular reference in the _wrap_decode_method_backend_tokenizer method, so I introduced a local variable and passed that in instead to circumvent the issue.
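For readers following along, the idea is roughly the following: read the only value the closure needs into a local variable before defining it, so the closure no longer captures self. This is a sketch of the pattern, not necessarily the exact diff; see the PR for the actual change.

def _wrap_decode_method_backend_tokenizer(self):
    orig_decode_method = self.backend_tokenizer.decode

    # Read the suffix into a local variable here, instead of looking it up
    # through `self` inside the closure, so the closure does not hold `self`.
    end_of_word_suffix = self.backend_tokenizer.model.end_of_word_suffix

    def new_decode_method(*args, **kwargs):
        text = orig_decode_method(*args, **kwargs)
        return text.replace(end_of_word_suffix, " ").strip()

    self.backend_tokenizer.decode = new_decode_method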

ArthurZucker commented 3 months ago

Alright, I'll have a look! 🤗