libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Thread leak with FFCV x tqdm #234

Open ed1d1a8d opened 1 year ago

ed1d1a8d commented 1 year ago

I used FFCV with tqdm for a long-running job and noticed that it crashed from using too many threads (~4000 threads had built up linearly over 30 hours, and the OS eventually stepped in and forbade my process from creating any new threads).

Upon further investigation, I found that when FFCV is used with tqdm, there seems to be a thread leak (i.e. new threads are created that never get deleted).

Here's a reproduction of the issue: https://gist.github.com/ed1d1a8d/424e5bc83325c93037cfe2de9e457a68

I'm curious if this is an issue with FFCV or an issue with tqdm, and if it is a known problem.

TL;DR It seems like the following ways of using ffcv with tqdm are broken:

# This has a thread leak
for _ in tqdm(loader):
    pass

# This also has a thread leak
with tqdm(loader) as pbar:
    for _ in pbar:
        pass

but the following methods are OK:

# Without tqdm, there is no thread leak!
for _ in loader:
    pass

# Manual tqdm is also okay!
with tqdm(total=len(loader)) as pbar:
    for _ in loader:
        pbar.update(1)
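
For anyone who wants to check which of the patterns above leaks on their own setup, here is a minimal sketch that records the live thread count after each pass. The helper name `thread_counts` and its `run_epoch` parameter are hypothetical and are not part of FFCV or tqdm; it only relies on the standard library's `threading.active_count()`.

```python
import threading


def thread_counts(run_epoch, epochs=5):
    """Call `run_epoch` several times and record the live thread count after each call.

    `run_epoch` is a hypothetical zero-argument callable that performs one full
    pass over the loader (with or without tqdm).
    """
    counts = []
    for _ in range(epochs):
        run_epoch()
        counts.append(threading.active_count())
    return counts
```

A count that keeps growing from one pass to the next would suggest a leak, while a flat count would suggest the threads are being cleaned up. For example, one could pass a small function that runs `for _ in tqdm(loader): pass` and compare the result against one that runs `for _ in loader: pass`.
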

GuillaumeLeclerc commented 1 year ago

The only explanation I can find is that tqdm somehow keeps a reference to the iterator, which in FFCV also extends Thread. With manual tqdm you never hand the iterator to tqdm, so it can't hold on to it, and the problem doesn't appear. It must therefore be a problem with tqdm. What happens if you .join() the iterator after the iteration: does it block forever? If it doesn't block, the thread has completed and tqdm is just keeping a reference to it for some odd reason. You can also inspect the garbage collector to see what is actually holding a reference to the object and blocking its collection.
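
The two checks suggested above (joining the iterator and asking the garbage collector who references it) could be sketched roughly as follows. `ThreadedIterator` is a hypothetical stand-in for an iterator that subclasses `Thread`; it is not FFCV's actual class, and `diagnose` is likewise an illustrative helper. Only standard-library calls (`Thread.join`, `Thread.is_alive`, `gc.get_referrers`) are used.

```python
import gc
import queue
import threading


class ThreadedIterator(threading.Thread):
    """Hypothetical iterator that is also a Thread (an illustration, not FFCV code)."""

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, items):
        super().__init__(daemon=True)
        self._items = list(items)
        self._queue = queue.Queue()
        self.start()

    def run(self):
        # Producer side: push every item, then the end-of-stream sentinel.
        for item in self._items:
            self._queue.put(item)
        self._queue.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is self._DONE:
            raise StopIteration
        return item


def diagnose(it, timeout=1.0):
    """Join with a timeout, then list the objects that still reference `it`."""
    it.join(timeout)  # returns after `timeout` seconds even if the thread is stuck
    finished = not it.is_alive()
    referrers = gc.get_referrers(it)
    return finished, referrers
```

If `finished` comes back `True`, the thread itself has exited and whatever remains is only a lingering reference; the types of the objects in `referrers` then hint at who is holding it. If `finished` is `False` after the timeout, the thread is still running or blocked, which points at the iterator rather than at its holder.
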