This PR reduces the memory leak in the dataloader. When tested on 8 GPUs with 8 dataloaders (360p, local batch size 5), memory usage drops from ~450G to under 300G.
## Memory usage calculation
A batch of 51 frames at 360p with batch size 5 is 1.3G. With the default `prefetch_factor` of 2 (per worker, so with 8 workers each dataloader prefetches 2 × 8 = 16 batches), the preloaded batches consume 16 × 1.3G × 8 dataloaders ≈ 166G. Loading a 1080p video tends to use more than 10G, so with 8 dataloaders loading consumes ~100G. Adding other memory usage, 300G is acceptable.
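The arithmetic above can be sketched as follows. This is a back-of-the-envelope estimate only; the 16 preloaded batches per dataloader is an assumption derived from PyTorch's per-worker `prefetch_factor` semantics with 8 workers:

```python
# Back-of-the-envelope estimate for memory held by prefetched batches.
batch_gb = 1.3          # one 51-frame 360p batch (bs=5), measured
prefetch_factor = 2     # PyTorch default, counted per worker
num_workers = 8         # assumed workers per dataloader
num_dataloaders = 8     # one dataloader per GPU

batches_per_loader = prefetch_factor * num_workers          # 16 batches
preload_gb = batches_per_loader * batch_gb * num_dataloaders

print(f"~{preload_gb:.1f}G held by prefetched batches")
```

This matches the ~166G figure quoted above and explains most of the gap between the batch size on disk and the observed resident memory.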
## Memory leak reason
There are three causes:

1. `torchvision.io.read` calls `pyav`, which leaks memory. When you allocate memory after iterating over a pyav container (`iter(container.decode(video=0))`), the memory leaks. The root cause is not clearly identified, but it is likely that iterating spawns multiple threads, and memory allocated during iteration that cannot be deallocated immediately (e.g., stored in a Python list) leaks.
2. `torchvision.io.read` does not call `container.close()` and `gc.collect()` frequently enough.
3. Some objects in the dataloader need to be explicitly deleted to prevent memory leaks.
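The allocation pattern suspected in the first cause can be illustrated without pyav. This is a minimal sketch with a stand-in decoder: `fake_decode` is hypothetical and merely mimics iterating `container.decode(video=0)`. The point is only the contrast between keeping every per-frame allocation alive in a Python list versus copying frames into one preallocated buffer:

```python
import numpy as np

def fake_decode(num_frames, height=360, width=640):
    """Hypothetical stand-in for iterating a pyav container."""
    for _ in range(num_frames):
        yield np.zeros((height, width, 3), dtype=np.uint8)

# Suspected leaky pattern: each frame is a fresh allocation kept
# alive by the list, so nothing can be freed until the list dies.
leaky = [frame for frame in fake_decode(8)]

# Pattern used by the fix: one buffer allocated up front, frames
# copied in place, so no per-frame allocation survives the loop.
buf = np.empty((8, 360, 640, 3), dtype=np.uint8)
for i, frame in enumerate(fake_decode(8)):
    buf[i] = frame
```

With real pyav the per-frame allocations interact with decoder threads, which is where the leak is suspected to originate; the buffer pattern sidesteps that by never holding per-frame objects.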
## Memory leak solution
For the latter two causes, the solution is straightforward: close the container, call `gc.collect()`, and explicitly delete the offending objects. For the first, we rewrite `torchvision.io.read` to decode into a numpy buffer created in advance, avoiding the memory leak.
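A minimal sketch of what such a rewrite could look like. The function name and signature are hypothetical (the actual rewrite lives in this PR), it assumes pyav is installed and that the frame count and resolution are known up front:

```python
import gc
import numpy as np

def read_video_prealloc(path, num_frames, height, width):
    """Hypothetical sketch of the rewritten reader: decode into a
    preallocated numpy buffer, close the container, and force a
    collection so no per-frame allocation outlives the call."""
    import av  # pyav; imported lazily so the module loads without it

    buf = np.empty((num_frames, height, width, 3), dtype=np.uint8)
    container = av.open(path)
    try:
        for i, frame in enumerate(container.decode(video=0)):
            if i >= num_frames:
                break
            buf[i] = frame.to_ndarray(format="rgb24")
    finally:
        container.close()  # torchvision.io.read skips this
    gc.collect()           # reclaim decoder-side garbage promptly
    return buf
```

Because the destination buffer exists before iteration starts, no memory is allocated after the pyav iterator begins, which is exactly the condition the leak analysis above identifies.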
## Other known memory leaks (won't be fixed soon)
`pyav` still leaks some memory.