hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

Fix/memory leak #526

Closed · zhengzangw closed this 1 week ago

zhengzangw commented 1 week ago

This PR mitigates the memory leak in the dataloader. When tested on 8 GPUs with 8 dataloader workers (360p, local batch size 5), memory usage drops from ~450 GB to under 300 GB.

(Screenshot, 2024-06-22: memory usage before and after the fix.)

Memory usage calculation

A batch of 51 frames at 360p with local batch size 5 is about 1.3 GB. With the default prefetch_factor of 2 and 8 workers, each GPU holds 2 × 8 = 16 preloaded batches, so prefetching consumes 16 × 1.3 GB × 8 GPUs ≈ 166 GB. Decoding a 1080p video can take more than 10 GB, so with 8 dataloaders decoding at once, loading consumes ~100 GB. Adding other memory usage, ~300 GB in total is acceptable.
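
For reference, a quick arithmetic check of these figures (reading "8 dataloaders" as 8 workers per GPU is an assumption):

```python
# Rough check of the numbers above (values taken from this PR).
batch_gb = 1.3        # one 51-frame 360p batch, local batch size 5
prefetch_factor = 2   # PyTorch DataLoader default
num_workers = 8       # assumed meaning of "8 dataloaders"
num_gpus = 8

# Each worker keeps prefetch_factor batches preloaded.
prefetch_gb = prefetch_factor * num_workers * batch_gb * num_gpus
print(f"prefetched batches: {prefetch_gb:.1f} GB")  # 166.4 GB

decode_gb = 100  # ">10 GB" per in-flight 1080p decode, ~100 GB across loaders
print(f"total: ~{prefetch_gb + decode_gb:.0f} GB plus other usage")  # ~266 GB
```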

Memory leak reasons

  1. torchvision.io.read calls PyAV, which is the main source of the leak. Allocating memory while iterating the PyAV container (iter(container.decode(video=0))) leaks. The root cause is not fully identified, but iterating likely spawns multiple decoder threads, and memory allocated during iteration that cannot be deallocated immediately (e.g., frames appended to a Python list) is never reclaimed; see the sketch after this list.
  2. torchvision.io.read does not call container.close(), and gc.collect() is not run frequently enough.
  3. Some objects in the dataloader need to be deleted explicitly to prevent leaks.
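
A minimal reproduction of the pattern in point 1, assuming direct PyAV usage (an illustration of the leak, not code from the PR; the input path is hypothetical):

```python
import av

container = av.open("sample.mp4")  # hypothetical input file
frames = []
for frame in container.decode(video=0):
    # Allocating while the decoder's threads iterate: these arrays are what
    # cannot be reclaimed promptly, so resident memory keeps growing.
    frames.append(frame.to_ndarray(format="rgb24"))
# The container is never closed and gc never runs here, which also
# reproduces points 2 and 3.
```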

Memory leak solution

For the latter two, the fix is straightforward: close the container, trigger garbage collection, and delete objects explicitly. For the first, we rewrite torchvision.io.read to decode into a NumPy buffer allocated in advance, which avoids the leak.
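
A sketch of the buffer-based rewrite, assuming fixed-size frames and the standard PyAV API (the function name and signature are illustrative, not the PR's exact code):

```python
import gc

import av
import numpy as np
import torch

def read_video_into_buffer(path, num_frames, height, width):
    # Allocate the output once, before touching the PyAV container, so no
    # Python-side allocations happen while the decode threads iterate.
    buf = np.empty((num_frames, height, width, 3), dtype=np.uint8)
    container = av.open(path)
    try:
        for i, frame in enumerate(container.decode(video=0)):
            if i >= num_frames:
                break
            # Assumes frames already match (height, width); real code would
            # resize or validate here.
            buf[i] = frame.to_ndarray(format="rgb24")
    finally:
        container.close()  # point 2: release FFmpeg resources explicitly
    gc.collect()           # point 2: collect promptly instead of waiting
    return torch.from_numpy(buf)
```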

Other known memory leaks (won't be fixed for now)

  1. Creating models leaks memory (~4 GB)
  2. PyAV still leaks some memory