pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

parallel_loader: Fix TPU memory leak when calling __iter__. #8039

Closed: dudulightricks closed this 1 month ago

dudulightricks commented 2 months ago

A TPU memory leak occurs when __iter__ is called on the MpDeviceLoader object, even if it is called only once. The growth becomes more noticeable, and eventually critical, when __iter__ is called repeatedly, ultimately leading to a crash. The leak happens because the loader's worker threads are never terminated: the close() method was not being invoked.
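Roughly, the pattern that triggers the leak looks like this (the dataset and shapes below are just placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
# Placeholder dataset; any host-side DataLoader reproduces the pattern.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)
mp_loader = pl.MpDeviceLoader(loader, device)

for epoch in range(1000):
    # Each `for ... in mp_loader` invokes __iter__ again, starting new
    # loader threads; without close() they are never shut down, so memory
    # grows with every epoch until the process crashes.
    for batch in mp_loader:
        pass
```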

This commit resolves the issue by ensuring that close() is called, which properly shuts down the threads and prevents memory from leaking.
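A rough sketch of the idea (an illustration, not the exact diff): __iter__ yields from the per-device loader and guarantees close() runs once iteration finishes or is abandoned.

```python
import torch_xla.distributed.parallel_loader as pl

class MpDeviceLoader:
    """Sketch of the fixed wrapper; names mirror the real class but the
    body is only illustrative."""

    def __init__(self, loader, device, **kwargs):
        self._loader = loader
        self._device = device
        self._parallel_loader_kwargs = kwargs

    def __iter__(self):
        parallel_loader = pl.ParallelLoader(
            self._loader, [self._device], **self._parallel_loader_kwargs)
        try:
            # Yield batches already transferred to the TPU device.
            yield from parallel_loader.per_device_loader(self._device)
        finally:
            # Shut down the worker threads even if the consumer stops
            # iterating early; this is what prevents the leak.
            parallel_loader.close()
```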

dudulightricks commented 2 months ago

@JackCaoG

will-cromar commented 1 month ago

LGTM. Please address the linter complaints, then we can merge this.

dudulightricks commented 1 month ago

@will-cromar Thanks. Fixed.

dudulightricks commented 1 month ago

@will-cromar Merging?