pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

Too much memory consumed during evaluation and training #3211

Closed: AnnaTrainingG closed this 3 months ago

AnnaTrainingG commented 4 months ago

🐛 Bug description

Problem: when I set persistent_workers=True in the DataLoader, the memory used by the model roughly doubles. For example, training takes 344G of CPU memory; after one epoch, when _run_once_on_dataset runs for evaluation, CPU memory usage rises to 630G, and this memory is never released after evaluation. As a result, it runs out of memory (OOM) when the GPU memory size is very small.
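One plausible mechanism, offered as a hypothesis rather than a diagnosis: with persistent_workers=True, a DataLoader keeps its worker processes alive after a pass ends, so once evaluation has run, the validation loader's workers (and any memory they hold) stay resident next to the training loader's. The sketch below uses only PyTorch; the toy TensorDataset, num_workers=2 and the helper name live_workers_after_epoch are assumptions, and it counts surviving worker processes rather than measuring memory.

```python
import multiprocessing

import torch
from torch.utils.data import DataLoader, TensorDataset


def live_workers_after_epoch(persistent: bool, num_workers: int = 2) -> int:
    """Run one training pass and one evaluation pass, then count DataLoader workers still alive.

    Hypothetical helper: the toy dataset stands in for the real one, and only the
    lifetime of the worker processes is measured, not their memory.
    """
    dataset = TensorDataset(torch.randn(256, 4))
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True,
                              num_workers=num_workers, persistent_workers=persistent)
    val_loader = DataLoader(dataset, batch_size=32,
                            num_workers=num_workers, persistent_workers=persistent)

    baseline = len(multiprocessing.active_children())
    for (batch,) in train_loader:  # stand-in for the training steps
        batch.sum()
    for (batch,) in val_loader:  # stand-in for the evaluation run
        batch.mean()
    # Exhausting a non-persistent loader shuts its workers down; a persistent
    # loader keeps them (and the memory they hold) alive for the next pass.
    return len(multiprocessing.active_children()) - baseline


if __name__ == "__main__":  # guard needed for the "spawn" start method (macOS, Windows)
    print("persistent_workers=False:", live_workers_after_epoch(False))
    print("persistent_workers=True: ", live_workers_after_epoch(True))
```

With num_workers=2, the non-persistent run is expected to report 0 surviving workers and the persistent run 4 (two per loader). Each worker is a separate process, so a memory reading taken only for the main process may not show where the extra usage sits.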


Environment

vfdev-5 commented 4 months ago

@niuliling123 thanks for reporting this issue! Could you provide a minimal reproducible example? At first glance, persistent_workers is a PyTorch DataLoader option, so ignite may have little impact on it. I would like to figure out whether this is a PyTorch or an ignite problem.

AnnaTrainingG commented 3 months ago

I got it, thanks.