[Help]: The training memory usage of valle_v2 on libritts dataset train-360 and train-100 increases (#263)

Open · CriDora opened this issue 3 months ago
Why does the CPU memory usage increase after each training epoch? As a result, I have to stop training and resume from a checkpoint every few epochs. Is it because of `{train: dataloader: "persistent_workers": true}` in the configuration file?

Thanks for letting us know about this issue. We have located the problem: to fix it, you can change line 437 in https://github.com/open-mmlab/Amphion/blob/main/models/tts/valle_v2/base_trainer.py from `epoch_sum_loss += loss` to `epoch_sum_loss += loss.item()`. I'll create a PR for this, thanks!
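For context, here is a minimal sketch of the pitfall the fix addresses. The loop below is illustrative only (the model, data, and step count are made up and are not the actual valle_v2 trainer code); it contrasts accumulating the loss tensor with accumulating `loss.item()`.

```python
# Minimal sketch (hypothetical model/data, not base_trainer.py itself) of why
# accumulating the loss tensor grows memory and how .item() avoids it.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def run_epoch(num_steps: int = 100) -> float:
    epoch_sum_loss = 0.0
    for _ in range(num_steps):
        x = torch.randn(8, 16)
        y = torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(x), y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Problematic: `epoch_sum_loss += loss` turns the running sum into a
        # tensor that keeps a reference to every step's loss and its autograd
        # history, so that memory is never released and usage keeps growing.
        # Fixed: `.item()` converts the 0-dim loss tensor to a plain Python
        # float, so nothing from the graph is retained across steps.
        epoch_sum_loss += loss.item()
    return epoch_sum_loss / num_steps

print(run_epoch())
```

The design point is that `.item()` detaches the scalar from autograd entirely, so the per-epoch sum is just a Python float and memory stays flat across epochs instead of accumulating with each step.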