open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[Help]: Training memory usage of valle_v2 on the LibriTTS train-360 and train-100 subsets keeps increasing. #263

Open CriDora opened 1 month ago

CriDora commented 1 month ago

Why does the CPU memory usage increase after each training epoch? As a result, I have to stop training and resume from a checkpoint every few epochs. Is it caused by `"train": { "dataloader": { "persistent_workers": true } }` in the configuration file?
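For context, `persistent_workers` maps to the standard `torch.utils.data.DataLoader` argument, which keeps worker processes alive between epochs instead of respawning them. A minimal illustrative sketch (the toy dataset below is a placeholder, not Amphion's LibriTTS dataset class):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real LibriTTS dataset class.
dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))

# persistent_workers=True keeps the worker processes alive across epochs
# rather than respawning them each epoch; it requires num_workers > 0.
loader = DataLoader(dataset, batch_size=32, num_workers=2, persistent_workers=True)

for epoch in range(3):
    for x, y in loader:
        pass  # training step would go here
```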

jiaqili3 commented 1 month ago

Thanks for letting us know about the issue. We have located the problem: to fix it, change line 437 of https://github.com/open-mmlab/Amphion/blob/main/models/tts/valle_v2/base_trainer.py from `epoch_sum_loss += loss` to `epoch_sum_loss += loss.item()`. I'll create a PR for this, thanks!
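The underlying issue: accumulating the loss tensor itself keeps each step's tensor and its autograd history referenced for the whole epoch, so host memory grows with the number of training steps; accumulating the Python float returned by `.item()` does not. A minimal sketch of the pattern behind the fix (illustrative toy model and data, not the actual `base_trainer.py` code):

```python
import torch

# Toy model, optimizer, and data; placeholders for the real trainer objects.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

epoch_sum_loss = 0.0
num_steps = 100
for _ in range(num_steps):
    x, y = torch.randn(4, 8), torch.randn(4, 1)
    loss = criterion(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Leaky pattern: `epoch_sum_loss += loss` turns the accumulator into a
    # graph-attached tensor that references every step's loss, so memory
    # grows across the epoch (and across epochs if it is never released).
    # Fixed pattern: store only the detached scalar value.
    epoch_sum_loss += loss.item()

print("mean loss:", epoch_sum_loss / num_steps)
```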

