guanlongzhao / fac-via-ppg

Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams (Interspeech'19)
Apache License 2.0

Memory-related issues encountered when training and pickling the features (on a custom dataset) #13

Closed tejuafonja closed 3 years ago

tejuafonja commented 3 years ago

Hi, thanks for the excellent work. I'm trying to train the ppg2mel model on my own dataset and ran into some problems, which I describe below. I would appreciate your advice on how to resolve them.

  1. When I set is_cache_feat=True, the process got Killed every time (whenever it tried to pickle the features after extracting them). I understand this might be a result of the CPU running out of memory when pickling the files. Did you ever encounter such a problem? If yes, how did you resolve it, please? I tried dataset sizes of 1314, 2054, and 3359, but when I reduced the dataset size to 10, I didn't encounter the error. (A minimal sketch of the per-utterance caching workaround I have in mind follows after this list.)

  2. For dataset sizes of 1314, 2054, and 3359, I set is_cache_feat=False because of (1). However, after the features were extracted, I got a CUDA out-of-memory error during training. The error reads: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 12.74 GiB already allocated; 4.75 MiB free; 14.80 GiB reserved in total by PyTorch). I understand this might be due to my GPU memory capacity (Tesla V100-SXM2, 15 GiB). Are there some hyperparameters that I could reduce in order to train the model successfully? I'm currently using the hyperparameters in create_hparams_stage(). When I reduced the dataset size to 10, everything worked well. These are the hparams used.

  3. What should my Train loss and Grad Norm look like, please? I wasn't sure how to interpret the per-epoch results and thought I would ask whether you could share your training log file. This is my training logfile with dataset trainsize=10 and validsize=10.
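
For (1), a possible workaround I have in mind is to cache each utterance's features to its own file right after extraction, instead of pickling everything in one go. This is only a minimal sketch, with a hypothetical extract_features(wav_path) helper standing in for the repo's actual extraction code:

```python
# Minimal sketch of per-utterance feature caching (not the repo's code).
# extract_features(wav_path) is a hypothetical stand-in that returns a
# numpy array of features for a single utterance.
import os
import numpy as np

def cache_features(wav_paths, cache_dir, extract_features):
    """Extract and save features one utterance at a time so that only a
    single utterance's features are ever held in memory."""
    os.makedirs(cache_dir, exist_ok=True)
    for wav_path in wav_paths:
        utt_id = os.path.splitext(os.path.basename(wav_path))[0]
        out_path = os.path.join(cache_dir, utt_id + ".npy")
        if os.path.exists(out_path):
            continue  # skip utterances that are already cached
        feats = extract_features(wav_path)
        np.save(out_path, feats)  # one small file per utterance

def load_cached_features(utt_id, cache_dir):
    """Load one utterance's cached features on demand."""
    return np.load(os.path.join(cache_dir, utt_id + ".npy"))
```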

Again, thank you for your work. It's really interesting and a huge contribution to research.

tejuafonja commented 3 years ago

I've investigated (1) and (2) further and can confirm that it's a problem with my resources. I solved it by switching to a GPU with more memory, and on the other server by reducing the batch size as well as the dataset size. I'm closing this topic since it's not an issue with the repository itself. I'd still be happy to get feedback on (3) if possible [the losses are still getting smaller and the predicted Mel spectrogram is starting to look like the target's, so fingers crossed].
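
For anyone who hits the same CUDA OOM: the batch-size change on my side amounted to roughly the sketch below. The import path and the value are from my setup and may differ in yours, and I'm assuming the object returned by create_hparams_stage() exposes a batch_size field, as in the Tacotron2-style hyperparameters this repo builds on:

```python
# Rough sketch of the batch-size override (not a verified recipe).
# The import path may differ in your checkout of the repo.
from common.hparams import create_hparams_stage

hparams = create_hparams_stage()
# Assuming a batch_size field exists on the hparams object;
# smaller batches reduce peak GPU memory at the cost of slower epochs.
hparams.batch_size = 8  # lowered from the default until training fits on the card

# hparams is then passed to the training entry point as usual.
```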

guanlongzhao commented 3 years ago

You can follow the instructions in https://github.com/guanlongzhao/fac-via-ppg#view-training-progress to visualize the training curves. You can monitor model convergence there.
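
Roughly, that means launching TensorBoard against the directory where the training logs are written; adjust the path below to match your own log directory:

```bash
# Point TensorBoard at your training log directory
# (replace the placeholder with wherever your logs are written).
tensorboard --logdir=<your_log_directory>
```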

tejuafonja commented 3 years ago

Thanks, yes, I'm doing this.