Closed: rraallvv closed this issue 5 years ago.
Try a smaller `batch_size` or `max_time_steps`. With the default settings you will need ~10 GB of GPU memory. I will add a note about this to the README later.
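For illustration, a minimal sketch of shrinking those two values, assuming the `HParams`-style `hparams.py` the repo used at the time (the import path and the value 8000 are only examples; check `hparams.py` for the real names and defaults):

```python
# Sketch: override the two hparams discussed above before building the data
# loaders, so each batch holds fewer and shorter audio segments.
from hparams import hparams  # the repo's hparams module (assumed import path)

hparams.parse("batch_size=1,max_time_steps=8000")  # example values, tune for your GPU
```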
Yes, just set `batch_size=1` and a smaller `max_time_steps`; I think you then only need about 4 GB of GPU memory, otherwise it will easily run out of memory.
Hi @r9y9, I have found that when evaluating the model, it may be better to set `requires_grad` of the model to `False` and not backpropagate the loss; otherwise the memory use will double and quickly run out.
I don't think it matters if you have sufficient GPU memory, but yes, ideally `requires_grad` should be set to `False` in eval mode.
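For reference, a minimal generic PyTorch sketch of the idea being discussed here (not the exact code in `train.py`): run the eval step under `torch.no_grad()` so no autograd graph is built and activation memory is released right away.

```python
import torch

def evaluate(model, batch):
    # Evaluate without tracking gradients; activations are not retained
    # for a backward pass, so memory use stays close to the forward pass alone.
    model.eval()
    with torch.no_grad():
        output = model(batch)
    model.train()
    return output
```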
Hi, sorry to re-open this, but I'm running into the same problem.
I'm trying to train the model on the LJSpeech corpus, conditioned on mel-spectrograms. I followed the instructions and didn't change anything in the LJSpeech presets JSON. I'm training on an NVIDIA P100 GPU, which has 16 GB of memory. I can see it fills up immediately when I start training, and training fails quite quickly with an out-of-memory exception. This is the error I get (same as the issue opener's):
```
TensorBoard event log path: log/run-test2018-05-07_11:36:28.215518
0it [00:00, ?it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 967, in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 973, in
```
@Ola-Vish Because r9y9 has updated the repo for PyTorch 0.4, I haven't tested it, but I guess you could change this line https://github.com/r9y9/wavenet_vocoder/blob/186edcfba993223eefaec6f1a80756a0ab9e8dd3/train.py#L525 to:
```python
with torch.no_grad():
    y_hat = model.incremental_forward(
```

That may help.
@azraelkuan Thank you for the suggestion but sadly this didn't help :( Still the same problem.
Oops, I just forgot to add `torch.no_grad()`. Fixed.
@Ola-Vish Could you check whether some of the examples at https://github.com/pytorch/examples work for you? Do you think the problem is specific to wavenet_vocoder?
One thing I think I might be doing wrong is calling `module.incremental_forward(x)` instead of `module(x)`. I'm not quite familiar with the internals of PyTorch, but according to the Facebook team it seems the way we are doing it now is not recommended. See https://github.com/pytorch/fairseq/commit/50fdf591464ca63940a2c1c5e7057b2f4df034f5#diff-9f76bb3e5dd085949139bba958f8aa3d. That has been on my todo list for a while, but I haven't done it yet since it hasn't caused a problem for me so far...
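For context, here is a minimal sketch (a hypothetical toy module, not the actual wavenet_vocoder code) of the pattern the linked fairseq change moves toward: dispatch through the module's `__call__`/`forward` rather than invoking a custom method on the module directly, so forward hooks and `nn.DataParallel` scatter/gather still apply.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy stand-in for the WaveNet model."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 8)

    def forward(self, x, incremental=False):
        # All call paths go through forward(), so model(x, ...) behaves the
        # same whether or not the module is wrapped in nn.DataParallel.
        if incremental:
            return self.incremental_forward(x)
        return self.layer(x)

    def incremental_forward(self, x):
        # Simplified placeholder for step-by-step generation.
        return self.layer(x)

model = Generator()
y = model(torch.randn(1, 8), incremental=True)  # preferred over model.incremental_forward(x)
```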
@Ola-Vish I have tested the code of the latest version. I found that memory increases only slightly during the training process, so I wonder whether you are using all of the memory right from the start. Can you share some more details, such as batch_size, used memory, and total memory?
For the record, I have been training a model for two days with the latest code after merging #58 on Ubuntu 16.04 and haven't seen any issues.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The error below is thrown when I try to train `wavenet_vocoder` with the default parameters in a Jupyter Notebook on Google's Colaboratory site.