hardmaru / WorldModelsExperiments

World Models Experiments

Training the VAE model takes more than 12 hours #4

Closed kessler-frost closed 6 years ago

kessler-frost commented 6 years ago

I used a machine with 24 vCPUs, 220 GB RAM, a 200 GB hard drive, and 4 P100 GPUs to run the gpu_jobs.bash process. Even after training the VAE model for more than 12 hours, the training doesn't end; it has completed 37500 steps with loss at 37.4, recon_loss at 5.4, and kl_loss at 32. I removed the line os.environ["CUDA_VISIBLE_DEVICES"]="0" in vae_train.py so that multiple GPUs could be used.
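
For reference, a minimal sketch of what that line controls. Note that removing it only makes the extra GPUs visible to TensorFlow; it does not by itself split the VAE training across them (that placement behaviour is general TensorFlow behaviour, not something confirmed for this specific script):

```python
import os

# Sketch for illustration, assuming the variable is set before TensorFlow is
# imported (as the original line in vae_train.py appears to be).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # restricts TensorFlow to GPU 0 only

import tensorflow as tf

# Removing the line above makes all four P100s visible to TensorFlow, but a
# single-session training script still places its ops on one device unless the
# graph is explicitly distributed, so the extra GPUs mostly just have their
# memory reserved.
print(tf.test.gpu_device_name())  # e.g. "/device:GPU:0"
```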

hardmaru commented 6 years ago

Hi Sankalp,

Thanks for the details. I need to check the epochs and steps in the repo.

Can you let me know whether you are training the Doom experiment or the car racing experiment?

Also, did it finish in the end?

Thanks

kessler-frost commented 6 years ago

I did it for the Doom experiment, and no, it didn't stop; I had to Ctrl-C the script. Also, if possible, can you add the loss, recon_loss, kl_loss, etc. that you yourself were able to achieve to the blog or a text file? It would help us get an idea of when to stop the training to get somewhat acceptable results. Thanks!

hardmaru commented 6 years ago

I’ll take a look at this. I’m currently travelling all week, so I might not get back to you until sometime later this month, when I can look at this issue.

zmonoid commented 6 years ago

@kessler-frost Isn't a reconstruction loss of 5.4 a good result? By default it trains for 11 epochs, and I got a reconstruction loss around 15.x; the 32.x KL loss is due to the kl_tolerance. For me it took about 3 hours to finish VAE training.
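
For readers following along, an illustrative sketch of how a kl_tolerance floor of this kind typically behaves. The kl_tolerance=0.5 and z_size=64 values below are assumptions, chosen because they would make the floor exactly 32 and so match the kl_loss reported above:

```python
# Illustrative sketch of a KL-tolerance floor (values are assumptions, not
# read from the repo): the KL term is clamped from below, so once it reaches
# kl_tolerance * z_size the optimizer stops pushing it lower and the reported
# kl_loss plateaus there.
kl_tolerance = 0.5  # assumed tolerance per latent dimension
z_size = 64         # assumed latent dimensionality

def effective_kl(raw_kl):
    return max(raw_kl, kl_tolerance * z_size)

print(effective_kl(10.0))  # 32.0 -> a kl_loss sitting at ~32 is expected, not a bug
print(effective_kl(40.0))  # 40.0 -> above the floor, the raw KL is reported
```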

kessler-frost commented 6 years ago

@zmonoid That is what I was referring to; I didn't know at which point, or at what loss, I should stop it. Were you able to complete all 11 epochs? Mine just went on for a while; it ran the whole night and the steps kept increasing, seemingly without end. What instance configuration did you use?

zmonoid commented 6 years ago

For me there was no problem. Maybe you modified the number of epochs for training in vae_train.py?
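
To make that diagnostic concrete, a hypothetical sketch of how the epoch count translates into a total step count; the constant name NUM_EPOCH and all numbers here are placeholders, not values read from vae_train.py:

```python
# Hypothetical numbers only; the real constants near the top of vae_train.py
# may use different names and values.
NUM_EPOCH = 10          # assumed number of passes over the recorded frames
num_frames = 1_000_000  # assumed total frames in the .npz dataset
batch_size = 100        # assumed batch size

steps_per_epoch = num_frames // batch_size
total_steps = NUM_EPOCH * steps_per_epoch
# If training runs far past this step count, either the epoch constant was
# changed or the loop is not terminating as intended.
print(total_steps)  # 100000 under these assumptions
```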

kessler-frost commented 6 years ago

Nope, I didn't change any file. Moreover, I won't be able to reproduce this error because I've burned through all of my credits. So I guess I should close this until somebody else either trains without any issues or is able to reproduce it?

zmonoid commented 6 years ago

@kessler-frost I also have a problem reproducing this paper, though not with this part. I will release my code once it is resolved.

hardmaru commented 6 years ago

Hi All,

I re-ran gpu_jobs.bash on a fresh P100 machine with a fresh clone of the repo (after running the data generation script and copying the .npz files to /record), and the whole thing finished in 7-8 hours.

Below is the time log, along with the training-progress logs for both the VAE and the RNN, as requested:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/gpu_jobs.log.txt

From here, I'm not sure why it isn't working on your specific machine setup. I've listed the precise versions of everything I used in the blog post:

http://blog.otoro.net/2018/06/09/world-models-experiments/

kessler-frost commented 6 years ago

@hardmaru Thanks, I guess I will close this issue since it seems to be a machine-specific case (possibly due to dependency conflicts with Anaconda's Python).

hardmaru commented 6 years ago

Cool. In the future, if I have some free time, I might try to log the exact setup of the Google Cloud virtual machine instance or put together a Docker setup ...
