hardmaru / WorldModelsExperiments

World Models Experiments

MemoryError on doomrnn #19

Open xiaoschannel opened 5 years ago

xiaoschannel commented 5 years ago

I am running into what looks like an out-of-memory error:

When I load 500 episodes, the program runs fine: the VAE trains and the loss decreases. When I load 2000 episodes, I get the following:

Traceback (most recent call last):
  File "vae_train.py", line 77, in <module>
    dataset = create_dataset(dataset)
  File "vae_train.py", line 60, in create_dataset
    data = np.zeros((M, 64, 64, 3), dtype=np.uint8)
MemoryError

The repo uses 10k episodes, but I cannot load even 2k on my 16 GB machine. Am I missing something? If memory really is the issue here, how much RAM is needed to replicate the paper with the code here?

hardmaru commented 5 years ago

Hi @zuoanqh

Thanks for the issue. It's due more to laziness on my part than to the actual requirements for training the VAE.

When I was running the experiments, I was doing them on virtual cloud instances with GPUs, 64-core CPUs, and a few hundred GBs of RAM, so I was lazy and just loaded the entire dataset into a single numpy array (the line you quoted: data = np.zeros((M, 64, 64, 3), dtype=np.uint8)), which dumped hundreds of GBs of data directly into RAM.
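
For a rough sense of scale, here is a back-of-envelope sketch (the ~2100 frames per episode figure is the DoomTakeCover episode cap from the paper, so treat it as an assumption):

# Back-of-envelope memory estimate for the preloaded dataset.
# Assumption: up to ~2100 frames per DoomTakeCover episode.
bytes_per_frame = 64 * 64 * 3                 # uint8 frames -> 12,288 bytes
frames_per_episode = 2100                     # rough upper bound
episodes = 10000
total_bytes = bytes_per_frame * frames_per_episode * episodes
print(total_bytes / 1e9)                      # ~258 GB for all 10k episodes

By the same arithmetic, 2k episodes come to roughly 50 GB, so a 16 GB machine cannot preload them either way.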

If you want to train the VAE with very little RAM, feel free to refactor the code to use the more modern tf.data API, which can lazily load batches from disk to construct mini-batches and handle the training input pipeline.

Here are a few tutorials on how to use tf.data:

https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428

https://www.tensorflow.org/guide/datasets
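
For reference, here is a rough (untested) sketch of what that refactor could look like. It assumes each episode is stored as a .npz file in a record/ directory with an "obs" array of uint8 frames shaped (frames, 64, 64, 3); adjust the path and key to match your actual recording format:

import os
import numpy as np
import tensorflow as tf

DATA_DIR = "record"  # hypothetical directory of per-episode .npz files

def frame_generator():
    # Stream frames one episode at a time, so only a single episode
    # ever sits in memory instead of the whole dataset.
    for fname in sorted(os.listdir(DATA_DIR)):
        if not fname.endswith(".npz"):
            continue
        obs = np.load(os.path.join(DATA_DIR, fname))["obs"]
        for frame in obs:
            yield frame.astype(np.float32) / 255.0

dataset = tf.data.Dataset.from_generator(
    frame_generator,
    output_types=tf.float32,
    output_shapes=(64, 64, 3))
dataset = dataset.shuffle(10000)  # shuffle within a 10k-frame buffer
dataset = dataset.batch(100)      # VAE mini-batches
dataset = dataset.prefetch(1)     # overlap disk reads with training

Note that a shuffle buffer only approximates a global shuffle, so for better mixing you could also reshuffle the episode filenames each epoch.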

Best.

xiaoschannel commented 5 years ago

Thank you so much! I will look into that. Should I send a PR if I get it to work?

hardmaru commented 5 years ago

Yeah, that would be great!
