kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
https://slm-lab.gitbook.io/slm-lab/
MIT License

Potential Memory Leak #452

Closed: batu closed this issue 4 years ago

batu commented 4 years ago

Hello,

I am currently using SLM-Lab as the learning component for my custom Unity environments. I use a modified UnityEnv wrapper and run my experiments with a modified version of the starter code here.

When running both PPO and SAC, I noticed that the kernel kills the job after a while because it runs out of memory (RAM/swap).

Given the custom nature of this setup, I don't expect you to reproduce the bug; rather, I am asking whether you have ever faced a similar problem on your end.

Some more detail:

1) Initially, I assumed it was due to the size of the replay buffer. But even after the replay buffer was capped at a small number (1000) and got maxed out, the problem persisted.
2) The memory increase is roughly on the order of 1 MB/s, which is relatively high (see the monitoring sketch below).
3) I managed to trace it to the "train step" in SAC. I can't tell whether the memory is allocated there, but when the training steps aren't taken, there is no problem.
4) I tested with the default Unity envs to make sure I hadn't caused the problem with my custom env; this doesn't seem to be the cause.
5) We will be testing with the provided CartPole env to see whether the problem persists.
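For reference, a minimal way to confirm this kind of growth rate is to watch the training process's resident memory over time. A sketch assuming psutil is installed (it is not part of SLM-Lab; the PID argument and interval are placeholders):

```python
# Minimal RSS monitor sketch (assumes `pip install psutil`); attach it to the
# training process by PID and it prints resident memory and growth rate.
import sys
import time
import psutil

def watch(pid, interval_s=10):
    proc = psutil.Process(pid)
    prev = proc.memory_info().rss
    while True:
        time.sleep(interval_s)
        rss = proc.memory_info().rss
        rate = (rss - prev) / interval_s / 1024**2
        print(f'RSS: {rss / 1024**2:.1f} MiB, growth: {rate:+.2f} MiB/s')
        prev = rss

if __name__ == '__main__':
    watch(int(sys.argv[1]))
```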

Any guidance or tips would be appreciated! And once again thank you for the great library!

kengz commented 4 years ago

Unity environments consume much more memory. Another check you could make is to run an Atari environment and compare; that should not exhaust your memory. What is the maximum size the RAM usage grows to, and how much RAM does your machine have?

batu commented 4 years ago

I have 16 GB of RAM (+1 GB of swap) on my machine, and the Python process gradually takes all of it.

The annoying bit is that I can see the Python process growing in memory, not the Unity environment; initially I had also thought it was the Unity side of things.

Checking atari is a good idea. Thank you!

batu commented 4 years ago
Filename: /home/batu/Desktop/The_Agency/Libraries/SLM-Lab/slm_lab/agent/algorithm/ppo.py

Line #    Mem usage    Increment   Line Contents
================================================
   171    502.4 MiB    502.4 MiB       @profile
   172                                 def train(self):
   173    502.4 MiB      0.0 MiB           if util.in_eval_lab_modes():
   174                                         return np.nan
   175    502.4 MiB      0.0 MiB           clock = self.body.env.clock
   176    502.4 MiB      0.0 MiB           if self.to_train == 1:
   177    502.4 MiB      0.0 MiB               net_util.copy(self.net, self.old_net)  # update old net
   178    512.4 MiB     10.0 MiB               batch = self.sample()
   179    512.4 MiB      0.0 MiB               clock.set_batch_size(len(batch))
   180    609.1 MiB     96.7 MiB               _pdparams, v_preds = self.calc_pdparam_v(batch)
   181    611.4 MiB      2.3 MiB               advs, v_targets = self.calc_advs_v_targets(batch, v_preds)
   182                                         # piggy back on batch, but remember to not pack or unpack
   183    611.4 MiB      0.0 MiB               batch['advs'], batch['v_targets'] = advs, v_targets
   184    611.4 MiB      0.0 MiB               if self.body.env.is_venv:  # unpack if venv for minibatch sampling
   185    611.4 MiB      0.0 MiB                   for k, v in batch.items():
   186    611.4 MiB      0.0 MiB                       if k not in ('advs', 'v_targets'):
   187    611.4 MiB      0.0 MiB                           batch[k] = math_util.venv_unpack(v)
   188    611.4 MiB      0.0 MiB               total_loss = torch.tensor(0.0)
   189   1458.7 MiB      0.0 MiB               for _ in range(self.training_epoch):
   190   1402.4 MiB      0.8 MiB                   minibatches = util.split_minibatch(batch, self.minibatch_size)
   191   1458.7 MiB      0.0 MiB                   for minibatch in minibatches:
   192   1444.6 MiB      0.0 MiB                       if self.body.env.is_venv:  # re-pack to restore proper shape
   193   1444.6 MiB      0.0 MiB                           for k, v in minibatch.items():
   194   1444.6 MiB      0.0 MiB                               if k not in ('advs', 'v_targets'):
   195   1444.6 MiB      0.0 MiB                                   minibatch[k] = math_util.venv_pack(v, self.body.env.num_envs)
   196   1444.6 MiB      0.0 MiB                       advs, v_targets = minibatch['advs'], minibatch['v_targets']
   197   1448.7 MiB      6.3 MiB                       pdparams, v_preds = self.calc_pdparam_v(minibatch)
   198   1457.7 MiB      9.7 MiB                       policy_loss = self.calc_policy_loss(minibatch, pdparams, advs)  # from actor
   199   1457.7 MiB      0.0 MiB                       val_loss = self.calc_val_loss(v_preds, v_targets)  # from critic
   200   1457.7 MiB      0.0 MiB                       if self.shared:  # shared network
   201                                                     loss = policy_loss + val_loss
   202                                                     self.net.train_step(loss, self.optim, self.lr_scheduler, clock=clock, global_net=self.global_net)
   203                                                 else:
   204   1458.7 MiB     23.7 MiB                           self.net.train_step(policy_loss, self.optim, self.lr_scheduler, clock=clock, global_net=self.global_net)
   205   1458.7 MiB      4.2 MiB                           self.critic_net.train_step(val_loss, self.critic_optim, self.critic_lr_scheduler, clock=clock, global_net=self.global_critic_net)
   206   1458.7 MiB      0.0 MiB                           loss = policy_loss + val_loss
   207   1458.7 MiB      0.0 MiB                       total_loss += loss
   208   1458.7 MiB      0.0 MiB               loss = total_loss / self.training_epoch / len(minibatches)
   209                                         # reset
   210   1458.7 MiB      0.0 MiB               self.to_train = 0
   211   1458.7 MiB      0.0 MiB               logger.debug(f'Trained {self.name} at epi: {clock.epi}, frame: {clock.frame}, t: {clock.t}, total_reward so far: {self.body.env.total_reward}, loss: {loss:g}')
   212   1458.7 MiB      0.0 MiB               return loss.item()
   213                                     else:
   214                                         return np.nan

Something very weird happens on line 188 (this does not happen at every iteration, of course). This is the train function of PPO.

The memory usage spikes, apparently as a side effect of either the tensor allocation or the loop. Do you have any guesses as to what the side effect might be?
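One generic way to investigate what is holding memory between training steps is to enumerate the tensors that Python's garbage collector still references. A minimal sketch using plain PyTorch plus the standard library (not SLM-Lab code):

```python
# Group currently-referenced tensors by shape/dtype and report the largest totals.
# Call it before and after a train step to see what gets retained.
import gc
import torch

def live_tensor_report(top_n=10):
    sizes = {}
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                key = (tuple(obj.size()), str(obj.dtype))
                sizes[key] = sizes.get(key, 0) + obj.element_size() * obj.nelement()
        except Exception:  # some gc-tracked objects raise on attribute access
            continue
    for (shape, dtype), nbytes in sorted(sizes.items(), key=lambda kv: -kv[1])[:top_n]:
        print(f'{shape} {dtype}: {nbytes / 1024**2:.2f} MiB')
```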

kengz commented 4 years ago

PPO is on-policy, so the replay buffer size is capped. In fact, you will see RAM consumption rise and fall around training for PPO, because it clears the replay buffer after each update. Can you also share the spec you're using? If the time horizon, num_envs, and batch size are big, especially when coupled with state processing (stacked/sequential states), they will take up RAM as well.
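To put rough numbers on that, here is a back-of-envelope estimate of the state storage for one on-policy batch. All of the settings below are placeholder assumptions, not values from this issue:

```python
# Rough lower bound on the memory needed to hold the states of one on-policy
# training batch; the actual memory also stores actions, rewards, dones, etc.
import numpy as np

num_envs = 16               # assumed venv size
time_horizon = 128          # assumed steps collected per env before an update
frame_stack = 4             # assumed number of stacked frames
state_shape = (84, 84)      # assumed preprocessed frame size

bytes_per_state = np.prod(state_shape) * frame_stack * np.dtype(np.float32).itemsize
batch_bytes = num_envs * time_horizon * bytes_per_state
print(f'~{batch_bytes / 1024**2:.0f} MiB of states per training batch')
```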

Note that line 188's increment is actually 0 MiB. Line 189 shows the memory gain from the entire for-loop, not an allocation from the tensor assignment on line 188: the loop-header line executes again after each pass through the body, so the profiler ends up attributing the memory level reached inside the loop to that line. That number matches the end of the for-loop at line 207, 1458.7 MiB. So, nothing abnormal here.
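A toy example of this attribution behaviour, assuming memory_profiler is installed (the sizes and iteration count are arbitrary):

```python
# The memory reported on the `for` line reflects what is reached while looping,
# even though the loop header itself allocates nothing.
from memory_profiler import profile

@profile
def grow():
    chunks = []
    for _ in range(3):                              # memory reported here includes the appends below
        chunks.append(bytearray(50 * 1024 * 1024))  # ~50 MiB per iteration
    return len(chunks)

if __name__ == '__main__':
    grow()
```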

kengz commented 4 years ago

Closing due to inactivity; the memory leak doesn't actually seem to be happening.