I'm having the same issue. I haven't fixed it yet.
I attached torch.autograd.detect_anomaly() to the main training loop, and it detected a NaN error in the value prediction network of SAC here.
Since you later mask the NaNs and Infs, it is hard to say whether that is related. Given the gradient clipping, it seems unlikely that the forward models would suddenly and completely diverge or collapse. It seems more likely that there is an issue with the NaN/Inf masking -- that we end up predicting the next state from a NaN/Inf state.
What confuses me is that there is a check and a warning for this in imagination.py, but no way to handle it safely.
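For reference, here is roughly how anomaly detection can be attached; the global switch is PyTorch's torch.autograd.set_detect_anomaly, and the names in the wrapped update step are purely illustrative, not MAX's actual code:

import torch

torch.autograd.set_detect_anomaly(True)  # global switch; slows training but pinpoints the op producing NaNs

# or wrap only the suspect update step:
x = torch.randn(8, 4, requires_grad=True)  # placeholder inputs
value_net = torch.nn.Linear(4, 1)          # stand-in for the SAC value network
with torch.autograd.detect_anomaly():
    loss = value_net(x).mean()
    loss.backward()  # raises at the first NaN gradient and names the offending forward op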
I have experienced this issue only once before, and after investigating I convinced myself that it was due to "bad luck" in the warm-up data, so I am surprised that you say that "almost every experiment would crash". The underlying problem is that the policy exploits the model, which starts "dreaming" absurdly high rewards; this presumably leads to huge V/Q-values and, eventually, to Infs/NaNs. With verbosity=3, you can see something like this:
13:09:46 | INFO | episode | step_reward. mean: 24353.95 +- 150840.66 [-2.67, 1368097.62]
13:09:46 | INFO | episode | step_reward. mean: 28201.39 +- 190058.91 [-2.59, 1855795.25]
13:09:46 | INFO | episode | step_reward. mean: 35032.18 +- 254233.11 [-3.30, 2684573.50]
13:09:46 | INFO | episode | step_reward. mean: 23426.37 +- 140640.00 [-3.38, 1202750.00]
13:09:46 | INFO | episode | step_reward. mean: 33987.25 +- 220136.44 [-3.05, 2216282.00]
13:09:47 | INFO | episode | step_reward. mean: 51998.88 +- 368336.62 [-2.80, 3857665.00]
13:09:47 | INFO | episode | step_reward. mean: 84999.31 +- 573147.12 [-2.47, 5745555.00]
13:09:47 | INFO | episode | step_reward. mean: 121692.67 +- 871324.50 [-2.70, 8948482.00]
13:09:47 | INFO | episode | step_reward. mean: 125470.11 +- 982328.94 [-127924.93, 10083945.00]
13:09:48 | INFO | episode | step_reward. mean: 187612.44 +- 1410389.38 [-2.86, 14323131.00]
13:09:48 | INFO | episode | step_reward. mean: 287003.62 +- 1920799.75 [-2.76, 15664406.00]
13:09:48 | INFO | episode | step_reward. mean: 278497.91 +- 1884108.25 [-2.89, 15320821.00]
13:09:48 | INFO | episode | step_reward. mean: 334041.50 +- 2474476.00 [-2.59, 26274620.00]
I am sure that playing with the hyperparameters should prevent this from happening. I think increasing the warm-up to 1024 samples or slightly increasing policy_alpha should already help.
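With Sacred's command-line config overrides this is a one-flag change, e.g. (same syntax as the commands further down; policy_alpha can be appended as another key=value pair):

python3 main.py with max_explore env_noise_stdev=0.02 n_warm_up_steps=1024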
To be precise, the code in this repo is not 100% the same code that was used to run the experiments for the paper; this one contains one bug fix. In the original code we were applying one spurious tanh(action), effectively shrinking the environment's [-1, 1] action space. (This affected all the algorithms, so it did not matter for the relative comparison between them described in the paper; that is why we did not need to redo the experiments and never hit the NaN issue.) But as a side effect, this bug apparently also limited how much the policy could exploit the model.
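A quick numpy illustration of how the extra tanh compresses actions that are already in [-1, 1] (just to show the effect, not the repo's code):

import numpy as np

a = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # actions already in the env's [-1, 1] range
print(np.tanh(a))                           # [-0.7616 -0.4621  0.      0.4621  0.7616]
# the spurious tanh squashes the usable action range to roughly [-0.76, 0.76]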
If you wish to exactly reproduce the results in the paper, here is the full diff to apply:
diff repo_max/imagination.py paper_max/imagination.py
42c42
< next_state_means, next_state_vars = self.model.forward_all(self.states, actions) # shape: (n_actors, ensemble_size, d_state)
---
> next_state_means, next_state_vars = self.model.forward_all(self.states, actions) # shape: both (ensemble_size, n_actors, d_action)
diff repo_max/main.py paper_max/main.py
130c130
< use_best_policy = False # execute the best policy or the last one
---
> use_best_policy = False # transfer the main exploration buffer as off-policy samples to SAC
251d250
< loss.backward()
252a252
> loss.backward()
269,270d268
< if verbosity >= 2:
< _log.info(f'epoch: {epoch_i:3d} training_loss: {tr_loss:.2f}')
376c374
< # to be fair to reactive methods, clear real env data in SAC buffer, to prevent further gradient updates from it.
---
> # to be fair to reactive methods, clear real env data in buffer, to prevent further gradient updates from it
385c383
< ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
---
> ep_return = agent.episode(env=mdp, warm_up=warm_up)
419,420c417
< @ex.capture
< def transition_novelty(state, action, next_state, model, renyi_decay):
---
> def transition_novelty(state, action, next_state, model):
427c424
< measure = JensenRenyiDivergenceUtilityMeasure(decay=renyi_decay)
---
> measure = JensenRenyiDivergenceUtilityMeasure(decay=0.1)
433c430
< def evaluate_task(env, model, buffer, task, render, filename, record, save_eval_agents, verbosity, _run, _log):
---
> def evaluate_task(env, model, buffer, task, render, filename, record, save_eval_agents, _run):
451,452c448
< n = transition_novelty(state, action, next_state, model=model)
< novelty.append(n)
---
> novelty.append(transition_novelty(state, action, next_state, model=model))
455,456d450
< if verbosity >= 3:
< _log.info(f'reward: {reward:5.2f} trans_novelty: {n:5.2f} action: {action}')
479,486d472
< # Uncomment for exploration coverage in ant
< #from envs.ant import rate_buffer
< #coverage = rate_buffer(buffer=buffer)
< #_run.log_scalar("coverage", coverage, step_num)
< #_run.result = coverage
< #_log.info(f"coverage: {coverage}")
< #return coverage
<
688,689d673
< checkpoint(buffer=buffer, step_num=n_exploration_steps)
<
698a683
>
diff repo_max/models.py paper_max/models.py
129a130,131
> actions = torch.tanh(actions)
>
Only in repo_max: readme.md
diff repo_max/sac.py paper_max/sac.py
32d31
< self.ptr = 0
33a33
> self.ptr = 0
39d38
< self.buffer_full = False
51d49
< self.buffer_full = False
64,67d61
<
< # skip ones with NaNs and Infs
< skip_mask = danger_mask(states) + danger_mask(actions) + danger_mask(rewards) + danger_mask(next_states)
< include_mask = (skip_mask == 0)
69d62
< n_samples = torch.sum(include_mask).item()
73d65
< self.buffer_full = True
76c68,72
< j = self.ptr + n_samples
---
>
> # skip ones with NaNs and Infs
> skip_mask = danger_mask(states) + danger_mask(actions) + danger_mask(rewards) + danger_mask(next_states)
> include_mask = (skip_mask == 0)
> j = self.ptr + torch.sum(include_mask).item()
87c83
< idxs = np.random.randint(len(self), size=batch_size)
---
> idxs = np.random.randint(self.ptr, size=batch_size)
94,98d89
< def __len__(self):
< if self.buffer_full:
< return self.size
< return self.ptr
<
303c294
< def episode(self, env, warm_up=False, train=True, verbosity=0, _log=None):
---
> def episode(self, env, warm_up=False, train=True):
318,320d308
< if verbosity >= 3 and _log is not None:
< _log.info(f'step_reward. mean: {torch.mean(rewards).item():5.2f} +- {torch.std(rewards).item():.2f} [{torch.min(rewards).item():5.2f}, {torch.max(rewards).item():5.2f}]')
<
diff repo_max/wrappers.py paper_max/wrappers.py
13c13
< action = np.clip(action, -1., 1.)
---
> action = np.tanh(action)
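For context on the masking in the sac.py hunks above: danger_mask presumably flags samples containing NaNs or Infs so they can be skipped before being written into the replay buffer. A minimal sketch of such a mask (a guess at the intent, not necessarily the repo's exact implementation):

import torch

def danger_mask(x):
    # flag rows (samples) that contain any NaN or Inf: 1 = skip, 0 = keep
    flat = x.reshape(x.shape[0], -1)
    return (torch.isnan(flat) | torch.isinf(flat)).any(dim=1).to(torch.uint8)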
Thanks! Actually, my subsequent runs do not collapse. It just takes a few more tries.
This repo's code still has a chance of crashing. After setting verbosity=3, the reward does not seem to converge. Could you please push the final version to the master branch?
11:01:07 | INFO | episode | step_reward. mean: 86085883396096.00 +- 829155918741504.00 [-298888250523648.00, 9337457543741440.00]
11:01:07 | INFO | episode | step_reward. mean: -21670574161920.00 +- 438323323600896.00 [-4848044193349632.00, 699136185729024.00]
11:01:07 | INFO | episode | step_reward. mean: -248891307982848.00 +- 3394395332149248.00 [-38263421258432512.00, 1774258237734912.00]
11:01:07 | INFO | episode | step_reward. mean: -143906335358976.00 +- 2610979338715136.00 [-28923108635181056.00, 4401228008128512.00]
11:01:07 | INFO | episode | step_reward. mean: 1921059548823552.00 +- 20858781403447296.00 [-1080617462661120.00, 235933342527127552.00]
11:01:07 | INFO | episode | step_reward. mean: 2565822589435904.00 +- 26671938034204672.00 [-2312866195570688.00, 301287592127627264.00]
11:01:08 | INFO | act | ep: 63 average step return: 41819492609799.95
/data/git/max/imagination.py:49: UserWarning: NaN in sampled next states!
warnings.warn("NaN in sampled next states!")
11:01:08 | ERROR | _emit_failed | Failed after 0:02:29!
Traceback (most recent calls WITHOUT Sacred internals):
File "main.py", line 740, in main
return do_max_exploration()
File "main.py", line 645, in do_max_exploration
average_performance = evaluate_tasks(buffer=buffer, step_num=step_num)
File "main.py", line 496, in evaluate_tasks
ep_return, ep_novelty = evaluate_task(env=env, model=model, buffer=buffer, task=task, render=render, filename=filename)
File "main.py", line 448, in evaluate_task
action, mdp, agent, _ = act(state=state, agent=agent, mdp=mdp, buffer=buffer, model=model, measure=task.measure, mode='exploit')
File "main.py", line 385, in act
ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
File "/data/git/max/sac.py", line 317, in episode
self.replay.add(states, actions, rewards, next_states)
File "/data/git/max/sac.py", line 82, in add
self.masks[i:j] = masks
RuntimeError: The expanded size of the tensor (0) must match the existing size (128) at non-singleton dimension 0. Target sizes: [0, 1]. Tensor sizes: [128, 1]
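The RuntimeError itself is just a slice-assignment size mismatch. Judging from the sac.py hunks above, when every transition of the 128-step episode is flagged by the NaN/Inf mask, j stays equal to self.ptr, so the target slice self.masks[i:j] is empty while the incoming masks tensor still has 128 rows. A toy reproduction with made-up shapes:

import torch

buffer_masks = torch.zeros(1000, 1)
episode_masks = torch.zeros(128, 1)
i = j = 10                          # j == i when the count of non-NaN samples is 0
buffer_masks[i:j] = episode_masks   # RuntimeError: expanded size (0) vs existing size (128)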
To see whether this is really happening, I executed 4 runs on the current master with more warm-up steps for higher stability. I used the following command:
main.py with max_explore env_noise_stdev=0.02 n_warm_up_steps=1024
and found no NaN problems.
If somebody still has problems, please use the current branch and provide the config seeds used.
I ran the experiments several times, and almost every experiment would crash when a NaN was sampled.
The script I use is:
python3 main.py with max_explore env_noise_stdev=0.02
Here are some of the logs:
I don't know what's going wrong. Could you please help me?