eferreirafilho opened 7 months ago
The problem may be related to numerical instabilities in the physics simulation. Try reducing the time step or increasing the number of solver iterations.
If this does not work, a quick fix is to make sure you never return an observation with NaNs, i.e. set the NaN values to 0. Training-wise you can treat this as a termination condition and reset the envs that have NaN values.
Hopefully the agent will learn not to enter these "NaN-prone" states.
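Something along these lines (a rough sketch, assuming obs_buf has shape (num_envs, num_obs) and reset_buf has shape (num_envs,), as in the standard OIGE tasks):

    # mask of envs whose observations contain NaN or Inf
    bad_envs = ~torch.isfinite(self.obs_buf).all(dim=-1)
    # zero out only the invalid entries so the policy never sees NaN/Inf
    self.obs_buf = torch.where(torch.isfinite(self.obs_buf), self.obs_buf, torch.zeros_like(self.obs_buf))
    # treat it as a termination condition so the affected envs are reset
    self.reset_buf = torch.where(bad_envs, torch.ones_like(self.reset_buf), self.reset_buf)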
Thanks for the help. It did not solve it, but I think it's happening less now.
I have added this to the end of get_observations():
    if torch.isnan(self.obs_buf).any():
        print("NaN found in obs_buf, replacing with zeros")
        self.obs_buf = torch.where(torch.isnan(self.obs_buf), torch.zeros_like(self.obs_buf), self.obs_buf)
    if torch.isinf(self.obs_buf).any():
        print("Inf found in obs_buf, replacing with zeros")
        self.obs_buf = torch.where(torch.isinf(self.obs_buf), torch.zeros_like(self.obs_buf), self.obs_buf)
    return observations
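As a side note, I think torch.nan_to_num could do both replacements in a single call, e.g. self.obs_buf = torch.nan_to_num(self.obs_buf, nan=0.0, posinf=0.0, neginf=0.0), but the behaviour should be equivalent.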
I also added this to the reset conditions:
    def is_done(self) -> None:
        # Check if the episode has reached its maximum length
        self.reset_buf = torch.where(self.progress_buf >= self._max_episode_length, torch.ones_like(self.reset_buf), self.reset_buf)
        # Termination condition for NaN values in observations:
        # check for any NaNs in the observations buffer
        if torch.isnan(self.obs_buf).any():
            print("NaN detected in observations, triggering reset.")
            # Set all entries in the reset buffer to 1 to indicate a reset is needed
            self.reset_buf = torch.ones_like(self.reset_buf)
        # Handle Inf values similarly
        if torch.isinf(self.obs_buf).any():
            print("Inf detected in observations, triggering reset.")
            self.reset_buf = torch.ones_like(self.reset_buf)
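Note that this resets every env as soon as a single env produces a NaN/Inf. If only the affected envs should be reset, a per-env mask could be used instead (untested sketch):

    # per-env mask: True where an env's observation contains NaN or Inf
    nan_envs = torch.isnan(self.obs_buf).any(dim=-1) | torch.isinf(self.obs_buf).any(dim=-1)
    # only flag the affected envs for reset, leave the others running
    self.reset_buf = torch.where(nan_envs, torch.ones_like(self.reset_buf), self.reset_buf)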
I'm using the FrankaDeformable as a basis for my own OIGE simulation. I sometimes get this error:
It seems random. Sometimes I train for 200 iterations and the error appears, sometimes it appears after thousands of iterations, and sometimes it never appears.
The error seems to appear less often when running with fewer envs. My guess is that some robot poses trigger an invalid math operation in torch and cause the error, but I have no clue how to solve it.
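For example, I suspect something like this could be the culprit (a hypothetical guard, not code from my task): a dot product of two unit vectors can drift slightly outside [-1, 1] due to floating-point error, and acos then returns NaN, so clamping the input avoids it:

    # hypothetical: vec_a, vec_b are unit vectors of shape (num_envs, 3)
    cos_angle = torch.sum(vec_a * vec_b, dim=-1)
    # without the clamp, |cos_angle| > 1 makes torch.acos return NaN
    angle = torch.acos(torch.clamp(cos_angle, -1.0, 1.0))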
I get the same error in Isaac 2023.0.1-hotfix and Isaac 2023.1.1, and on two different machines:
Machine 1: Kubuntu 22.04, RTX A5000 (24 GB VRAM), NVIDIA-SMI 525.147.05, Driver Version 525.147.05, CUDA Version 12.0
Machine 2: Kubuntu 22.04, RTX 3080 (16 GB VRAM), NVIDIA-SMI 545.23.08, Driver Version 545.23.08, CUDA Version 12.3
I tried adding some checks to prevent the error, but it did not help, and execution never hits the implemented conditions:
Any ideas?
Thanks!!