Error on running ReNN - GitHub Issues

harshakokel commented 4 years ago

Hello,

I am trying to follow the ReNN docs to replicate the results, and I get the following error when running mpirun -np 35 python examples/relationalrl/train_pickandplace1.py

Traceback (most recent call last):
  File "examples/relationalrl/train_pickandplace1.py", line 302, in <module>
    python_cmd=F"mpirun --allow-run-as-root -np {num_parallel_processes} python"
  File "-----/rlkit-relational-master/rlkit/launchers/launcher_util.py", line 590, in run_experiment
    **run_experiment_kwargs
  File "-----/rlkit-relational-master/rlkit/launchers/launcher_util.py", line 168, in run_experiment_here
    return experiment_function(variant)
  File "examples/relationalrl/train_pickandplace1.py", line 190, in experiment
    **variant['algo_kwargs']
  File "-----/rlkit-relational-master/rlkit/torch/her/her.py", line 224, in __init__
    TwinSAC.__init__(self, *args, **kwargs, **tsac_kwargs)
  File "-----/rlkit-relational-master/rlkit/torch/sac/twin_sac.py", line 152, in __init__
    self._sync_optimizers()
  File "-----/rlkit-relational-master/rlkit/torch/sac/twin_sac.py", line 465, in _sync_optimizers
    self.alpha_optimizer.sync()
AttributeError: 'HerTwinSAC' object has no attribute 'alpha_optimizer'
-----/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
Traceback (most recent call last):
  File "examples/relationalrl/train_pickandplace1.py", line 302, in <module>
    python_cmd=F"mpirun --allow-run-as-root -np {num_parallel_processes} python"
  File "-----/rlkit-relational-master/rlkit/launchers/launcher_util.py", line 590, in run_experiment
    **run_experiment_kwargs
  File "-----/rlkit-relational-master/rlkit/launchers/launcher_util.py", line 168, in run_experiment_here
    return experiment_function(variant)
  File "examples/relationalrl/train_pickandplace1.py", line 190, in experiment
    **variant['algo_kwargs']
  File "-----/rlkit-relational-master/rlkit/torch/her/her.py", line 224, in __init__
    TwinSAC.__init__(self, *args, **kwargs, **tsac_kwargs)
  File "-----/rlkit-relational-master/rlkit/torch/sac/twin_sac.py", line 152, in __init__
    self._sync_optimizers()
  File "-----/rlkit-relational-master/rlkit/torch/sac/twin_sac.py", line 465, in _sync_optimizers
    self.alpha_optimizer.sync()
AttributeError: 'HerTwinSAC' object has no attribute 'alpha_optimizer'
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[27423,1],4]
  Exit code:    1
--------------------------------------------------------------------------

Can you help me figure out what I am doing wrong?

richardrl commented 4 years ago

My guess is that the flag on this line is set to false: https://github.com/richardrl/rlkit-relational/blob/e986f15a21e9ee54d03eea654f55c4e587cb9263/rlkit/torch/sac/sac.py#L107

Set a breakpoint at that line to check. If it is false, make sure the class is initialized with use_automatic_entropy_tuning=True.
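
To illustrate the failure mode in the traceback, here is a minimal sketch. It is not the rlkit code: ToySAC and reproduces_error are hypothetical names, and it assumes, as guessed above, that alpha_optimizer is only created when use_automatic_entropy_tuning is true while a later sync step touches it unconditionally.

```python
class ToySAC:
    """Hypothetical stand-in for a SAC-style trainer; not the rlkit class."""

    def __init__(self, use_automatic_entropy_tuning=True):
        self.use_automatic_entropy_tuning = use_automatic_entropy_tuning
        if use_automatic_entropy_tuning:
            # The attribute only exists on this branch.
            self.alpha_optimizer = object()

    def sync_optimizers_unguarded(self):
        # Mirrors the failing pattern: assumes the attribute always exists.
        return self.alpha_optimizer

    def sync_optimizers_guarded(self):
        # Defensive variant: returns None when the optimizer was never created.
        return getattr(self, "alpha_optimizer", None)


def reproduces_error(flag):
    """Return True if the unguarded sync raises AttributeError for this flag."""
    try:
        ToySAC(use_automatic_entropy_tuning=flag).sync_optimizers_unguarded()
    except AttributeError:
        return True
    return False
```

With the flag off, the unguarded call fails the same way as the traceback, which is why checking the flag at a breakpoint is a reasonable first step.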

harshakokel commented 4 years ago

I added a breakpoint in sac.py, but it turns out sac.py was not being used. That led me to look into use_automatic_entropy_tuning in twin_sac.py, and I figured out that MPI was not installed in my Python env.

https://github.com/richardrl/rlkit-relational/blob/e986f15a21e9ee54d03eea654f55c4e587cb9263/rlkit/torch/sac/twin_sac.py#L104
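
For anyone hitting the same thing, a quick way to check whether a module is importable in the active environment is sketched below. The helper name module_available is hypothetical, and "mpi4py" is only an assumed package name for the MPI bindings; substitute whatever the repo actually imports.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "mpi4py" is an assumption, not confirmed by this thread.
    print("mpi4py importable:", module_available("mpi4py"))
```

Run it with the same interpreter that mpirun launches, since a different Python env may give a different answer.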

harshakokel commented 4 years ago

After running mpirun -np 35 python examples/relationalrl/train_pickandplace1.py for a few hours, the logs stopped updating. All the processes are still running on my machine, but I do not see any log updates. Is this expected behavior? I did not find any errors in the logs.

04_12.txt

harshakokel commented 4 years ago

Never mind, it turns out this was just a problem with the nohup command.

Details here: nohup-does-not-work-mpirun
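
The thread does not record the exact fix, so the following is only a sketch of one way to start a long job so that it does not depend on the terminal. It assumes the problem was the job staying tied to the terminal's stdin or session. launch_detached is a hypothetical helper, and start_new_session is POSIX-only.

```python
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session with stdin closed and output appended to a log file."""
    log_file = open(log_path, "ab")
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,   # the job never reads from the terminal
        stdout=log_file,
        stderr=subprocess.STDOUT,   # keep errors in the same log
        start_new_session=True,     # POSIX: run in a new session, separate from the terminal's
    )
    log_file.close()  # the child holds its own copy of the descriptor
    return proc
```

For the command in this issue, the call would look like launch_detached(["mpirun", "-np", "35", "python", "examples/relationalrl/train_pickandplace1.py"], "train.log"), where train.log is an arbitrary file name.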

richardrl commented 4 years ago

@harshakokel Excellent! Please let me know what results you are able to get and if you have any further issues.

richardrl / rlkit-relational

Error on running ReNN #6