Hi, thanks for your great work!
I ran into a problem when running the script test_trajectory_calvin.sh. I changed "ngpu" to 1 before running it. Could something be going wrong in the data processing?
This is the error:
Exception: Unable to add DataPipe function name sharding_filter as it is already taken
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 54756) of binary: /home/jaylonw42/.conda/envs/3d_diffuser_actor/bin/python
Traceback (most recent call last):
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jaylonw42/.conda/envs/3d_diffuser_actor/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
online_evaluation_calvin/evaluate_policy.py FAILED
Failures:
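
In case it helps with diagnosis, here is a minimal sketch for reporting the installed versions of `torch` and `torchdata` in this environment. It rests on an assumption the log above does not confirm: that the `sharding_filter` registration clash comes from a version mismatch between those two packages. The helper names `installed_version` and `version_report` are hypothetical and not part of the repository.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(package_names):
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in package_names}


if __name__ == "__main__":
    # Assumption: these two packages are the relevant ones for the DataPipe error.
    for name, version in version_report(["torch", "torchdata"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the `3d_diffuser_actor` conda environment and including its output in the report would make it easier to tell whether the two packages are a compatible pair.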