real-stanford / cloth-funnels

[ICRA 2023] This repository contains code for training and evaluating Cloth Funnels in simulation for Ubuntu 18.04.

Evaluate Cloth Funnels Error #4

Open zcswdt opened 1 year ago

zcswdt commented 1 year ago

I set up the environment by following the installation instructions in the repository, and the installation succeeded. However, the evaluation step fails with the error below. My system is Ubuntu 18.04 with a 2070 (8 GB) graphics card.

(cloth-funnels) zcs@zcs:~/work/github/cloth-funnels$ python cloth_funnels/run_sim.py name="longsleeve-eval" load=../models/longsleeve_canonicalized_alignment.pth eval_tasks=../assets/tasks/multi-longsleeve-eval.hdf5 eval=True num_processes=1 episode_length=10 wandb=disabled fold_finish=True dump_visualizations=True
2023-08-16 16:08:33,304 ERROR services.py:1169 -- Failed to start the dashboard , return code 1
2023-08-16 16:08:33,304 ERROR services.py:1194 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-08-16 16:08:33,304 ERROR services.py:1238 -- The last 20 lines of /tmp/ray/session_2023-08-16_16-08-31_639412_8491/logs/dashboard.log (it contains the error message from the dashboard):
    from opencensus.common.transports import sync
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/opencensus/common/transports/sync.py", line 16, in <module>
    from opencensus.trace import execution_context
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/opencensus/trace/__init__.py", line 15, in <module>
    from opencensus.trace.span import Span
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/opencensus/trace/span.py", line 32, in <module>
    from opencensus.trace import status as status_module
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/opencensus/trace/status.py", line 15, in <module>
    from google.rpc import code_pb2
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/google/rpc/code_pb2.py", line 47, in <module>
    _descriptor.EnumValueDescriptor(
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
2023-08-16 16:08:33,363 INFO worker.py:1553 -- Started a local Ray instance.
SEEDING WITH 0
[Network] Initializing with inputs: rgb_pos
[Network] Initializing factorized network
[Network] Giving deformable network positional encoding
(raylet) [2023-08-16 16:08:34,138 E 8788 8900] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of grpcio doesn't follow Ray's requirement. Agent can segfault with the incorrect grpcio version. Check the grpcio version pip freeze | grep grpcio.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log cat /tmp/ray/session_latest/dashboard_agent.log. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
[Network Setup] Load checkpoint specified ../models/longsleeve_canonicalized_alignment.pth
[Network Setup] Action Exploration Probability: 1.0141e-03
[Network Setup] Value Exploration Probability: 1.0284e-06
[Network Setup] Train Steps: 6216
Replay Buffer path: ../experiments/08-16-1608-longsleeve-eval/replay_buffer.hdf5
Error executing job with overrides: ['name=longsleeve-eval', 'load=../models/longsleeve_canonicalized_alignment.pth', 'eval_tasks=../assets/tasks/multi-longsleeve-eval.hdf5', 'eval=True', 'num_processes=1', 'episode_length=10', 'wandb=disabled', 'fold_finish=True', 'dump_visualizations=True']
Traceback (most recent call last):
  File "/home/zcs/work/github/cloth-funnels/cloth_funnels/run_sim.py", line 114, in main
    envs, _ = setup_envs(dataset=dataset_path, **args)
  File "/home/zcs/work/github/cloth-funnels/cloth_funnels/utils/utils.py", line 133, in setup_envs
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
  File "/home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'CUDA_VISIBLE_DEVICES'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
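
For readers hitting the same `KeyError`: the traceback shows `os.environ["CUDA_VISIBLE_DEVICES"]` being indexed directly, which raises when the variable is not set in the shell. The snippet below is only a minimal sketch of a lookup that tolerates a missing variable. The helper name `describe_visible_devices` is hypothetical and is not part of the cloth-funnels code; setting the variable before launching the script avoids the error without touching the code.

```python
import os


def describe_visible_devices(environ=os.environ):
    """Build the CUDA_VISIBLE_DEVICES log line without raising KeyError.

    Hypothetical helper for illustration only; it is not taken from the
    cloth-funnels repository.
    """
    # Mapping.get() returns None for a missing key, whereas environ[...]
    # raises KeyError, as seen in the traceback above.
    devices = environ.get("CUDA_VISIBLE_DEVICES")
    if devices is None:
        return "CUDA_VISIBLE_DEVICES: <unset>"
    return "CUDA_VISIBLE_DEVICES: {}".format(devices)


print(describe_visible_devices())
```

Passing a plain dictionary instead of `os.environ` makes the behaviour easy to check without changing the real environment.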

alpercanberk commented 1 year ago

Hi, thanks for trying out our codebase. I have a few suggestions:

  1. Have you set the CUDA_VISIBLE_DEVICES environment variable, e.g. CUDA_VISIBLE_DEVICES=1,2 python run_sim.py etc.?
  2. I vaguely remember running into an issue like this. You may have to reinstall protobuf/install a different version, or reinstall ray completely.
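
Before reinstalling anything, it may help to see which versions are actually installed. The following is a rough sketch, not an official diagnostic: it reads installed versions through the standard-library `importlib.metadata` and flags a protobuf release newer than the 3.20.x series, which is the threshold named in the protobuf error message. The helper names `installed_version` and `protobuf_exceeds_3_20` are made up for this example, and the version parsing assumes a string with at least `major.minor` numeric components.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def protobuf_exceeds_3_20(version):
    """Return True if a protobuf version string is newer than 3.20.x.

    Assumes the string starts with numeric major and minor components,
    e.g. "3.20.3" or "4.23.4".
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


# Packages mentioned in the error output above.
for name in ("protobuf", "grpcio", "ray"):
    print(name, installed_version(name))

protobuf_version = installed_version("protobuf")
if protobuf_version is not None and protobuf_exceeds_3_20(protobuf_version):
    print("protobuf is newer than 3.20.x; the error message suggests downgrading")
```

Run it inside the same conda environment that is used for `run_sim.py`, since the versions are read from whichever environment the interpreter belongs to.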
zcswdt commented 1 year ago

Thank you very much. The evaluation now runs, but it prints 'Out of tasks' at the end. What is the reason for this?

(cloth-funnels) zcs@zcs:~/work/github/cloth-funnels$ python cloth_funnels/run_sim.py name="demo-single" load=../models/longsleeve_canonicalized_alignment.pth eval_tasks=../assets/tasks/longsleeve-single.hdf5 eval=True num_processes=1 episode_length=10 wandb=disabled fold_finish=True dump_visualizations=True
2023-08-17 10:36:06,182 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
SEEDING WITH 0
[Network] Initializing with inputs: rgb_pos
[Network] Initializing factorized network
[Network] Giving deformable network positional encoding
[Network Setup] Load checkpoint specified ../models/longsleeve_canonicalized_alignment.pth
[Network Setup] Action Exploration Probability: 1.0141e-03
[Network Setup] Value Exploration Probability: 1.0284e-06
[Network Setup] Train Steps: 6216
Replay Buffer path: ../experiments/08-17-1036-demo-single/replay_buffer.hdf5
CUDA_VISIBLE_DEVICES: 0
(TaskLoader pid=9565) [TaskLoader] Loading eval tasks
(SimEnv pid=9566) /home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/pytorch_lightning/utilities/migration/migration.py:201: PossibleUserWarning: You have multiple ModelCheckpoint callback states in this checkpoint, but we found state keys that would end up colliding with each other after an upgrade, which means we can't differentiate which of your checkpoint callbacks needs which states. At least one of your ModelCheckpoint callbacks will not be able to reload the state.
(SimEnv pid=9566) rank_zero_warn(
(SimEnv pid=9566) Lightning automatically upgraded your loaded checkpoint from v1.4.0 to v2.0.1. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../models/keypoint_model.ckpt
(SimEnv pid=9566) /home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
(SimEnv pid=9566) warnings.warn(
(SimEnv pid=9566) /home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=None.
(SimEnv pid=9566) warnings.warn(msg)
(TaskLoader pid=9565) [TaskLoader] 1/2
(SimEnv pid=9566) /home/zcs/miniconda3/envs/cloth-funnels/lib/python3.9/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
(SimEnv pid=9566) warnings.warn(
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 1.6523663997650146 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.31461071968078613 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.3144204616546631 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.31615495681762695 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.3159749507904053 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.3170597553253174 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.316359281539917 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.3157174587249756 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.31556272506713867 with #obs: 1
[RunSim] Points per hour: 0.0
[RunSim] Stepping env
Starting policy.act()
[Policy] Forward took: 0.31675195693969727 with #obs: 1
(SimEnv pid=9566) [SimEnv] Folding cloth
(SimEnv pid=9566) [SimEnv] Folding done
(SimEnv pid=9566) [Episode video dumped]
(SimEnv pid=9566) [Memory] Dumping memory to ../experiments/08-17-1036-demo-single/replay_buffer.hdf5
(TaskLoader pid=9565) [TaskLoader] 2/2
(TaskLoader pid=9565) [TaskLoader] Out of tasks