mees / calvin

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
http://calvin.cs.uni-freiburg.de
MIT License

Training error && multi GPU error #63

Open 2488583886 opened 11 months ago

2488583886 commented 11 months ago

Let me describe the problems I encountered and what I tried in order to solve them.

Environment:

aiosignal 1.3.1

antlr4-python3-runtime 4.8

appdirs 1.4.4

async-timeout 4.0.3

attrs 23.1.0

Calvin 0.0.1 /fuyujie/calvin/calvin_models

calvin-env 0.0.1 /fuyujie/calvin/calvin_env

certifi 2023.11.17

cffi 1.16.0

charset-normalizer 3.3.2

click 8.1.7

cloudpickle 3.0.0

cmake 3.18.4

colorlog 6.7.0

contourpy 1.1.1

cycler 0.12.1

decorator 4.4.2

docker-pycreds 0.4.0

filelock 3.13.1

fonttools 4.45.1

freetype-py 2.4.0

frozenlist 1.4.0

fsspec 2023.10.0

gitdb 4.0.11

GitPython 3.1.40

gym 0.26.2

gym-notices 0.0.8

huggingface-hub 0.19.4

hydra-colorlog 1.2.0

hydra-core 1.1.1

idna 3.6

imageio 2.33.0

imageio-ffmpeg 0.4.9

importlib-metadata 6.8.0

importlib-resources 6.1.1

joblib 1.3.2

kiwisolver 1.4.5

lightning-lite 1.8.6

lightning-utilities 0.10.0

llvmlite 0.41.1

lxml 4.9.3

markdown-it-py 3.0.0

matplotlib 3.7.4

mdurl 0.1.2

moviepy 1.0.3

MulticoreTSNE 0.1

multidict 6.0.4

networkx 2.2

nltk 3.8.1

numba 0.58.1

numpy 1.24.4

numpy-quaternion 2022.4.3

nvidia-cublas-cu11 11.10.3.66

nvidia-cuda-nvrtc-cu11 11.7.99

nvidia-cuda-runtime-cu11 11.7.99

nvidia-cudnn-cu11 8.5.0.96

omegaconf 2.1.2

opencv-python 4.8.1.78

packaging 23.2

pandas 2.0.3

Pillow 10.1.0

pip 23.3.1

plotly 5.18.0

proglog 0.1.10

protobuf 4.25.1

psutil 5.9.6

pybullet 3.2.5

pycollada 0.6

pycparser 2.21

pyglet 2.0.10

Pygments 2.17.2

pyhash 0.9.3

PyOpenGL 3.1.0

pyparsing 3.1.1

pyrender 0.1.45

python-dateutil 2.8.2

pytorch-lightning 1.8.6

pytz 2023.3.post1

PyYAML 6.0.1

regex 2023.10.3

requests 2.31.0

rich 13.7.0

safetensors 0.4.0

scikit-learn 1.3.2

scipy 1.10.1

sentence-transformers 2.2.2

sentencepiece 0.1.99

sentry-sdk 1.37.1

setproctitle 1.3.3

setuptools 57.5.0

six 1.16.0

smmap 5.0.1

tacto 0.0.3 /fuyujie/calvin/calvin_env/tacto

tenacity 8.2.3

tensorboardX 2.6.2.2

termcolor 2.3.0

threadpoolctl 3.2.0

tokenizers 0.15.0

torch 1.13.1

torchmetrics 1.2.0

torchvision 0.14.1

tqdm 4.66.1

transformers 4.35.2

trimesh 4.0.5

typing_extensions 4.8.0

tzdata 2023.3

urdfpy 0.0.22

urllib3 2.1.0

wandb 0.16.0

wheel 0.41.2

yarl 1.9.3

zipp 3.17.0


- command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm

  ### 1. Wandb error

  #### Error:

[2023-12-04 01:09:51,860][calvin_env.envs.play_table_env][INFO] - Using calvin_env with commit 1431a46bd36bde5903fb6345e68b5ccc30def666. [2023-12-04 01:09:51,861][calvin_agent.wrappers.calvin_env_wrapper][INFO] - Initialized PlayTableEnv for device cuda:0 [2023-12-04 01:09:51,876][calvin_agent.evaluation.multistep_sequences][INFO] - Start generating evaluation sequences. [2023-12-04 01:10:07,176][calvin_agent.evaluation.multistep_sequences][INFO] - Done generating evaluation sequences. [2023-12-04 01:10:07,180][calvin_agent.models.mcil][INFO] - Start validation epoch 0 Exception in thread IntMsgThr: Traceback (most recent call last): File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run self._target(*self._args, self._kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages self._loop_check_status( File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status local_handle = request() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 766, in deliver_internal_messages return self._deliver_internal_messages(internal_message) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 490, in _deliver_internal_messages return self._deliver_record(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record handle = mailbox._deliver_record(record, interface=self) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record interface._publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish 
self._sock_client.send_record_publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish self.send_server_request(server_req) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request Exception in thread NetStatThr: self._send_message(msg) Traceback (most recent call last): File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner self._sendall_with_error_handle(header + data) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle sent = self._sock.send(data) self.run() BrokenPipeError: [Errno 32] Broken pipe File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run self._target(*self._args, *self._kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status self._loop_check_status( File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status local_handle = request() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 758, in deliver_network_status return self._deliver_network_status(status) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _deliver_network_status return self._deliver_record(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record handle = mailbox._deliver_record(record, interface=self) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record 
interface._publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish self._sock_client.send_record_publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish self.send_server_request(server_req) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request self._send_message(msg) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message self._sendall_with_error_handle(header + data) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle sent = self._sock.send(data) BrokenPipeError: [Errno 32] Broken pipe Exception in thread ChkStopThr: Traceback (most recent call last): File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run self._target(self._args, self._kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status self._loop_check_status( File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status local_handle = request() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 750, in deliver_stop_status return self._deliver_stop_status(status) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 468, in _deliver_stop_status return self._deliver_record(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record handle = mailbox._deliver_record(record, 
interface=self) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record interface._publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish self._sock_client.send_record_publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish self.send_server_request(server_req) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request self._send_message(msg) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message self._sendall_with_error_handle(header + data) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle sent = self._sock.send(data) BrokenPipeError: [Errno 32] Broken pipe Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm'] Traceback (most recent call last): File "training.py", line 68, in train trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit call._call_and_handle_interrupt( File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt return trainer_fn(*args, kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl self._run(model, ckpt_path=self.ckpt_path) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run results = 
self._run_stage() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage self._run_train() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train self._run_sanity_check() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check val_loop.run() File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run self.advance(*args, *kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run self.advance(args, kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance output = self._evaluation_step(kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step output = self.trainer._call_strategy_hook(hook_name, kwargs.values()) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook output = fn(args, kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step return self.model.validation_step(*args, kwargs) File "/fuyujie/calvin/calvin_models/calvin_agent/models/mcil.py", line 345, in validation_step else self.language_goal(dataset_batch["lang"]) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1215, in _call_impl hook_result = hook(self, 
input, result) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/wandb_torch.py", line 349, in after_forwardhook wandb.run.summary["graph%i" % graph_idx] = self File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 52, in setitem self.update({key: val}) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 74, in update self._update(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 128, in _update self._update_callback(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn return func(self, *args, *kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1388, in _summary_update_callback self._backend.interface.publish_summary(summary_record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 259, in publish_summary pb_summary_record = self._make_summary(summary_record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 237, in _make_summary json_value = self._summary_encode(item.value, path_from_root) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 210, in _summary_encode val_to_json(self._run, path_from_root, value, namespace="summary") File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/utils.py", line 164, in val_to_json val.bind_to_run(run, key, namespace) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/data_types.py", line 1452, in bind_to_run super().bind_to_run(args, kwargs) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/base_types/media.py", line 134, in bind_to_run _datatypes_callback(media_path) File 
"/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/_globals.py", line 19, in _datatypes_callback _glob_datatypes_callback(fname) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1417, in _datatypes_callback self._backend.interface.publish_files(files) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 276, in publish_files self._publish_files(files) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _publish_files self._publish(rec) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish self._sock_client.send_record_publish(record) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish self.send_server_request(server_req) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request self._send_message(msg) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message self._sendall_with_error_handle(header + data) File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle sent = self._sock.send(data) BrokenPipeError: [Errno 32] Broken pipe


  #### Attempted method:

  ① Because I'm in China, I run the Clash proxy on my server, so at first I suspected a network problem. To check, I tried the demo from the official wandb website:
```python
import random
import wandb

wandb.login()

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 🐝 1️⃣ Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="basic-intro", 
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}", 
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })

  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset

      # 🐝 2️⃣ Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})

  # Mark the run as finished
  wandb.finish()
```

This demo runs without problems.

② Then I modified training.py and commented out the two places that use the logger:

[Image: 1]

Training then started successfully. Epoch 0 ran fine, but during epoch 1 it became slower and slower, and once the progress bar reached 100% it hung there (for at least 15 minutes):

[Image: 2]

### 2. Multi-GPU error

command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm trainer.devices=-1

error:

[rank: 0] Global seed set to 42

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).

[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).

[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.

Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm', 'trainer.devices=-1']

Traceback (most recent call last):

 File "training.py", line 68, in train

  trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit

  call._call_and_handle_interrupt(

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt

  return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch

  return function(*args, **kwargs)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl

  self._run(model, ckpt_path=self.ckpt_path)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run

  self.strategy.setup_environment()

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment

  self.setup_distributed()

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed

  _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection

  torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group

  store, rank, world_size = next(rendezvous_iterator)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler

  store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)

 File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store

  return TCPStore(

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

/root/miniconda3/envs/calvin/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown

 warnings.warn('resource_tracker: There appear to be %d '

Thanks so much for your attention and help!

lukashermann commented 11 months ago
  1. This doesn't seem to be caused by calvin. Did you try running wandb in dryrun mode? Setting the environment variable WANDB_MODE="dryrun" should turn off the sync. Alternatively, you can use the tensorboard logger by adding the argument logger=tb_logger when you start training.

By default, rollout callbacks are enabled and run during validation, which could be why training seemed to get stuck. Try disabling all rollout callbacks by setting the arguments ~callbacks/rollout and ~callbacks/rollout_lh. I also recommend not using the shared memory dataloader when debugging, so set datamodule/datasets=vision_lang as well.
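The suggestions above can be combined into a single debugging run. The following is only a sketch: `build_debug_run` is a hypothetical helper name, the data path is the one from the original command, and applying `WANDB_MODE` and `logger=tb_logger` together is an assumption (they were offered as alternatives, and either one alone may be enough).

```python
import os
import sys


def build_debug_run(root_data_dir: str):
    """Return (command, environment) for a debugging run of training.py.

    Hypothetical helper: it only assembles the overrides mentioned in this
    thread and does not start the training itself.
    """
    cmd = [
        sys.executable,
        "training.py",
        f"datamodule.root_data_dir={root_data_dir}",
        "datamodule/datasets=vision_lang",  # no shared-memory dataloader
        "logger=tb_logger",                 # TensorBoard logger instead of wandb
        "~callbacks/rollout",               # disable the rollout callbacks
        "~callbacks/rollout_lh",
    ]
    # Copy the current environment and switch off wandb syncing.
    env = dict(os.environ, WANDB_MODE="dryrun")
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_run("/fuyujie/calvin/dataset/calvin_debug_dataset")
    print(" ".join(cmd))
    print("WANDB_MODE =", env["WANDB_MODE"])
```

To actually launch the run, pass the result to `subprocess.run(cmd, env=env, check=True)` from the directory that contains training.py. Passing the overrides as a list avoids any shell interpretation of the leading `~`.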

  2. This again doesn't seem to be caused by our code. Did you successfully run other PyTorch projects with distributed training using ddp?
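The `errno: 98 - Address already in use` lines in the multi-GPU log mean that something was already bound to port 50001 when rank 0 tried to open its TCP store. A minimal sketch for checking this with the Python standard library follows. `is_port_free` and `find_free_port` are hypothetical helper names, and the idea of exporting a different port as `MASTER_PORT` assumes that the launcher in this setup picks the port up from that environment variable, which the thread does not confirm.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux, as in the log above.
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = 50001  # the port named in the error message
    if is_port_free(port):
        print(f"port {port} is free")
    else:
        print(f"port {port} is in use; a free alternative is {find_free_port()}")
```

If the port turns out to be occupied, a leftover process from an earlier crashed run is one possible cause worth ruling out before changing the port. Note that a port reported as free can still be taken by another process before the training starts.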