Open 2488583886 opened 11 months ago
Setting WANDB_MODE="dryrun" should turn off the sync. Alternatively, you can use the TensorBoard logger by adding the argument logger=tb_logger when you start a training run. By default, rollout callbacks are enabled and run during validation, which could be why training seemed to get stuck. Try disabling all rollout callbacks by setting the arguments ~callbacks/rollout and ~callbacks/rollout_lh. I also recommend not using the shared memory dataloader when debugging, so also set datamodule/datasets=vision_lang.
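The suggestions above can be combined into a single debug invocation. The following is only a minimal sketch: the helper name build_debug_command is hypothetical, the dataset path is the one used elsewhere in this issue, and the command is assembled and printed rather than executed.

```python
import os
import shlex

# Hydra overrides taken from the suggestions above. The dataset path is the
# one from this issue; adjust it to your own setup.
OVERRIDES = [
    "datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset",
    "datamodule/datasets=vision_lang",  # avoid the shared memory dataloader
    "~callbacks/rollout",               # disable the rollout callback
    "~callbacks/rollout_lh",            # disable the second rollout callback
]


def build_debug_command(extra_overrides=()):
    """Return (argv, env) for a debug run of training.py (hypothetical helper)."""
    # Copy the current environment and switch off W&B syncing for this run only.
    env = dict(os.environ, WANDB_MODE="dryrun")
    argv = ["python", "training.py", *OVERRIDES, *extra_overrides]
    return argv, env


argv, env = build_debug_command()
# shlex.join quotes arguments such as "~callbacks/rollout" so the shell does
# not try to expand the leading tilde.
print("WANDB_MODE=" + env["WANDB_MODE"], shlex.join(argv))
# To actually launch the run: subprocess.run(argv, env=env, check=True)
```

To try the TensorBoard logger instead, the same helper can be called as build_debug_command(["logger=tb_logger"]).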
Let me describe the problems I encountered and what I tried in order to solve them.
Environment:
aiosignal 1.3.1
antlr4-python3-runtime 4.8
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
Calvin 0.0.1 /fuyujie/calvin/calvin_models
calvin-env 0.0.1 /fuyujie/calvin/calvin_env
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.18.4
colorlog 6.7.0
contourpy 1.1.1
cycler 0.12.1
decorator 4.4.2
docker-pycreds 0.4.0
filelock 3.13.1
fonttools 4.45.1
freetype-py 2.4.0
frozenlist 1.4.0
fsspec 2023.10.0
gitdb 4.0.11
GitPython 3.1.40
gym 0.26.2
gym-notices 0.0.8
huggingface-hub 0.19.4
hydra-colorlog 1.2.0
hydra-core 1.1.1
idna 3.6
imageio 2.33.0
imageio-ffmpeg 0.4.9
importlib-metadata 6.8.0
importlib-resources 6.1.1
joblib 1.3.2
kiwisolver 1.4.5
lightning-lite 1.8.6
lightning-utilities 0.10.0
llvmlite 0.41.1
lxml 4.9.3
markdown-it-py 3.0.0
matplotlib 3.7.4
mdurl 0.1.2
moviepy 1.0.3
MulticoreTSNE 0.1
multidict 6.0.4
networkx 2.2
nltk 3.8.1
numba 0.58.1
numpy 1.24.4
numpy-quaternion 2022.4.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
omegaconf 2.1.2
opencv-python 4.8.1.78
packaging 23.2
pandas 2.0.3
Pillow 10.1.0
pip 23.3.1
plotly 5.18.0
proglog 0.1.10
protobuf 4.25.1
psutil 5.9.6
pybullet 3.2.5
pycollada 0.6
pycparser 2.21
pyglet 2.0.10
Pygments 2.17.2
pyhash 0.9.3
PyOpenGL 3.1.0
pyparsing 3.1.1
pyrender 0.1.45
python-dateutil 2.8.2
pytorch-lightning 1.8.6
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
requests 2.31.0
rich 13.7.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentence-transformers 2.2.2
sentencepiece 0.1.99
sentry-sdk 1.37.1
setproctitle 1.3.3
setuptools 57.5.0
six 1.16.0
smmap 5.0.1
tacto 0.0.3 /fuyujie/calvin/calvin_env/tacto
tenacity 8.2.3
tensorboardX 2.6.2.2
termcolor 2.3.0
threadpoolctl 3.2.0
tokenizers 0.15.0
torch 1.13.1
torchmetrics 1.2.0
torchvision 0.14.1
tqdm 4.66.1
transformers 4.35.2
trimesh 4.0.5
typing_extensions 4.8.0
tzdata 2023.3
urdfpy 0.0.22
urllib3 2.1.0
wandb 0.16.0
wheel 0.41.2
yarl 1.9.3
zipp 3.17.0
[2023-12-04 01:09:51,860][calvin_env.envs.play_table_env][INFO] - Using calvin_env with commit 1431a46bd36bde5903fb6345e68b5ccc30def666.
[2023-12-04 01:09:51,861][calvin_agent.wrappers.calvin_env_wrapper][INFO] - Initialized PlayTableEnv for device cuda:0
[2023-12-04 01:09:51,876][calvin_agent.evaluation.multistep_sequences][INFO] - Start generating evaluation sequences.
[2023-12-04 01:10:07,176][calvin_agent.evaluation.multistep_sequences][INFO] - Done generating evaluation sequences.
[2023-12-04 01:10:07,180][calvin_agent.models.mcil][INFO] - Start validation epoch 0
Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
    self._loop_check_status(
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 766, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 490, in _deliver_internal_messages
    return self._deliver_record(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
    self._loop_check_status(
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 758, in deliver_network_status
    return self._deliver_network_status(status)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _deliver_network_status
    return self._deliver_record(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
    self._loop_check_status(
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 750, in deliver_stop_status
    return self._deliver_stop_status(status)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 468, in _deliver_stop_status
    return self._deliver_record(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm']
Traceback (most recent call last):
  File "training.py", line 68, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
    self._run_sanity_check()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
    val_loop.run()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/fuyujie/calvin/calvin_models/calvin_agent/models/mcil.py", line 345, in validation_step
    else self.language_goal(dataset_batch["lang"])
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1215, in _call_impl
    hook_result = hook(self, input, result)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/wandb_torch.py", line 349, in after_forward_hook
    wandb.run.summary["graph_%i" % graph_idx] = self
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 52, in __setitem__
    self.update({key: val})
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 74, in update
    self._update(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 128, in _update
    self._update_callback(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1388, in _summary_update_callback
    self._backend.interface.publish_summary(summary_record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 259, in publish_summary
    pb_summary_record = self._make_summary(summary_record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 237, in _make_summary
    json_value = self._summary_encode(item.value, path_from_root)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 210, in _summary_encode
    val_to_json(self._run, path_from_root, value, namespace="summary")
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/utils.py", line 164, in val_to_json
    val.bind_to_run(run, key, namespace)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/data_types.py", line 1452, in bind_to_run
    super().bind_to_run(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/base_types/media.py", line 134, in bind_to_run
    _datatypes_callback(media_path)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/_globals.py", line 19, in _datatypes_callback
    _glob_datatypes_callback(fname)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1417, in _datatypes_callback
    self._backend.interface.publish_files(files)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 276, in publish_files
    self._publish_files(files)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _publish_files
    self._publish(rec)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
And it works well
② Then I tried modifying training.py.
I commented out two places related to the logger:
After that, training started successfully. However, once epoch 1 began (epoch 0 was fine), it got slower and slower, and when it reached 100% it hung there (for at least 15 minutes), like this:
2. multi GPU error
command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm trainer.devices=-1
error:
Thanks so much for your attention and help!