zhibeiyou135 opened 8 months ago
You can see what the issue is from the error:

```
File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
```
I.e., one of the three attributes used in this line is None for some reason.
1) Which wandb version are you using (the one in the installation instructions is 0.14.0)?
2) Change the line to `runpath = experiment.entity + '/' + experiment.project + '/' + experiment.id` (see the sketch after this list). Do you still encounter the same issue after this change?
3) Can you figure out which of the three attributes is None?
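For reference, here is a minimal sketch of the change from 2) combined with a probe for 3). It assumes the logger keeps the wandb Run returned by `wandb.init()` in `self._experiment` and resolves the public run through `wandb.Api()`; the names mirror the traceback, not necessarily the exact code:

```python
import wandb

# Hypothetical rewrite of _get_public_run in loggers/wandb_logger.py.
# Assumption: self._experiment is the wandb Run created by wandb.init().
def _get_public_run(self):
    experiment = self._experiment
    # Use the public properties instead of the private
    # _entity/_project/_run_id attributes.
    parts = (experiment.entity, experiment.project, experiment.id)
    if any(part is None for part in parts):
        # Probe for 3): report which of the three attributes is None.
        raise RuntimeError(
            f'cannot build run path: entity={parts[0]!r}, '
            f'project={parts[1]!r}, id={parts[2]!r}'
        )
    # wandb.Api().run() expects an "entity/project/run_id" path.
    return wandb.Api().run(path='/'.join(parts))
```

If `entity` turns out to be the None value, the run was most likely never attached to a W&B account, e.g. because syncing is set to offline.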
I encounter the same problem. How did you solve it? Is it a wandb version error or something else?
Below are the commands I ran and the full console output from my training run.
```bash
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ DATA_DIR=/home/pe/projects/yxl/gen1
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ MDL_CFG=tiny
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ GPU_IDS=0
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ BATCH_SIZE_PER_GPU=8
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ TRAIN_WORKERS_PER_GPU=6
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ EVAL_WORKERS_PER_GPU=2
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ python train.py model=rnndet dataset=gen1 dataset.path=${DATA_DIR} wandb.project_name=RVT \
    wandb.group_name=gen1 +experiment/gen1=${MDL_CFG}.yaml hardware.gpus=${GPU_IDS} \
    batch_size.train=${BATCH_SIZE_PER_GPU} batch_size.eval=${BATCH_SIZE_PER_GPU} \
    hardware.num_workers.train=${TRAIN_WORKERS_PER_GPU} hardware.num_workers.eval=${EVAL_WORKERS_PER_GPU}
```
```
Disabling PL seed everything because of unresolved issues with shuffling during training on streaming datasets
new run: generating id zee59lta
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id zee59lta.
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used.
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used.
[Train] Local batch size for: stream sampling: 4, random sampling: 4
[Train] Local num workers for: stream sampling: 3, random sampling: 3
creating rnd access train datasets: 1458it [00:01, 1117.94it/s]
creating streaming train datasets: 1458it [00:03, 410.78it/s]
num_full_sequences=317
num_splits=1141
num_split_sequences=5492
creating streaming val datasets: 429it [00:00, 1079.97it/s]
num_full_sequences=429
num_splits=0
num_split_sequences=0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name           | Type          | Params
--------------------------------------------
0 | mdl            | YoloXDetector | 4.4 M
1 | mdl.backbone   | RNNDetector   | 3.2 M
2 | mdl.fpn        | YOLOPAFPN     | 710 K
3 | mdl.yolox_head | YOLOXHead     | 474 K
--------------------------------------------
4.4 M     Trainable params
0         Non-trainable params
4.4 M     Total params
8.810     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%| | 0/2 [00:00<?, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Epoch 0: : 0it [00:00, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Epoch 0: : 248it [01:01, 4.02it/s, loss=16.3, v_num=9lta]
[2024-01-28 00:22:02,035][urllib3.connectionpool][WARNING] - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 297it [01:12, 4.11it/s, loss=14.1, v_num=9lta]
[2024-01-28 00:22:12,484][urllib3.connectionpool][WARNING] - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 344it [01:22, 4.18it/s, loss=14, v_num=9lta]
[2024-01-28 00:22:22,654][urllib3.connectionpool][WARNING] - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 144434it [8:35:11, 4.67it/s, loss=2.73, v_num=9lta]
creating index...
index created!
Validation DataLoader 0: : 2342it [03:33, 10.97it/s]
Loading and preparing results...
DONE (t=0.16s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.08s).
Accumulating evaluation results...
DONE (t=1.00s).
Epoch 0: : 144434it [8:35:17, 4.67it/s, loss=2.73, v_num=9lta]
Epoch 0, global step 142092: 'val/AP' reached 0.42821 (best 0.42821), saving model to '/home/pe/projects/yxl/RVT/RVT-master/RVT/zee59lta/checkpoints/epoch000step142092val_AP0.43.ckpt' as top 1
Error executing job with overrides: ['model=rnndet', 'dataset=gen1', 'dataset.path=/home/pe/projects/yxl/gen1', 'wandb.project_name=RVT', 'wandb.group_name=gen1', '+experiment/gen1=tiny.yaml', 'hardware.gpus=0', 'batch_size.train=8', 'batch_size.eval=8', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 369, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 650, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 701, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 381, in _save_checkpoint
    logger.after_save_checkpoint(proxy(self))
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 218, in after_save_checkpoint
    self._scan_and_log_checkpoints(checkpoint_callback, self._save_last and not self._save_last_only_final)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
    num_ckpt_logged_before = self._num_logged_artifact()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
    public_run = self._get_public_run()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pe/projects/yxl/RVT/RVT-master/train.py", line 144, in main
    trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 62, in _call_and_handle_interrupt
    logger.finalize("failed")
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 224, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback, self._save_last)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
    num_ckpt_logged_before = self._num_logged_artifact()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
    public_run = self._get_public_run()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb:                epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:        learning_rate ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁
wandb:  train/cls_loss_step █▅▄▅▃▃▃▄▃▃▁▂▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▁▂▁
wandb: train/conf_loss_step █▄▃▄▂▂▂▄▂▂▂▂▂▂▁▁▂▂▂▁▂▂▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb:  train/iou_loss_step █▆▄▅▄▃▃▅▃▃▁▃▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▂▂▁
wandb:   train/l1_loss_step ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:      train/loss_step █▅▃▄▃▃▂▄▃▃▁▂▂▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb:    train/num_fg_step ▁▄▅▄▅▆▅▄▆▅█▅▆▇▇█▇▆▅▇▅▇█▆▇▇▇▇██▇▆▇██▇█▇▇▇
wandb:  trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:               val/AP ▁
wandb:            val/AP_50 ▁
wandb:            val/AP_75 ▁
wandb:             val/AP_L ▁
wandb:             val/AP_M ▁
wandb:             val/AP_S ▁
wandb:
wandb: Run summary:
wandb:                epoch 0
wandb:        learning_rate 0.00013
wandb:  train/cls_loss_step 0.41309
wandb: train/conf_loss_step 0.94029
wandb:  train/iou_loss_step 1.3903
wandb:   train/l1_loss_step 0.0
wandb:      train/loss_step 2.74368
wandb:    train/num_fg_step 7.48387
wandb:  trainer/global_step 142094
wandb:               val/AP 0.42821
wandb:            val/AP_50 0.68434
wandb:            val/AP_75 0.44886
wandb:             val/AP_L 0.44424
wandb:             val/AP_M 0.49079
wandb:             val/AP_S 0.35797
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/pe/projects/yxl/RVT/RVT-master/wandb/offline-run-20240128_002051-zee59lta
wandb: Find logs at: ./wandb/offline-run-20240128_002051-zee59lta/logs
== Timing statistics ==
```
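The warning at the top of the log is the important clue: `resume` is ignored because W&B syncing is set to `offline`. An offline run is never registered with the W&B backend, so it is invisible to `wandb.Api()` and its `entity` is typically left as None, which is exactly what the string concatenation in `_get_public_run` trips over once the first checkpoint is saved. Below is a minimal sketch of a workaround, assuming the Run object exposes an `offline` flag (recent wandb versions, including 0.14.x, do) and that `_num_logged_artifact` counts previously logged checkpoint artifacts:

```python
# Hypothetical guard in loggers/wandb_logger.py; the method name mirrors
# the traceback, and logged_artifacts() is assumed to be what the
# original method counts.
def _num_logged_artifact(self) -> int:
    # Offline runs cannot be queried through the public API, so skip
    # the artifact bookkeeping entirely instead of crashing on
    # entity=None in _get_public_run().
    if getattr(self._experiment, 'offline', False):
        return 0
    public_run = self._get_public_run()
    return len(public_run.logged_artifacts())
```

With such a guard, offline training finishes without the checkpoint-artifact bookkeeping, and the run can still be uploaded afterwards with the `wandb sync` command printed at the end of the log.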