uzh-rpg / RVT

Implementation of "Recurrent Vision Transformers for Object Detection with Event Cameras". CVPR 2023
MIT License

Training stops after completing one epoch #39

Open zhibeiyou135 opened 8 months ago

zhibeiyou135 commented 8 months ago

Below are the commands I ran and the full console output from my training run.

```
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ DATA_DIR=/home/pe/projects/yxl/gen1
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ MDL_CFG=tiny
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ GPU_IDS=0
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ BATCH_SIZE_PER_GPU=8
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ TRAIN_WORKERS_PER_GPU=6
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ EVAL_WORKERS_PER_GPU=2
(rvt) pe@505-1-ubuntu20-04-5-lts:~/projects/yxl/RVT/RVT-master$ python train.py model=rnndet dataset=gen1 dataset.path=${DATA_DIR} wandb.project_name=RVT \
wandb.group_name=gen1 +experiment/gen1="${MDL_CFG}.yaml" hardware.gpus=${GPU_IDS} \
batch_size.train=${BATCH_SIZE_PER_GPU} batch_size.eval=${BATCH_SIZE_PER_GPU} \
hardware.num_workers.train=${TRAIN_WORKERS_PER_GPU} hardware.num_workers.eval=${EVAL_WORKERS_PER_GPU}
Using python-based detection evaluation
Set MaxViTRNN backbone (height, width) to (256, 320)
Set partition sizes: (8, 10)
Set num_classes=2 for detection head
------ Configuration ------
reproduce:
  seed_everything: null
  deterministic_flag: false
  benchmark: false
training:
  precision: 16
  max_epochs: 10000
  max_steps: 400000
  learning_rate: 0.0002
  weight_decay: 0
  gradient_clip_val: 1.0
  limit_train_batches: 1.0
  lr_scheduler:
    use: true
    total_steps: ${..max_steps}
    pct_start: 0.005
    div_factor: 20
    final_div_factor: 10000
validation:
  limit_val_batches: 1.0
  val_check_interval: null
  check_val_every_n_epoch: 1
batch_size:
  train: 8
  eval: 8
hardware:
  num_workers:
    train: 6
    eval: 2
  gpus: 0
  dist_backend: nccl
logging:
  ckpt_every_n_epochs: 1
  train:
    metrics:
      compute: false
      detection_metrics_every_n_steps: null
    log_model_every_n_steps: 5000
    log_every_n_steps: 500
    high_dim:
      enable: true
      every_n_steps: 5000
      n_samples: 4
  validation:
    high_dim:
      enable: true
      every_n_epochs: 1
      n_samples: 8
wandb:
  wandb_runpath: null
  artifact_name: null
  artifact_local_file: null
  resume_only_weights: false
  group_name: gen1
  project_name: RVT
dataset:
  name: gen1
  path: /home/pe/projects/yxl/gen1
  train:
    sampling: mixed
    random:
      weighted_sampling: false
    mixed:
      w_stream: 1
      w_random: 1
  eval:
    sampling: stream
  data_augmentation:
    random:
      prob_hflip: 0.5
      rotate:
        prob: 0
        min_angle_deg: 2
        max_angle_deg: 6
      zoom:
        prob: 0.8
        zoom_in:
          weight: 8
          factor:
            min: 1
            max: 1.5
        zoom_out:
          weight: 2
          factor:
            min: 1
            max: 1.2
    stream:
      prob_hflip: 0.5
      rotate:
        prob: 0
        min_angle_deg: 2
        max_angle_deg: 6
      zoom:
        prob: 0.5
        zoom_out:
          factor:
            min: 1
            max: 1.2
  ev_repr_name: stacked_histogram_dt=50_nbins=10
  sequence_length: 21
  resolution_hw:
  - 240
  - 304
  downsample_by_factor_2: false
  only_load_end_labels: false
model:
  name: rnndet
  backbone:
    name: MaxViTRNN
    compile:
      enable: false
      args:
        mode: reduce-overhead
    input_channels: 20
    enable_masking: false
    partition_split_32: 1
    embed_dim: 32
    dim_multiplier:
    - 1
    - 2
    - 4
    - 8
    num_blocks:
    - 1
    - 1
    - 1
    - 1
    T_max_chrono_init:
    - 4
    - 8
    - 16
    - 32
    stem:
      patch_size: 4
    stage:
      downsample:
        type: patch
        overlap: true
        norm_affine: true
      attention:
        use_torch_mha: false
        partition_size:
        - 8
        - 10
        dim_head: 32
        attention_bias: true
        mlp_activation: gelu
        mlp_gated: false
        mlp_bias: true
        mlp_ratio: 4
        drop_mlp: 0
        drop_path: 0
        ls_init_value: 1.0e-05
      lstm:
        dws_conv: false
        dws_conv_only_hidden: true
        dws_conv_kernel_size: 3
        drop_cell_update: 0
    in_res_hw:
    - 256
    - 320
  fpn:
    name: PAFPN
    compile:
      enable: false
      args:
        mode: reduce-overhead
    depth: 0.33
    in_stages:
    - 2
    - 3
    - 4
    depthwise: false
    act: silu
  head:
    name: YoloX
    compile:
      enable: false
      args:
        mode: reduce-overhead
    depthwise: false
    act: silu
    num_classes: 2
  postprocess:
    confidence_threshold: 0.1
    nms_threshold: 0.45
```

```
Disabling PL seed everything because of unresolved issues with shuffling during training on streaming datasets
new run: generating id zee59lta
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id zee59lta.
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
[Train] Local batch size for:
  stream sampling: 4
  random sampling: 4
[Train] Local num workers for:
  stream sampling: 3
  random sampling: 3
creating rnd access train datasets: 1458it [00:01, 1117.94it/s]
creating streaming train datasets: 1458it [00:03, 410.78it/s]
num_full_sequences=317
num_splits=1141
num_split_sequences=5492
creating streaming val datasets: 429it [00:00, 1079.97it/s]
num_full_sequences=429
num_splits=0
num_split_sequences=0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
```

```
  | Name           | Type          | Params
--------------------------------------------
0 | mdl            | YoloXDetector | 4.4 M
1 | mdl.backbone   | RNNDetector   | 3.2 M
2 | mdl.fpn        | YOLOPAFPN     | 710 K
3 | mdl.yolox_head | YOLOXHead     | 474 K
--------------------------------------------
4.4 M     Trainable params
0         Non-trainable params
4.4 M     Total params
8.810     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Epoch 0: : 0it [00:00, ?it/s]
/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
```

```
Epoch 0: : 248it [01:01, 4.02it/s, loss=16.3, v_num=9lta]
[2024-01-28 00:22:02,035][urllib3.connectionpool][WARNING] - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 297it [01:12, 4.11it/s, loss=14.1, v_num=9lta]
[2024-01-28 00:22:12,484][urllib3.connectionpool][WARNING] - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 344it [01:22, 4.18it/s, loss=14, v_num=9lta]
[2024-01-28 00:22:22,654][urllib3.connectionpool][WARNING] - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
Epoch 0: : 144434it [8:35:11, 4.67it/s, loss=2.73, v_num=9lta]
Validation DataLoader 0: : 2342it [03:33, 10.97it/s]
creating index...
index created!
Loading and preparing results...
DONE (t=0.16s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.08s).
Accumulating evaluation results...
DONE (t=1.00s).
Epoch 0: : 144434it [8:35:17, 4.67it/s, loss=2.73, v_num=9lta]
Epoch 0, global step 142092: 'val/AP' reached 0.42821 (best 0.42821), saving model to '/home/pe/projects/yxl/RVT/RVT-master/RVT/zee59lta/checkpoints/epoch000step142092val_AP0.43.ckpt' as top 1
Error executing job with overrides: ['model=rnndet', 'dataset=gen1', 'dataset.path=/home/pe/projects/yxl/gen1', 'wandb.project_name=RVT', 'wandb.group_name=gen1', '+experiment/gen1=tiny.yaml', 'hardware.gpus=0', 'batch_size.train=8', 'batch_size.eval=8', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 369, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 650, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 701, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 381, in _save_checkpoint
    logger.after_save_checkpoint(proxy(self))
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 218, in after_save_checkpoint
    self._scan_and_log_checkpoints(checkpoint_callback, self._save_last and not self._save_last_only_final)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
    num_ckpt_logged_before = self._num_logged_artifact()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
    public_run = self._get_public_run()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pe/projects/yxl/RVT/RVT-master/train.py", line 144, in main
    trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 62, in _call_and_handle_interrupt
    logger.finalize("failed")
  File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 224, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback, self._save_last)
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 268, in _scan_and_log_checkpoints
    num_ckpt_logged_before = self._num_logged_artifact()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 235, in _num_logged_artifact
    public_run = self._get_public_run()
  File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb:                 epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         learning_rate ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁
wandb:   train/cls_loss_step █▅▄▅▃▃▃▄▃▃▁▂▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▁▂▁
wandb:  train/conf_loss_step █▄▃▄▂▂▂▄▂▂▂▂▂▂▁▁▂▂▂▁▂▂▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb:   train/iou_loss_step █▆▄▅▄▃▃▅▃▃▁▃▃▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▂▁▂▂▁▁▁▂▁▂▂▁
wandb:    train/l1_loss_step ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:       train/loss_step █▅▃▄▃▃▂▄▃▃▁▂▂▂▂▁▂▂▃▁▃▁▁▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▁▂
wandb:     train/num_fg_step ▁▄▅▄▅▆▅▄▆▅█▅▆▇▇█▇▆▅▇▅▇█▆▇▇▇▇██▇▆▇██▇█▇▇▇
wandb:   trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:                val/AP ▁
wandb:             val/AP_50 ▁
wandb:             val/AP_75 ▁
wandb:              val/AP_L ▁
wandb:              val/AP_M ▁
wandb:              val/AP_S ▁
wandb:
wandb: Run summary:
wandb:                 epoch 0
wandb:         learning_rate 0.00013
wandb:   train/cls_loss_step 0.41309
wandb:  train/conf_loss_step 0.94029
wandb:   train/iou_loss_step 1.3903
wandb:    train/l1_loss_step 0.0
wandb:       train/loss_step 2.74368
wandb:     train/num_fg_step 7.48387
wandb:   trainer/global_step 142094
wandb:                val/AP 0.42821
wandb:             val/AP_50 0.68434
wandb:             val/AP_75 0.44886
wandb:              val/AP_L 0.44424
wandb:              val/AP_M 0.49079
wandb:              val/AP_S 0.35797
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/pe/projects/yxl/RVT/RVT-master/wandb/offline-run-20240128_002051-zee59lta
wandb: Find logs at: ./wandb/offline-run-20240128_002051-zee59lta/logs
== Timing statistics ==
```

magehrig commented 8 months ago

You can see what the issue is from the error:

File "/home/pe/projects/yxl/RVT/RVT-master/loggers/wandb_logger.py", line 229, in _get_public_run
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I.e. one of the three attributes accessed in this line is None for some reason.
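
A quick way to check which one it is would be a temporary debug print just above the failing line in `loggers/wandb_logger.py` (a debugging sketch, not code from the repository; `experiment` is the wandb run object already available in that method):

```python
# Hypothetical debugging lines, placed directly above
# loggers/wandb_logger.py:229 inside _get_public_run().
# They show which of the three attributes is None when the crash happens.
print("entity: ", repr(experiment._entity))
print("project:", repr(experiment._project))
print("run_id: ", repr(experiment._run_id))
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
```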

1) Which wandb version are you using (the one in the installation is 0.14.0)?
2) Change the line to `runpath = experiment.entity + '/' + experiment.project + '/' + experiment.id`, as sketched below. Do you still encounter the same issue after this change?
3) Can you figure out which of the three attributes is None?
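
For point 2, a minimal sketch of how the replaced line around `loggers/wandb_logger.py:229` could look, assuming the rest of `_get_public_run` is unchanged and adding a guard that names the missing value instead of crashing with the opaque `TypeError`:

```python
# Sketch only: replacement for loggers/wandb_logger.py:229, assuming
# `experiment` is the wandb run object used by the logger. Uses the public
# properties (entity/project/id) instead of the private _entity/_project/
# _run_id attributes, and raises an explicit error if any of them is unset.
entity, project, run_id = experiment.entity, experiment.project, experiment.id
if None in (entity, project, run_id):
    raise RuntimeError(
        f"Cannot build W&B run path: entity={entity!r}, "
        f"project={project!r}, run_id={run_id!r}"
    )
runpath = entity + '/' + project + '/' + run_id
```

Note that the run in the log above is in W&B offline mode, which may be why one of these values (typically the entity) is unset.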

leafyseay commented 5 months ago

I encountered the same problem. How did you solve it? Is it a wandb version error or something else?