Open zhibeiyou135 opened 8 months ago
Hi @zhibeiyou135
The config sets the maximum number of training steps to 400k, which is why training stopped there. The epoch counter is misleading: you have actually seen batch_size times the number of epochs shown in the terminal. This comes from how data loading is implemented here: https://github.com/uzh-rpg/RVT/blob/af1786cd987e25dc4d78392ad36cdabc4adeea2c/data/utils/stream_concat_datapipe.py#L70-L72
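To make the undercounting concrete, here is a toy model in plain Python. It is not the repository's datapipe code. It assumes that each of the batch_size slots in a batch is fed by its own stream, and that each stream is an independently shuffled concatenation of all sequences. The names `zipped_streams` and `sequences` are made up for this illustration.

```python
import random
from collections import Counter


def zipped_streams(sequences, batch_size, seed=0):
    """Toy model: build `batch_size` streams and zip them into batches.

    Assumption (not taken from the repository): every stream is its own
    shuffled concatenation of the full list of sequences.
    """
    rng = random.Random(seed)
    streams = []
    for _ in range(batch_size):
        order = list(sequences)
        rng.shuffle(order)
        # Concatenate the shuffled sequences into one long stream of samples.
        streams.append([sample for seq in order for sample in seq])
    # One sample from each stream forms one batch.
    return list(zip(*streams))


# Four hypothetical sequences with five samples each.
sequences = [[f"seq{i}_t{t}" for t in range(5)] for i in range(4)]
batch_size = 3

batches = zipped_streams(sequences, batch_size)
counts = Counter(sample for batch in batches for sample in batch)

# One pass over the zipped streams is what the progress bar calls one epoch,
# yet every sample has been consumed `batch_size` times.
print(len(batches), set(counts.values()))
```

Under these assumptions, one displayed epoch corresponds to batch_size passes over the data, which matches the factor mentioned above.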
If you want to train for more iterations, just increase max_steps to the value you want.
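For picking a value, a rough estimate can be derived from the numbers in the log posted in this issue. The sketch below assumes that the displayed epoch index starts at 0, that all finished epochs have the same length, and that every progress-bar iteration is one optimizer step (no gradient accumulation, no validation batches counted). The helper names are hypothetical.

```python
def steps_per_displayed_epoch(global_step, epoch_index, iters_in_current_epoch):
    """Estimate optimizer steps per displayed epoch.

    Assumes `epoch_index` finished epochs of equal length precede the
    current one, and one optimizer step per progress-bar iteration.
    """
    steps_in_finished_epochs = global_step - iters_in_current_epoch
    return steps_in_finished_epochs / epoch_index


def max_steps_for(displayed_epochs, steps_per_epoch):
    """Steps needed to complete the requested number of displayed epochs."""
    return round(displayed_epochs * steps_per_epoch)


# Values read from the log: global step 400000 was reached in "Epoch 2"
# after 115816 iterations of that epoch.
per_epoch = steps_per_displayed_epoch(400_000, 2, 115_816)
new_max_steps = max_steps_for(6, per_epoch)

print(per_epoch)      # 142092.0
print(new_max_steps)  # 852552
```

With these assumptions, 400k steps cover a bit under three displayed epochs, and six displayed epochs would need max_steps of about 850k. Treat the result as a starting point rather than an exact figure.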
Epoch 2: : 17671it [1:05:34, 4.49it/s, loss=2.15, v_num=3hqp]
wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 25832it [1:35:47, 4.49it/s, loss=2.23, v_num=3hqp]
wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 115816it [7:02:37, 4.57it/s, loss=2.1, v_num=3hqp]
Epoch 2, global step 400000: 'val/AP' was not in top 1
self._num_logged_artifact() = 1
num_ckpt_logged_before = 1
num_new_cktps = 1
`Trainer.fit` stopped: `max_steps=400000` reached.
Epoch 2: : 115816it [7:03:13, 4.56it/s, loss=2.1, v_num=3hqp]
wandb: Waiting for W&B process to finish... (success).

The provided code reached max_steps after only two epochs. Is there a problem somewhere? If I want to train for more epochs, what should I do?