uzh-rpg / RVT

Implementation of "Recurrent Vision Transformers for Object Detection with Event Cameras". CVPR 2023
MIT License

`Trainer.fit` stopped: `max_steps=400000` reached. #40

Open zhibeiyou135 opened 8 months ago

zhibeiyou135 commented 8 months ago

Epoch 2: : 17671it [1:05:34, 4.49it/s, loss=2.15, v_num=3hqp]
wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 25832it [1:35:47, 4.49it/s, loss=2.23, v_num=3hqp]
wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 115816it [7:02:37, 4.57it/s, loss=2.1, v_num=3hqp]
Epoch 2, global step 400000: 'val/AP' was not in top 1
self._num_logged_artifact() = 1
num_ckpt_logged_before = 1
num_new_cktps = 1
Trainer.fit stopped: max_steps=400000 reached.
Epoch 2: : 115816it [7:03:13, 4.56it/s, loss=2.1, v_num=3hqp]
wandb: Waiting for W&B process to finish... (success).

With the provided code, training reached max_steps during epoch 2 (see the log above). Is there a problem somewhere? What should I do if I want to train for more epochs?

magehrig commented 8 months ago

Hi @zhibeiyou135

The config sets the maximum number of training steps to 400k, so training stopped as configured. The epoch counter is misleading: the model has actually seen batch_size times as many epochs as the terminal shows. This is due to how data loading works here: https://github.com/uzh-rpg/RVT/blob/af1786cd987e25dc4d78392ad36cdabc4adeea2c/data/utils/stream_concat_datapipe.py#L70-L72
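As a rough illustration of the relationship described above, the sketch below converts the epoch count shown in the terminal into the number of epochs actually seen. The function name `effective_epochs` and the batch size of 8 are assumptions made for the example and are not taken from the repository.

```python
def effective_epochs(displayed_epochs: float, batch_size: int) -> float:
    """Epochs actually seen, given the epoch count shown in the terminal.

    Follows the rule of thumb stated in this thread: the data seen is
    batch_size times the displayed number of epochs. Hypothetical helper,
    not part of the RVT code base.
    """
    if displayed_epochs < 0 or batch_size < 1:
        raise ValueError("displayed_epochs must be >= 0 and batch_size >= 1")
    return displayed_epochs * batch_size


if __name__ == "__main__":
    # Assumed values for illustration: 3 displayed epochs, batch size 8.
    print(effective_epochs(3, 8))
```

With these assumed values, three displayed epochs correspond to 24 epochs actually seen.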

If you want to train for more iterations, just increase max_steps in the config to the value you want.
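Before choosing a larger max_steps, it can help to estimate how long the run would take. The following is only a back-of-the-envelope sketch: it assumes that one progress-bar iteration corresponds to one training step and that throughput stays near the roughly 4.56 it/s reported in the log above. The function name `estimated_hours` is hypothetical.

```python
def estimated_hours(max_steps: int, iterations_per_second: float) -> float:
    """Approximate wall-clock hours needed to reach max_steps.

    Assumes one progress-bar iteration equals one training step and a
    constant throughput; both are simplifications.
    """
    if iterations_per_second <= 0:
        raise ValueError("iterations_per_second must be positive")
    return max_steps / iterations_per_second / 3600.0


if __name__ == "__main__":
    # 400000 steps is the configured limit; 4.56 it/s is read off the log.
    print(f"{estimated_hours(400_000, 4.56):.1f} h")
    # Doubling max_steps doubles the estimate under the same assumptions.
    print(f"{estimated_hours(800_000, 4.56):.1f} h")
```

Under these assumptions the estimate scales linearly with max_steps, so doubling the limit doubles the expected training time.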