microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Can't resume DiT training #1269

Open carzacc opened 1 year ago

carzacc commented 1 year ago

I am using DiT and trying to fine-tune it for layout analysis (object detection) on a dataset other than PubLayNet; the end goal is to fine-tune it to go beyond its current classification capabilities.

The problem arises when resuming training: whether I use the config in object_detection or the generated one in the output directory, I get warnings like the following:

WARNING [08/26 17:26:55 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [08/26 17:26:55 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.bottom_up.backbone.backbone.fpn_lateral2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral5.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output5.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.conv.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.objectness_logits.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.anchor_deltas.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.cls_score.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.bbox_pred.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn3.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn4.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.deconv.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.predictor.{bias, weight}

When training then starts, the loss is very high (seemingly random, between 2 and 5) instead of close to 0.5, which is where I had left it.

I need to be able to resume training because the policies of the university cluster I am using don't allow long training sessions.
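
For anyone hitting the same thing, the mismatch is easy to confirm outside the trainer by diffing the saved keys against the model's own state dict. This is only a sketch: the checkpoint path is a placeholder and `build_model` is detectron2's standard model builder, used the same way the training script builds the detection model.

```python
import torch
from detectron2.modeling import build_model

def compare_checkpoint_to_model(model, path):
    """Print which saved keys the model ignores and which model parameters
    are missing from the checkpoint, mirroring the fvcore warnings above."""
    ckpt = torch.load(path, map_location="cpu")
    saved = set(ckpt["model"].keys()) if "model" in ckpt else set(ckpt.keys())
    expected = set(model.state_dict().keys())
    print("In the checkpoint but unused by the model:")
    for k in sorted(saved - expected):
        print(" ", k)
    print("In the model but missing from the checkpoint:")
    for k in sorted(expected - saved):
        print(" ", k)

# e.g. with the cfg built by the training script:
# compare_checkpoint_to_model(build_model(cfg), "output/model_final.pth")
```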

carzacc commented 1 year ago

By the way, I have opened PR #1242, which corrects one of the training examples in the README.

carzacc commented 1 year ago

https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/dit/object_detection/ditod/mycheckpointer.py#L199C1-L205C10

By not appending that prefix I managed to fix the issue and resume training correctly. Why is that prefix added there?
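
If I understand those lines correctly, they prepend backbone.bottom_up.backbone. to the checkpoint keys so that a backbone-only pre-training checkpoint lands under the detection model's bottom-up backbone. A resume checkpoint already contains the full-model keys, so the prefix mangles them. A sketch of the kind of guard I mean (not the actual repo code; the key prefixes are taken from the warnings above):

```python
def maybe_add_prefix(state_dict, prefix="backbone.bottom_up.backbone."):
    """Only re-key a backbone-only (pre-training) checkpoint; leave a full
    detection checkpoint untouched so resuming works. Sketch only."""
    # Detection-head keys are only present in a full-model checkpoint.
    has_full_model_keys = any(
        k.startswith(("backbone.fpn_", "proposal_generator.", "roi_heads."))
        for k in state_dict
    )
    if has_full_model_keys:
        return state_dict
    # Backbone-only checkpoint: move its weights under the bottom-up backbone.
    return {prefix + k: v for k, v in state_dict.items()}
```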