rawalkhirodkar / MIPNet

Apache License 2.0

Question about Training #11

Closed lutianhao closed 2 years ago

lutianhao commented 2 years ago

Hi, I tried to train on my COCO-style dataset with your scripts, but I don't know which bash script should be used for training. (Could you please briefly explain the function of each script?) I then used "scripts/train/lambda/coco/train.sh" for training, but an error occurred.

```shell
cd /data_2/lutianhao/code/MIPNet/
CUDA_VISIBLE_DEVICES=4,5,6,7, python tools/lambda/train_lambda_real.py \
    --cfg experiments/coco/hrnet/w48_384x288_adam_lr1e-3.yaml \
    GPUS '(0,1,2,3,)' \
    OUTPUT_DIR 'Outputs/outputs/lambda/lambda_coco_real_waffle' \
    LOG_DIR 'Outputs/logs/lambda/lambda_coco_real_waffle' \
    TEST.MODEL_FILE 'models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth' \
    DATASET.TRAIN_DATASET 'coco_lambda' \
    DATASET.TRAIN_SET 'train2017' \
    DATASET.TRAIN_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/train2017' \
    DATASET.TRAIN_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_train2017.json' \
    DATASET.TRAIN_DATASET_TYPE 'coco_lambda' \
    DATASET.TEST_DATASET 'coco' \
    DATASET.TEST_SET 'val2017' \
    DATASET.TEST_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/val2017' \
    DATASET.TEST_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_val2017.json' \
    DATASET.TEST_DATASET_TYPE 'coco' \
    TRAIN.LR 0.001 \
    TRAIN.BEGIN_EPOCH 0 \
    TRAIN.END_EPOCH 110 \
    TRAIN.LR_STEP '(70, 100)' \
    TRAIN.BATCH_SIZE_PER_GPU 2 \
    TEST.BATCH_SIZE_PER_GPU 1 \
    TEST.USE_GT_BBOX True \
    EPOCH_EVAL_FREQ 1 \
    PRINT_FREQ 100 \
    MODEL.NAME 'pose_hrnet_se_lambda' \
    MODEL.SE_MODULES '[False, False, True, True]'
```

And the error is :

```
GAMMA1: 0.99
GAMMA2: 0.0
LR: 0.001
LR_FACTOR: 0.1
LR_STEP: [70, 100]
MOMENTUM: 0.9
NESTEROV: False
OPTIMIZER: adam
RESUME: False
SHUFFLE: True
WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w48-8ef0771d.pth
```

Total Parameters: 63,746,081

Total Multiply Adds (For Convolution and Linear Layers only): 46.562052726745605 GFLOPs

```
Number of Layers
Conv2d : 293 layers
BatchNorm2d : 292 layers
ReLU : 271 layers
Bottleneck : 4 layers
BasicBlock : 104 layers
Upsample : 28 layers
HighResolutionModule : 8 layers
AdaptiveAvgPool2d : 5 layers
Linear : 20 layers
Sigmoid : 10 layers
BatchNorm1d : 5 layers
SELambdaLayer : 5 layers
SELambdaModule : 2 layers
=> loading model from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> loading from latest_state_dict at models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
loading annotations into memory...
Done (t=31.87s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 118287
loading from cache from cache/coco_lambda/train2017/gt_db.pkl
done!
=> load 149813 samples
loading annotations into memory...
Done (t=4.04s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 5000
=> load 6352 samples
=> resuming optimizer from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> updated lr schedule is [70, 100]
```

```
training on lambda
Epoch: [0][0/18727]  Time 64.338s (64.338s)  Speed 0.2 samples/s  Data 10.114s (10.114s)
Loss 0.00020 (0.00020)  Accuracy 0.513 (0.513)  model_grad 0.000568 (0.000568)
DivLoss -0.00074 (-0.00074)  PoseLoss 0.00020 (0.00020)
Traceback (most recent call last):
  File "tools/lambda/train_lambda_real.py", line 280, in <module>
    main()
  File "tools/lambda/train_lambda_real.py", line 242, in main
    final_output_dir, tb_log_dir, writer_dict, print_prefix='lambda')
  File "/data_2/lutianhao/code/MIPNet/tools/lambda/../../lib/core/train.py", line 464, in trainlambda
    suffix += '[{}:{}]'.format(count, round(lambda_a[count + B].item(), 2))
IndexError: index 16 is out of bounds for dimension 0 with size 16
```

lutianhao commented 2 years ago

I've solved this error, but I'd like to know the meaning of "size=16". Would you please explain a bit? Thanks!

rawalkhirodkar commented 2 years ago

Glad the error is resolved. The size of 16 is the number of samples visualized during training: https://github.com/rawalkhirodkar/MIPNet/blob/505c92ec59ac79686a217dac45eb188fc38b8499/lib/core/train.py#L464

It looks like the error was due to having a batch size smaller than 16; in that case, you can update this constant to something smaller.
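To illustrate the fix discussed above, here is a minimal sketch of making the logging loop robust to small batches: instead of a hard-coded 16, it clamps the number of visualized samples to what actually exists in `lambda_a`. This is not the repository's code; a plain list stands in for the torch tensor from the traceback (so `.item()` is dropped), and the names `lambda_a`, `B`, and `suffix` are borrowed from the traceback while `num_vis` is a hypothetical parameter standing in for the literal 16.

```python
def format_lambda_suffix(lambda_a, B, num_vis=16):
    """Build the '[idx:value]' log suffix without indexing past lambda_a.

    lambda_a: flat sequence of lambda values (a list here; a tensor in MIPNet).
    B: offset into lambda_a where the values to display begin.
    num_vis: maximum number of samples to show (the hard-coded 16 upstream).
    """
    suffix = ''
    # Iterate only over entries that actually exist past offset B,
    # so a batch smaller than num_vis no longer raises IndexError.
    for count in range(min(num_vis, len(lambda_a) - B)):
        suffix += '[{}:{}]'.format(count, round(lambda_a[count + B], 2))
    return suffix

# With only 16 lambda values and B = 8, at most 8 samples can be shown;
# the original hard-coded loop of 16 iterations would raise IndexError here.
print(format_lambda_suffix([0.5] * 16, B=8))
```

The same effect can be had by simply lowering the constant, as suggested above; clamping just makes the log line independent of the batch-size configuration.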