princeton-vl / CornerNet


The shape of the mask [4, 4, 128, 128] at index 1 does not match the shape of the indexed tensor [4, 80, 128, 128] #79

Open dmxj opened 5 years ago

dmxj commented 5 years ago

I'm trying to train on my own dataset, which has 4 categories in COCO format. I changed "categories" under "db" in config/CornerNet.json from 80 to 4, but it raises an error when I launch training. The whole training log is:

loading all datasets...
using 4 threads
loading from cache file: ./cache/coco_train.pkl
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
loading from cache file: ./cache/coco_train.pkl
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading from cache file: ./cache/coco_train.pkl
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading from cache file: ./cache/coco_train.pkl
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading from cache file: ./cache/coco_val.pkl
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
system config...
{'batch_size': 9, 'cache_dir': './cache', 'chunk_sizes': [4, 5], 'config_dir': './config', 'data_dir': '/root/dataset/coco_nls', 'data_rng': <mtrand.RandomState object at 0x7fe9526c8438>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 500000, 'nnet_rng': <mtrand.RandomState object at 0x7fe9526c8480>, 'opt_algo': 'adam', 'prefetch_size': 5, 'pretrain': None, 'result_dir': './results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CornerNet', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 100, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 4, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 100, 'weight_exp': 8}
len of db: 3335
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
start prefetching data...
module_file: models.CornerNet
shuffling indices...
total parameters: 201035212
setting learning rate to: 0.00025
training start...
0%| | 0/500000 [00:00<?, ?it/s]
/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
Traceback (most recent call last):
  File "train.py", line 195, in <module>
    train(training_dbs, validation_db, args.start_iter)
  File "train.py", line 137, in train
    training_loss = nnet.train(**training)
  File "/root/CornerNet/nnet/py_factory.py", line 81, in train
    loss = self.network(xs, ys)
  File "/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/CornerNet/models/py_utils/data_parallel.py", line 70, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/CornerNet/models/py_utils/data_parallel.py", line 80, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/CornerNet/nnet/py_factory.py", line 20, in forward
    loss = self.loss(preds, ys, **kwargs)
  File "/root/anaconda3/envs/CornerNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/CornerNet/models/py_utils/kp.py", line 289, in forward
    focal_loss += self.focal_loss(tl_heats, gt_tl_heat)
  File "/root/CornerNet/models/py_utils/kp_utils.py", line 160, in _neg_loss
    pos_pred = pred[pos_inds]
RuntimeError: The shape of the mask [4, 4, 128, 128] at index 1 does not match the shape of the indexed tensor [4, 80, 128, 128] at index 1

What should I do?
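For reference, the mismatch can be reproduced in isolation. This is a minimal sketch using stand-in tensors with the shapes from the error message: the heatmap head still predicts 80 channels (the COCO default) while the ground-truth mask was built for 4 categories, so the boolean indexing in `_neg_loss` fails.

```python
import torch

# Stand-in tensors with the shapes from the error message above.
pred = torch.rand(4, 80, 128, 128)          # network output, out_dim still 80
pos_inds = torch.zeros(4, 4, 128, 128) > 0  # gt mask built with categories = 4

try:
    pos_pred = pred[pos_inds]  # same indexing as in kp_utils._neg_loss
except (RuntimeError, IndexError) as e:
    print(e)  # "The shape of the mask [4, 4, 128, 128] at index 1 does not match ..."
```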

stasysp commented 5 years ago

If I understood correctly, there are 3 places where n_classes should be changed: out_dim in CornerNet.py, categories in CornerNet.json and self._configs["categories"] in detection.py. Have you achieved any results?
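A quick sanity check along those lines (a sketch assuming the stock repo layout; `NUM_CLASSES` is an illustrative name):

```python
import json

NUM_CLASSES = 4  # set to the number of classes in your dataset

# Check the JSON config. The same value must also reach out_dim in
# models/CornerNet.py and self._configs["categories"] in db/detection.py,
# otherwise the heatmap head keeps its 80 COCO channels and the focal-loss
# mask indexing fails as in the traceback above.
with open("config/CornerNet.json") as f:
    cfg = json.load(f)

assert cfg["db"]["categories"] == NUM_CLASSES, (
    "CornerNet.json still lists %d categories" % cfg["db"]["categories"]
)
print("categories in CornerNet.json:", cfg["db"]["categories"])
```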

qusongyun commented 5 years ago


Hello, have you successfully used this network to train your own dataset? I could not get a correct result; after training, all the boxes still looked like default boxes.

stasysp commented 5 years ago

Hello @qusongyun, I have the same problem; maybe my image preprocessing is wrong. The bboxes change a little from epoch to epoch, but they look the same for all pictures within each epoch.

qusongyun commented 5 years ago

> Hello @qusongyun, I have the same problem; maybe my image preprocessing is wrong. The bboxes change a little from epoch to epoch, but they look the same for all pictures within each epoch.

Maybe my iteration count (10 epochs) is too low (I have only 1 GPU with 12 GB), so it may not have converged yet. How many epochs did you train your dataset for?

stasysp commented 5 years ago

I have trained for 100 epochs with batch size 2; my GPU is 12 GB too. The train losses were decreasing, but the val and test losses were not, and the boxes looked the same for every picture within an epoch. I've tried to overfit the network but haven't achieved any results.

> Hello @qusongyun, I have the same problem; maybe my image preprocessing is wrong. The bboxes change a little from epoch to epoch, but they look the same for all pictures within each epoch.

> Maybe my iteration count (10 epochs) is too low (I have only 1 GPU with 12 GB), so it may not have converged yet. How many epochs did you train your dataset for?

qusongyun commented 5 years ago

> I have trained for 100 epochs with batch size 2; my GPU is 12 GB too. The train losses were decreasing, but the val and test losses were not, and the boxes looked the same for every picture within an epoch. I've tried to overfit the network but haven't achieved any results.

> Hello @qusongyun, I have the same problem; maybe my image preprocessing is wrong. The bboxes change a little from epoch to epoch, but they look the same for all pictures within each epoch.

> Maybe my iteration count (10 epochs) is too low (I have only 1 GPU with 12 GB), so it may not have converged yet. How many epochs did you train your dataset for?

Maybe there is some code that should be changed, but we cannot figure out what it is.

liben2018 commented 5 years ago

@stasysp @qusongyun, I guess the training code is fine, but the val and test code need to be changed. Maybe you can visualize the results using https://github.com/princeton-vl/CornerNet/pull/60/commits to check what the error in your test code is, and then revise it.
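If the linked PR does not apply cleanly, a generic visualization sketch like the one below can also help with eyeballing the test output. This is not the code from PR #60, and the detection format `(x1, y1, x2, y2, score)` is an assumption you may need to adapt to whatever your test script produces.

```python
import cv2

def draw_detections(image_path, detections, score_thresh=0.3, out_path="vis.jpg"):
    """Draw (x1, y1, x2, y2, score) boxes on an image to check the test output."""
    image = cv2.imread(image_path)
    for x1, y1, x2, y2, score in detections:
        if score < score_thresh:
            continue  # skip low-confidence boxes
        cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(image, "%.2f" % score, (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imwrite(out_path, image)
```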

stasysp commented 5 years ago

@qusongyun @liben2018 thank you =) On the training set I reach about the same mAP as SSD, but not on validation. Maybe there is a problem in the _decode function, because I cannot see the bboxes correctly.

moothes commented 5 years ago

I trained CornerNet with 8 GPUs (12 GB each) using the source code. After 260K iterations, it only reached 24.6 mAP.

stasysp commented 5 years ago

@moothes do the bboxes look OK? My bboxes look similar for all inputs.