megvii-research / AnchorDETR

An official implementation of the Anchor DETR.

num_feature_levels > 1 #9

Closed · yformer closed this issue 2 years ago

yformer commented 3 years ago

Hello,

When I try num_feature_levels > 1, the code doesn't work. The line srcs = torch.cat(srcs, dim=1) in anchor_detr.py raises a tensor size mismatch error. Any ideas on how to fix it?
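For reference, a minimal sketch of the failure (made-up shapes, not the repo's actual tensors): torch.cat along dim=1 requires all other dimensions to match, so per-level features kept at their native resolutions cannot be concatenated.

```python
import torch

# Hypothetical multi-level features: same channel count, native strides 8/16/32.
srcs = [
    torch.randn(2, 256, 100, 100),
    torch.randn(2, 256, 50, 50),
    torch.randn(2, 256, 25, 25),
]

try:
    torch.cat(srcs, dim=1)  # only the concat dim may differ
except RuntimeError as e:
    print(e)  # size mismatch in the spatial dimensions
```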

tangjiuqi097 commented 3 years ago

Hi, now you can only set num_feature_levels=3 to use multiple features.

yformer commented 3 years ago

@tangjiuqi097 Yeah, I set num_feature_levels = 3 and used multi-resolution features from different stages, and I got the same error. The resolutions of the feature maps from different stages do not match when doing the torch.cat. I don't see where your code handles this.

tangjiuqi097 commented 3 years ago

@yformer Did you make any other modifications? We use three features, i.e., down-strided C3, C4, and dilated C5, which all have the same spatial size.
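Roughly, the idea is the following (illustrative layer choices, not the exact modules in the repo): C3 is down-sampled to stride 16, C4 is already stride 16, and the dilated C5 stays at stride 16 instead of 32, so all three share one spatial size.

```python
import torch
import torch.nn as nn

c3 = torch.randn(1, 512, 100, 100)   # stride 8
c4 = torch.randn(1, 1024, 50, 50)    # stride 16
c5 = torch.randn(1, 2048, 50, 50)    # dilated C5: stride stays 16

down_c3 = nn.Conv2d(512, 256, kernel_size=3, stride=2, padding=1)  # 100x100 -> 50x50
proj_c4 = nn.Conv2d(1024, 256, kernel_size=1)
proj_c5 = nn.Conv2d(2048, 256, kernel_size=1)

feats = [down_c3(c3), proj_c4(c4), proj_c5(c5)]
print([tuple(f.shape) for f in feats])  # all (1, 256, 50, 50), so they can be concatenated
```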

yformer commented 3 years ago

@tangjiuqi097 , I see. It is down-strided C3, C4, and dilated C5. I directly used the feature maps from different stages. Thanks for your clarification. For the reported results such as Anchor DETR | DC5 | 50 | 44.2 | 151 | 16 (19), does it use down-strided C3, C4, and dilated C5, or just a single dilated C5? With a single dilated C5, I can only get 41.7 AP via training. That is why I was trying num_feature_levels > 1.

tangjiuqi097 commented 3 years ago

@yformer Hi, the reported results use a single-level feature. I have tested this released code multiple times, and the performance with the R50-DC5 feature ranges from 44.0 to 44.5 AP (we released the model with a median value). The performance you trained is not reasonable. What modifications have you made?

yformer commented 3 years ago

Hi @tangjiuqi097, I made some changes to map the 80 COCO object category ids to contiguous ids 0, ..., 79; the predicted contiguous category ids are mapped back to the non-contiguous ids before computing box AP. So I set num_classes = 80 and changed num_classes to 80 in transformer.py. In transformer.py you manually set num_classes = 91 since the max id is 90. The num_classes there depends on the max id, right?
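Roughly, the kind of mapping I mean (a sketch only; the annotation path and dict names are illustrative):

```python
from pycocotools.coco import COCO

coco = COCO('/path/to/coco/annotations/instances_val2017.json')  # illustrative path
coco_ids = sorted(coco.getCatIds())  # 80 ids with gaps, max id = 90

to_contiguous = {cid: i for i, cid in enumerate(coco_ids)}      # e.g. 1 -> 0, ..., 90 -> 79
from_contiguous = {i: cid for cid, i in to_contiguous.items()}

# train with to_contiguous[ann['category_id']] as the label (num_classes = 80),
# and map predictions back with from_contiguous before running COCO evaluation.
```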

Also, I used the 3-level features and the performance only increased a little, from 41.7 to 42.27 AP. Is that performance gain comparable to what you get?

tangjiuqi097 commented 3 years ago

@yformer Hi, it seems that mapping the category ids should not affect the performance, but I recommend you reproduce the performance without any modifications first. The improvement from the 3-level features is about 1 AP, so I think your performance gain is reasonable considering the fluctuation.

luohao123 commented 3 years ago

@yformer how is the speed affected by using 3-level features?

tangjiuqi097 commented 3 years ago

@luohao123 Hi, the effect on speed is: DC5 16 FPS vs. multi-level 11 FPS.

yformer commented 3 years ago

> @yformer Hi, it seems that mapping the category ids should not affect the performance, but I recommend you reproduce the performance without any modifications first. The improvement from the 3-level features is about 1 AP, so I think your performance gain is reasonable considering the fluctuation.

Yeah, I got the 44.2 AP result when using 8 images per batch and learning rate 0.0001. But when I use 16 images per batch with learning rate 0.0002, the performance drops to 43.7 AP. With 32 images per batch and learning rate 0.0004, the performance further drops to 0. When using multiple (8) nodes with 8 × 8 images per batch and learning rate 0.0001 × 8, the performance also drops, to 41.2 AP. Any suggestions to fix this?

yformer commented 3 years ago

> @yformer how is the speed affected by using 3-level features?

It becomes much slower. It also makes 16 images per batch run out of memory when training on V100s.

tangjiuqi097 commented 3 years ago

@yformer Hi, the linear scaling rule is too aggressive for the AdamW optimizer. You can try square-root scaling, e.g., 4 images per card with 8 GPUs using learning rate 0.0001 × 2.
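For example, a small sketch of the two scaling rules (not code from the repo; the base setting is assumed to be batchsize 8 with lr 0.0001):

```python
def scaled_lr(base_lr, base_bs, new_bs, rule="sqrt"):
    # Scale the learning rate by the batch-size ratio, either sqrt or linear.
    ratio = new_bs / base_bs
    return base_lr * (ratio ** 0.5 if rule == "sqrt" else ratio)

print(scaled_lr(1e-4, 8, 32, "sqrt"))    # 2e-4: the square-root-scaled setting
print(scaled_lr(1e-4, 8, 32, "linear"))  # 4e-4: the linear rule, too aggressive here
```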

By the way, it seems very strange that the performance is 0 for batchsize 32 but is not 0 for batchsize 64.

yformer commented 3 years ago

I tried square-root scaling. For batchsize 32 with learning rate 0.0002, the performance is 39.39 AP. Did you have a chance to train AnchorDETR with a batch size larger than 8?

luohao123 commented 3 years ago

@yformer I am training AnchorDETR with BS = 80 now (8 GPUs of 3090) and lr = 0.00025. I will report the final result after it is fully trained.

tangjiuqi097 commented 3 years ago

@yformer I have tried the R50-C5 model with batchsize 16 (2 images per card with 8 GPUs) to check the key_padding_mask for RCDA, and the performances are comparable (41.6 vs. 41.8); they could be the same with the learning rate carefully adjusted to 0.00015. I have not tried larger batchsizes since then, as the code keeps updating.

BTW, I think this lower performance is not just related to the batchsize and learning rate. Did you make any other modifications?

> For batchsize 32, learning rate 0.0002, the performance is 39.39 AP.

yformer commented 3 years ago

> @yformer I am training AnchorDETR with BS = 80 now (8 GPUs of 3090) and lr = 0.00025. I will report the final result after it is fully trained.

@luohao123 Cool. It will be great if you can share your larger batchsize result.

yformer commented 3 years ago

> @yformer I have tried the R50-C5 model with batchsize 16 (2 images per card with 8 GPUs) to check the key_padding_mask for RCDA, and the performances are comparable (41.6 vs. 41.8); they could be the same with the learning rate carefully adjusted to 0.00015. I have not tried larger batchsizes since then, as the code keeps updating.
>
> BTW, I think this lower performance is not just related to the batchsize and learning rate. Did you make any other modifications?
>
> For batchsize 32, learning rate 0.0002, the performance is 39.39 AP.

@tangjiuqi097 You mention the performance is 41.6 when using batchsize 16 with learning rate 0.0002, which is 2.7 AP lower than batchsize 8 with learning rate 0.0001. When you change the learning rate to 0.00015, does the performance increase to 44.3?

The code is the same except for using d2 for data loading and evaluation. Would it be possible to train with a larger batch size of 32 to see the scalability of AnchorDETR?

tangjiuqi097 commented 3 years ago

> You mention the performance is 41.6 when using batchsize 16 with learning rate 0.0002, which is 2.7 AP lower than batchsize 8 with learning rate 0.0001. When you change the learning rate to 0.00015, does the performance increase to 44.3?

No, that is the R50-C5 model, not R50-DC5. The performance of the R50-C5 model in that version of the code is 41.8.

tangjiuqi097 commented 3 years ago

> The code is the same except for using d2 for data loading and evaluation. Would it be possible to train with a larger batch size of 32 to see the scalability of AnchorDETR?

Could you try our original code rather than the d2 version for batchsize 32? You can try this command:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr 2e-4 --lr_backbone 2e-5 --batch_size 4 --coco_path /path/to/coco

luohao123 commented 3 years ago

@yformer @tangjiuqi097 I found that AnchorDETR converges fast at the beginning, but after that the AP cannot go up:

The AP at iteration 179999 is 28.9, and the AP at iteration 149999 was 29.5.

It seems stuck at 28.9 and cannot go up anymore.

Is that normal? My batchsize is 80 and the lr is 0.00025. BTW, I tried the same batchsize with lr 0.00035 and it was not able to converge (the AP increased much more slowly).

tangjiuqi097 commented 3 years ago

@luohao123 Hi, I am not sure whether the problem is that the learning rate is not suitable for the batch size, or the code implementation. Can you try our default setting to see whether your code can reproduce the performance?

luohao123 commented 3 years ago

@tangjiuqi097 I think my model is exactly the same as AnchorDETR, and I fixed the SetCriterion problem to make it all the same, as well as the HungarianMatcher. The remaining problem must be the lr scheduler.

But the lr scheduler is the same as in DETR, only with the lr changed from 0.0001 to 0.00025.

Also, since the first several iterations show fast convergence and the model at least works, maybe the model itself doesn't have any problem; the final issue could be the lr.

yformer commented 3 years ago

> @yformer @tangjiuqi097 I found that AnchorDETR converges fast at the beginning, but after that the AP cannot go up:
>
> The AP at iteration 179999 is 28.9, and the AP at iteration 149999 was 29.5.
>
> It seems stuck at 28.9 and cannot go up anymore.
>
> Is that normal? My batchsize is 80 and the lr is 0.00025. BTW, I tried the same batchsize with lr 0.00035 and it was not able to converge (the AP increased much more slowly).

Yeah, that is similar to what I observed. The code is very hard to scale to larger batch sizes. I tried many different batch sizes and learning rates but could not obtain the reported performance.

yformer commented 3 years ago

@tangjiuqi097 , I tried batchsize = 16, 32, 64, 128. I could not obtain the reported performance. I wonder if you can run a larger batchsize (> 8) to reproduce the performance?

luohao123 commented 3 years ago

@yformer I think it is not about the batchsize; it does not mean that using batchsize 8 would reproduce the original accuracy.

There is some deeper problem that is hard to investigate. Currently I have trained for about 250000 iterations and the AP cannot go up anymore.

AP 30 is the highest score so far.

So the model actually only gets AP 30?

tangjiuqi097 commented 3 years ago

> @yformer I think it is not about the batchsize; it does not mean that using batchsize 8 would reproduce the original accuracy.
>
> There is some deeper problem that is hard to investigate. Currently I have trained for about 250000 iterations and the AP cannot go up anymore.
>
> AP 30 is the highest score so far.
>
> So the model actually only gets AP 30?

@luohao123 I think the d2 version you reproduced may not be right. Can you try our official code?

tangjiuqi097 commented 3 years ago

> @tangjiuqi097, I tried batchsize = 16, 32, 64, 128. I could not obtain the reported performance. I wonder if you can run a larger batchsize (> 8) to reproduce the performance?

@yformer What performances did you get? Did you use our official code or your reproduced d2 version of the code?

yformer commented 2 years ago

> @tangjiuqi097, I tried batchsize = 16, 32, 64, 128. I could not obtain the reported performance. I wonder if you can run a larger batchsize (> 8) to reproduce the performance?
>
> @yformer What performances did you get? Did you use our official code or your reproduced d2 version of the code?

With batchsize 8 × 16 and learning rate 0.00025, the performance is 41.96 AP. I used the d2 version of the code, which can scale to a larger batch size.

luohao123 commented 2 years ago

@yformer What are your SOLVER configs? Can you share them? How do you set the steps in the multi-step learning rate scheduler?

yformer commented 2 years ago

@luohao123 I trained the model for 50 epochs using batchsize 16 × 8. The initial learning rate is 0.00025, decayed by 0.1 at 40 epochs. The optimizer is AdamW with 10 warmup iterations.

tangjiuqi097 commented 2 years ago

@yformer This result is much higher than your previous results. It still has about a 2 AP gap to our result with batchsize 8, though I think a gap within 1 AP would be more reasonable. I am concerned about whether your d2 version can reproduce the performance even with the same batchsize. Have you solved the problem in #16? If you want to train the model with multiple nodes, you can follow the instructions in DETR instead of the d2 version. BTW, can you try the setting with 1 image per card, 8 cards per node, and multiple nodes? I wonder if it is related to the padding region.
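For reference, a simplified sketch of DETR-style batch padding (not the AnchorDETR code) to show why the per-card batch size matters here: every image in a per-GPU batch is padded to the largest H and W in that batch, so 1 image per card means no padding at all.

```python
import torch

def pad_batch(images):
    # Pad each image to the largest H and W in the batch and record the
    # padded area in a boolean mask (True = padding), as DETR-style batching does.
    h = max(img.shape[1] for img in images)
    w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, h, w)
    mask = torch.ones(len(images), h, w, dtype=torch.bool)
    for i, img in enumerate(images):
        _, ih, iw = img.shape
        batch[i, :, :ih, :iw] = img
        mask[i, :ih, :iw] = False
    return batch, mask

imgs = [torch.randn(3, 800, 1000), torch.randn(3, 600, 1200)]  # hypothetical sizes
batch, mask = pad_batch(imgs)
print(batch.shape, mask.float().mean().item())  # non-zero padded fraction for batch > 1
```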

luohao123 commented 2 years ago

@yformer That makes no sense; d2 doesn't have a concept of epochs, it's all iterations. Can you share your d2 config file?

luohao123 commented 2 years ago

@tangjiuqi097 I am just curious: why don't you experiment with bs > 8 on your side? What is the performance there?

yformer commented 2 years ago

@luohao123 You can convert it to iterations since they are equivalent. My implementation is scaled for multiple machines, which is why I shared the epochs. The equivalent number of iterations is 46200.

IMS_PER_BATCH: 128
BASE_LR: 0.00025

STEPS: (36960,)
MAX_ITER: 46200

Other hyperparameters are the same as those you shared.
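As a rough check of the schedule above (assuming the ~118k images of COCO train2017), the epoch-to-iteration conversion works out as follows:

```python
images_per_epoch = 118_287   # COCO train2017 (assumed)
ims_per_batch = 128

iters_per_epoch = images_per_epoch // ims_per_batch  # 924
print(50 * iters_per_epoch)                          # 46200 -> MAX_ITER (50 epochs)
print(40 * iters_per_epoch)                          # 36960 -> STEPS (decay at 40 epochs)
```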

yformer commented 2 years ago

> @yformer This result is much higher than your previous results. It still has about a 2 AP gap to our result with batchsize 8, though I think a gap within 1 AP would be more reasonable. I am concerned about whether your d2 version can reproduce the performance even with the same batchsize. Have you solved the problem in #16? If you want to train the model with multiple nodes, you can follow the instructions in DETR instead of the d2 version. BTW, can you try the setting with 1 image per card, 8 cards per node, and multiple nodes? I wonder if it is related to the padding region.

For batchsize 8, the result can be reproduced. I did not resolve that issue; I do not know why it happened, and sometimes it did not show up. I tried 8 × 8 images per batch, and the result is quite similar to 16 × 8 images per batch.

luohao123 commented 2 years ago

@yformer Does bs=8 mean 1 img per GPU, or 8 images per GPU?

It is not possible to train for only 46200 iterations and get AP 42. You need at least 200000 iterations to reach a reasonable performance.

yformer commented 2 years ago

@luohao123, bs = 8 means 1 img per GPU; that run trains for more than 200K iterations. The 46200 iterations are for 16 × 8 images per batch.

luohao123 commented 2 years ago

@yformer Can you share your d2 code for training AnchorDETR, or your config? I am currently using bs=80, and after 200k iterations I only get 36 mAP.

tangjiuqi097 commented 2 years ago

@luohao123 Currently, I do not have enough cards for these experiments with large batches.

tangjiuqi097 commented 2 years ago

@yformer It may be too aggressive to increase the batchsize from 8 to 128; in particular, the AdamW optimizer does not obey the linear scaling rule well. I suggest using batchsize 16 or 32 first.

yformer commented 2 years ago

> @yformer It may be too aggressive to increase the batchsize from 8 to 128; in particular, the AdamW optimizer does not obey the linear scaling rule well. I suggest using batchsize 16 or 32 first.

@tangjiuqi097, based on my experiments, it seems that AnchorDETR does not scale well with large batch sizes. When coupled with other heads instead of the AnchorDETR head, the performance of my implementation is very consistent. Any idea why this happens, e.g., could it be the approximate self-attention?

tangjiuqi097 commented 2 years ago

@yformer You can try the standard attention by setting --attention_type nn.MultiheadAttention to investigate whether the standard attention shows the same phenomenon.

github-actions[bot] commented 2 years ago

This issue is not active for a long time and it will be closed in 5 days. Feel free to re-open it if you have further concerns.