microsoft / scene_graph_benchmark

image scene graph generation benchmark

Extremely different evaluation results on Visual Genome using RelDN #74

Open smichniak opened 2 years ago

smichniak commented 2 years ago

After evaluating the pretrained RelDN model on the Visual Genome dataset, we get the following results:

{"danfei_metric": {"sgdet20": 0.03587473726452877, "sgdet50": 0.059597123157443185, "sgdet100": 0.08203152214357358},
 "rowan_metric": {"sgdet20": 0.03556429024402322, "sgdet50": 0.05944168796383135, "sgdet100": 0.08196929203549849}}

Those metrics are extremely different from those listed in the model zoo file.

| model | sgdet@20 | sgdet@50 | sgdet@100 | sgcls@20 | sgcls@50 | sgcls@100 | predcls@20 | predcls@50 | predcls@100 | model | config |
|-------|----------|----------|-----------|----------|----------|-----------|------------|------------|-------------|-------|--------|
| RelDN | 24.0 | 32.4 | 37.8 | 31.9 | 35.7 | 36.6 | 54.0 | 60.9 | 62.5 | link | link |
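For context, the sgdet numbers are triplet recall@K values averaged over the test images. Below is a minimal sketch of that metric in its exact-match form; the real SGDet evaluation additionally requires the predicted subject/object boxes to match the ground-truth boxes at IoU >= 0.5, and this repo's implementation may differ in details, so treat it as an illustration only.

```python
import numpy as np

def recall_at_k(gt_triplets, pred_triplets, pred_scores, k):
    """Illustrative triplet recall@K: the fraction of ground-truth
    (subject, predicate, object) triplets found among the top-k scoring
    predictions. Box IoU matching is omitted for brevity."""
    order = np.argsort(-np.asarray(pred_scores))[:k]
    top_k = {tuple(pred_triplets[i]) for i in order}
    hits = sum(1 for t in gt_triplets if tuple(t) in top_k)
    return hits / max(len(gt_triplets), 1)

# Toy example: two of three ground-truth triplets recovered in the top-20 predictions.
gt = [("man", "riding", "horse"), ("horse", "on", "grass"), ("man", "wearing", "hat")]
preds = [("man", "riding", "horse"), ("horse", "on", "grass"), ("dog", "near", "man")]
scores = [0.9, 0.8, 0.7]
print(recall_at_k(gt, preds, scores, k=20))  # 0.666...
```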

The testing is run using the following command:

python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/test_sg_net.py --config-file sgg_configs/vg_vrd/rel_danfeiX_FPN50_reldn.yaml TEST.IMS_PER_BATCH 4

with NGPUS=4. The pretrained model is downloaded from here and is located in

models/vgvrd/vgnm_usefpTrue_objctx0_edgectx2
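One quick sanity check is to inspect which parameter groups the downloaded checkpoint actually contains. A minimal sketch, assuming the usual maskrcnn_benchmark checkpoint layout; the .pth filename is a placeholder for whatever file the download provides:

```python
from collections import Counter

import torch

# Placeholder filename; substitute the actual .pth file from the download.
ckpt_path = "models/vgvrd/vgnm_usefpTrue_objctx0_edgectx2/model_final.pth"

# Load on CPU so the inspection needs no GPU.
ckpt = torch.load(ckpt_path, map_location="cpu")
# maskrcnn_benchmark-style checkpoints typically nest the weights under "model".
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

# Count parameters per top-level module (e.g. backbone, rpn, roi_heads)
# to see whether relation-head weights are present at all.
print(Counter(name.split(".")[0] for name in state_dict))
```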

Attached are the output logs of the evaluation: out.txt

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6612/6612 [13:31<00:00,  8.15it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6612/6612 [13:38<00:00,  8.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6612/6612 [13:43<00:00,  8.03it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6612/6612 [16:56<00:00,  6.51it/s]
INFO:maskrcnn_benchmark.inference:Total run time: 0:16:59.859915 (0.1542554510745917 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Total run time: 0:17:00.009417 (0.15427806349824075 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Total run time: 0:17:00.276712 (0.15431849238714399 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Model inference time: 0:12:33.692396 (0.11399718610383754 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Model inference time: 0:12:44.732291 (0.11566698800901741 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Total run time: 0:16:56.360049 (0.15372609075237437 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Model inference time: 0:12:39.823767 (0.11492456589633665 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Model inference time: 0:15:58.514275 (0.14497682441158102 s / img per device, on 4 devices)
INFO:maskrcnn_benchmark.inference:Convert prediction results to tsv format and save.
WARNING:scene_graph_generation.inference:performing scene graph evaluation.
WARNING:scene_graph_generation.inference:===================sgdet(motif)=========================
WARNING:scene_graph_generation.inference:sgdet-recall@20: 0.035564
WARNING:scene_graph_generation.inference:sgdet-recall@50: 0.059442
WARNING:scene_graph_generation.inference:sgdet-recall@100: 0.081969
WARNING:scene_graph_generation.inference:=====================sgdet(IMP)=========================
WARNING:scene_graph_generation.inference:sgdet-recall@20: 0.03587473726452877
WARNING:scene_graph_generation.inference:sgdet-recall@50: 0.059597123157443185
WARNING:scene_graph_generation.inference:sgdet-recall@100: 0.08203152214357358
DavidHuji commented 2 years ago

I have the same issue. From my analysis, it looks like the problem is in the weights of the relation head, because the RPN proposals look good.

smichniak commented 2 years ago

From what I can see, the head weights are properly loaded from the checkpoint. Do you think the provided pretrained weights themselves could be wrong, and that the listed results come from an evaluation of a different set of weights?
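For what it's worth, one way to double-check that claim is to compare the relation-head tensors in the checkpoint against the model's state dict after loading. A sketch under the assumption that relation-head keys contain "relation" (typical maskrcnn_benchmark-style naming, not verified against this repo):

```python
import torch


def check_relation_head_weights(model, ckpt_path):
    """Report relation-head tensors in the checkpoint that are missing from,
    or differ from, the weights actually loaded into `model`."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    ckpt_sd = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    model_sd = model.state_dict()

    for name, ckpt_tensor in ckpt_sd.items():
        if "relation" not in name:
            continue
        # Checkpoints saved from (Distributed)DataParallel may carry a
        # "module." prefix that the rebuilt model does not.
        key = name[len("module."):] if name.startswith("module.") else name
        if key not in model_sd:
            print("missing in model:", name)
        elif not torch.equal(model_sd[key].cpu(), ckpt_tensor.cpu()):
            print("value mismatch:", name)
```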

DavidHuji commented 2 years ago

Yup, that is my guess, but I also tried other models (Neural Motifs and GRCNN) and got similarly poor results, so it seems unlikely that they uploaded the wrong weights for all of them.

ChenCongGit commented 2 years ago

Have you solved this problem?

zhuang-li commented 2 years ago

I got the same result. It seems odd.

VSJMilewski commented 1 year ago

Any update on this? I just ran the evaluation and got the same results.