waxnkw / IETrans-SGG.pytorch

This is the code for the ECCV 2022 (Oral) paper "Fine-Grained Scene Graph Generation with Data Transfer".

SGCLS #20

Open 184446223 opened 1 year ago

184446223 commented 1 year ago

The output is not in the same order as the input boxes and labels. How do you ensure the outputs stay in the same order as the inputs?

waxnkw commented 1 year ago

Which program are you running, the demo or the normal model test? If convenient, could you provide more information, such as a screenshot of the problem and the results you expect?

184446223 commented 1 year ago

Hello, and thanks for your reply. I used SGCLS as described in demo.md, with an image and my own bounding boxes as input.

First, I ran the command: bash cmds/50/transformer/demo_sgcls.sh visualization/demo_imgs/

For my input image, the label order is [222.0-grass, 119.0-hair, 594.0-pants, 98.0-shirt, 51.0-shoe, 242.0-man, 50.0-bush, 594.0-hand, 372.0, 51.0, 465.0, 291.0, 594.0, 222.0, 47.0, 594.0, 51.0, 177.0, 532.0, 119.0, 90.0, 600.0, 594.0, 50.0, 291.0, 291.0, 724.0, 594.0, 724.0, 314.0, 72.0, 594.0, 594.0, 372.0, 50.0, 594.0], but the predicted labels are [shirt, shirt, hair, hair, man, …]. The label orders do not match.

My labels come from a pretrained Faster R-CNN, and my boxes are in (x1, y1, x2, y2) format. Are your bboxes in (x, y, w, h)? If so, should I comment out the following code in /devdata/sxq/IETrans-SGG.pytorch/maskrcnn_benchmark/data/datasets/visual_genome.py?

```python
box = self.custom_bboxes[os.path.basename(self.custom_files[index])][0]
box = torch.from_numpy(box).reshape(-1, 4)  # guard against no boxes
```

My input boxes are:

```
[array([[ 35.93379593, 404.48034668, 328.89007568, 499.5       ],
        [188.82444763, 113.27613068, 238.16075134, 145.34199524],
        [  0.        , 119.37953949, 185.57318115, 427.5696106 ],
        [178.05912781, 230.6295929 , 259.18023682, 332.79406738],
        [206.31604004, 153.4493866 , 257.67294312, 246.56256104],
        [225.91688538, 334.0105896 , 263.7336731 , 367.63504028],
        [141.13349915, 161.30140686, 280.99633789, 340.75933838],
        [222.65255737,   0.        , 332.44500732, 380.91497803],
        [160.4355011 , 193.69120789, 181.72399902, 213.33145142],
        [190.19726562, 151.82095337, 260.29699707, 238.40953064],
        [106.64482117, 282.39590454, 332.44500732, 499.5       ],
        [  0.        ,   0.        , 190.92829895, 273.30636597],
        [  0.        ,   0.        , 268.98171997, 263.15631104],
        [  0.        , 229.88882446, 275.33688354, 499.5       ],
        [202.82028198, 232.92129517, 253.36407471, 347.34350586],
        [275.90008545, 214.25027466, 332.44500732, 331.80636597],
        [160.6545105 , 130.32173157, 286.07989502, 267.30819702],
        [ 95.13780212,   0.        , 280.80758667, 277.99682617],
        [  9.68846607, 197.34692383, 332.44500732, 451.08660889],
        [176.88702393, 108.68948364, 240.25912476, 163.06900024],
        [154.71687317, 104.48048401, 242.10623169, 325.04943848],
        [252.24610901, 414.94189453, 324.35147095, 475.68737793],
        [204.82348633,   0.        , 332.44500732, 297.49407959],
        [182.55186462, 128.4055481 , 268.1741333 , 335.48556519],
        [  0.        ,   0.        , 175.06044006, 174.50291443],
        [238.15716553,   6.93290997, 332.44500732, 223.21194458],
        [120.29826355, 314.39614868, 297.75561523, 416.58966064],
        [  0.        ,  48.12928009, 332.44500732, 321.48184204],
        [181.61392212, 219.59176636, 332.44500732, 499.5       ],
        [156.070755  ,  82.52002716, 247.61343384, 251.04698181],
        [ 47.03872299,   0.        , 332.44500732,  95.03981781],
        [ 16.86422157, 238.12612915, 132.80288696, 451.29125977],
        [ 60.7062912 ,   0.        , 332.44500732, 272.66577148],
        [155.28938293, 190.64122009, 187.01828003, 217.20217896],
        [130.56956482,  62.13659286, 310.03619385, 404.93789673],
        [  9.8139286 , 217.02027893, 154.49185181, 347.40713501]])]
```

The output boxes are:

```
tensor([[342.6978, 273.2777, 469.0036, 429.1371],
        [371.7406, 276.2089, 464.2756, 443.8126],
        [  0.0000,   0.0000, 344.0150, 491.9514],
        [340.2242, 203.8970, 429.1185, 261.6156],
        [318.7154, 195.6411, 432.8993, 293.5242],
        [278.7691, 188.0649, 436.2274, 585.0890],
        [  0.0000,   0.0000, 484.6517, 473.6813],
        [ 30.3860, 428.6270, 239.2845, 812.3242],
        [369.0513,   0.0000, 599.0000, 535.4893],
        [  0.0000, 214.8832, 334.3661, 769.6253],
        [  0.0000,   0.0000, 315.4242, 314.1052],
        [365.4420, 419.2583, 456.5118, 625.2183],
        [429.1120,  12.4792, 599.0000, 401.7815],
        [ 17.6828, 390.6365, 278.3637, 625.3328],
        [401.1758,   0.0000, 599.0000, 685.6470],
        [328.9223, 231.1300, 483.1966, 603.8740],
        [  0.0000,  86.6327, 599.0000, 578.6673],
        [289.0730, 348.6442, 327.4306, 383.9966],
        [407.0575, 601.2191, 475.1958, 661.7430],
        [109.3807,   0.0000, 599.0000, 490.7984],
        [281.2086, 148.5360, 446.1503, 451.8846],
        [  0.0000, 413.7999, 496.1025, 899.1000],
        [235.2605, 111.8459, 558.6238, 728.8882],
        [254.2946, 290.3425, 506.2997, 613.3668],
        [279.8007, 343.1542, 336.9699, 390.9639],
        [497.1173, 385.6505, 599.0000, 597.2515],
        [171.4195,   0.0000, 505.9596, 500.3943],
        [289.4676, 234.5791, 515.4593, 481.1548],
        [ 17.4567, 355.2245, 599.0000, 811.9559],
        [320.8273, 415.1333, 466.9914, 599.0293],
        [216.7536, 565.9130, 536.4966, 749.8614],
        [454.4975, 746.8954, 584.4171, 856.2372],
        [ 64.7456, 728.0646, 592.5947, 899.1000],
        [192.1528, 508.3126, 599.0000, 899.1000],
        [ 84.7545,   0.0000, 599.0000, 171.0717],
        [327.2323, 395.2652, 599.0000, 899.1000]])
```

The bboxes no longer correspond either. I think this is a bbox format problem: should I modify the code to handle (x1, y1, x2, y2) data instead of (x, y, w, h)? Would that make them consistent?
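For reference, a minimal sketch of converting between the two box formats in question, assuming NumPy arrays of shape (N, 4); the helper names are hypothetical and not part of the repo:

```python
import numpy as np

def xywh_to_xyxy(boxes: np.ndarray) -> np.ndarray:
    """Convert [x, y, w, h] boxes to [x1, y1, x2, y2]."""
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    return np.stack([x, y, x + w, y + h], axis=1)

def xyxy_to_xywh(boxes: np.ndarray) -> np.ndarray:
    """Convert [x1, y1, x2, y2] boxes to [x, y, w, h]."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    return np.stack([x1, y1, x2 - x1, y2 - y1], axis=1)
```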

184446223 commented 1 year ago

I have two more questions: 1. Does the model output roughly one relation for every pair of regions? When there are many relations, how do I filter out the useful ones? None of the relation scores seems particularly high, so how should I choose? 2. Which performs better, VG-50 or VG-1800? How should I choose between the models?

184446223 commented 1 year ago

Also, can SGCLS extract attribute information, such as actions and colors?

waxnkw commented 1 year ago
  1. First, the bbox coordinates are indeed [x1, y1, x2, y2], not [x, y, w, h].
  2. SGCLS predicts the box labels itself, so the predictions will not match your inputs exactly.
  3. During prediction, the model resizes the input image. If the image is resized, the input box coordinates are resized along with it, so the output boxes are in the resized coordinate system. You can use the visualization code (which resizes the image to match the boxes) to check whether the results are correct.

To summarize: first make sure your coordinates are in [x1, y1, x2, y2] format, then run the visualization code on the predictions to verify the results.

  1. Correct, roughly one relation is output for each pair of regions. The common selection method I have seen is to take the top-k relations (e.g., top 30); my visualization demo does the same (see the sketch after this list).
  2. VG-50 is more accurate, but many of its relations carry little information, e.g., on and in. VG-1800 is less accurate (though still acceptable), but the relations it extracts are more diverse, e.g., (9-motorcycle, casting, 12-shadow): a motorcycle casting a shadow.
  3. No, attribute information cannot be extracted; I did not train with attributes.
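For illustration, a minimal sketch of the top-k selection described in point 1 above (the function name and tensor layout are assumptions for this example, not the repo's actual output format):

```python
import torch

def select_top_k_relations(rel_pairs: torch.Tensor,
                           rel_scores: torch.Tensor,
                           k: int = 30):
    """Keep only the k highest-scoring relations.

    rel_pairs:  (M, 2) subject/object region indices.
    rel_scores: (M,)   one confidence score per candidate relation.
    """
    k = min(k, rel_scores.numel())
    top_scores, top_idx = rel_scores.topk(k)
    return rel_pairs[top_idx], top_scores
```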
184446223 commented 1 year ago

So the bbox order in the SGCLS output should be the same as the input bbox order? I understand the labels differ, but I don't know why the semantics deviate so much, so I suspected the order had been shuffled. It shouldn't be shuffled, right? I will try what you suggested. Thanks for the reply!

waxnkw commented 1 year ago

The order should be the same, but the specific values may change. For example, if the image is originally [600, 800] but the model resizes it to [1200, 1600], every box's coordinates are also multiplied by 2: a box that was [0, 0, 30, 40] becomes [0, 0, 60, 80].
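In other words, mapping the predicted boxes back to the original image is just a per-axis division by the resize scale. A minimal sketch, assuming (width, height) size tuples and [x1, y1, x2, y2] boxes; the function is hypothetical, not part of the repo:

```python
import torch

def rescale_boxes(boxes: torch.Tensor,
                  resized_size: tuple[int, int],
                  original_size: tuple[int, int]) -> torch.Tensor:
    """Map [x1, y1, x2, y2] boxes from the resized image back to the original."""
    sx = original_size[0] / resized_size[0]  # width scale, e.g. 600 / 1200 = 0.5
    sy = original_size[1] / resized_size[1]  # height scale, e.g. 800 / 1600 = 0.5
    scale = boxes.new_tensor([sx, sy, sx, sy])
    return boxes * scale

# Example from the comment above: a [0, 0, 60, 80] box on the
# resized [1200, 1600] image maps back to [0, 0, 30, 40].
```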