less-and-less-bugs / LogicMD

The official code implementation of the paper "Interpretable Multimodal Misinformation Detection with Logic Reasoning", accepted to Findings of ACL 2023.

NaN appears on Twitter when using the provided data with otherwise unchanged code; results differ substantially from the paper's; dataset count mismatch #3

Open SmildWind opened 2 weeks ago

SmildWind commented 2 weeks ago

Hello! Great work! When reproducing it, the code ran normally, but in the class RFND defined in models/rule_detection.py, the line E_ = torch.stack(E_, dim=0).contiguous() * values produced a large number of NaNs, which in turn made the loss entirely NaN. The cause appears to be that torch.stack(E_, dim=0).contiguous() contains zeros while values contains -inf, and 0 * (-inf) = NaN. We tried to replace the NaNs with the following code:

E_stacked = torch.stack(E_, dim=0).contiguous()
# Expand values so it broadcasts against E_stacked
values_expanded = values.unsqueeze(-1)
zero_mask = E_stacked == 0
inf_mask = values_expanded == -float('inf')
# Replace -inf with 0 to avoid the NaN (note: unsqueeze returns a view,
# so this in-place write also modifies the original `values` tensor)
values_expanded[inf_mask] = 0
result = E_stacked * values_expanded
E_ = result
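
For context, this is standard IEEE-754 floating-point behavior: 0 * (±inf) is defined as NaN. Below is a minimal standalone illustration, together with an out-of-place variant of the patch above (masked_fill returns a new tensor, so values and its views stay untouched). This is an illustrative sketch, not code from the repo:

import torch

# IEEE-754: 0 * (-inf) is NaN, which is exactly what poisons the loss.
e = torch.tensor([0.0, 1.0, 2.0])
v = torch.tensor([float('-inf'), float('-inf'), 0.5])
print(e * v)  # tensor([nan, -inf, 1.])

# Out-of-place variant of the patch above: masked_fill returns a new
# tensor, so the original `v` is left unmodified and no inf ever
# enters the product.
v_safe = v.masked_fill(torch.isinf(v), 0.0)
print(e * v_safe)  # tensor([0., 0., 1.])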

We still do not understand why a 0 * (-inf) = NaN arises here. The dataset we used is Twitter_Set, with text_path='dataset/twitter/texts/twitter_final2.json' and img_path='dataset/twitter/embedding.pt'.

Besides values_expanded[inf_mask] = 0, we also tried values_expanded[inf_mask] = -3276800 to approximate -inf, but the results are still far from the 0.911 reported in the paper. We do not know the exact cause; if something in our setup is wrong, please point it out!
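
As an aside, if a finite stand-in for -inf is wanted, the dtype's most negative representable value may be a more principled choice than a hand-picked constant; torch.nan_to_num supports this directly. Again an illustrative sketch, and whether it changes accuracy is a separate question:

import torch

v = torch.tensor([float('-inf'), 0.5])
# Replace -inf with the dtype's most negative finite value (~-3.4e38
# for float32) instead of an arbitrary constant like -3276800.
v_finite = torch.nan_to_num(v, neginf=torch.finfo(v.dtype).min)
print(v_finite)  # tensor([-3.4028e+38,  5.0000e-01])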

The training results are as follows:

# Training results with values_expanded[inf_mask] = 0
Train Epoch 0: Time 472.6773, Acc: 0.8900, Loss: 1.4030, Rumor_R: 0.8729, Rumor_P: 0.9233, Rumor_F: 0.8974, Non_Rumor_R: 0.9110, Non_Rumor_P: 0.8537, Non_Rumor_F1: 0.8814
Test: Time: 17.2103, Acc: 0.3280, Loss: 0.7011, Rumor_R: 0.6040, Rumor_P: 0.2672, Rumor_F: 0.3705, Non_Rumor_R: 0.1936, Non_Rumor_P: 0.5011, Non_Rumor_F1: 0.2793

Train Epoch 1: Time 497.7716, Acc: 0.9885, Loss: 1.0146, Rumor_R: 0.9863, Rumor_P: 0.9928, Rumor_F: 0.9895, Non_Rumor_R: 0.9912, Non_Rumor_P: 0.9833, Non_Rumor_F1: 0.9872
Test: Time: 14.9006, Acc: 0.3309, Loss: 0.7000, Rumor_R: 0.6113, Rumor_P: 0.2697, Rumor_F: 0.3743, Non_Rumor_R: 0.1945, Non_Rumor_P: 0.5069, Non_Rumor_F1: 0.2811

Train Epoch 2: Time 519.4532, Acc: 0.9884, Loss: 1.0179, Rumor_R: 0.9861, Rumor_P: 0.9928, Rumor_F: 0.9894, Non_Rumor_R: 0.9912, Non_Rumor_P: 0.9831, Non_Rumor_F1: 0.9871
Test: Time: 15.9966, Acc: 0.3202, Loss: 0.6990, Rumor_R: 0.5055, Rumor_P: 0.2421, Rumor_F: 0.3274, Non_Rumor_R: 0.2300, Non_Rumor_P: 0.4887, Non_Rumor_F1: 0.3128

Train Epoch 3: Time 513.4600, Acc: 0.9884, Loss: 1.0060, Rumor_R: 0.9865, Rumor_P: 0.9924, Rumor_F: 0.9894, Non_Rumor_R: 0.9907, Non_Rumor_P: 0.9835, Non_Rumor_F1: 0.9871
Test: Time: 15.7953, Acc: 0.2820, Loss: 0.7016, Rumor_R: 0.5949, Rumor_P: 0.2496, Rumor_F: 0.3517, Non_Rumor_R: 0.1297, Non_Rumor_P: 0.3967, Non_Rumor_F1: 0.1954

Train Epoch 4: Time 520.3489, Acc: 0.9901, Loss: 1.0017, Rumor_R: 0.9850, Rumor_P: 0.9970, Rumor_F: 0.9910, Non_Rumor_R: 0.9964, Non_Rumor_P: 0.9819, Non_Rumor_F1: 0.9891
Test: Time: 15.7334, Acc: 0.2885, Loss: 0.7014, Rumor_R: 0.6095, Rumor_P: 0.2548, Rumor_F: 0.3593, Non_Rumor_R: 0.1323, Non_Rumor_P: 0.4105, Non_Rumor_F1: 0.2001
# Training results with values_expanded[inf_mask] = -327800
Train Epoch 0: Time 528.8097, Acc: 0.8892, Loss: 1.3516, Rumor_R: 0.9241, Rumor_P: 0.8807, Rumor_F: 0.9019, Non_Rumor_R: 0.8463, Non_Rumor_P: 0.9009, Non_Rumor_F1: 0.8727
Test: Time: 16.1896, Acc: 0.3566, Loss: 0.6983, Rumor_R: 0.6204, Rumor_P: 0.2812, Rumor_F: 0.3870, Non_Rumor_R: 0.2282, Non_Rumor_P: 0.5527, Non_Rumor_F1: 0.3231

Train Epoch 1: Time 518.5146, Acc: 0.9955, Loss: 1.0037, Rumor_R: 0.9949, Rumor_P: 0.9968, Rumor_F: 0.9959, Non_Rumor_R: 0.9961, Non_Rumor_P: 0.9938, Non_Rumor_F1: 0.9950
Test: Time: 15.6606, Acc: 0.2963, Loss: 0.6963, Rumor_R: 0.4526, Rumor_P: 0.2202, Rumor_F: 0.2963, Non_Rumor_R: 0.2202, Non_Rumor_P: 0.4526, Non_Rumor_F1: 0.2963

Train Epoch 2: Time 515.2467, Acc: 0.9977, Loss: 0.9867, Rumor_R: 0.9977, Rumor_P: 0.9981, Rumor_F: 0.9979, Non_Rumor_R: 0.9977, Non_Rumor_P: 0.9972, Non_Rumor_F1: 0.9974
Test: Time: 15.6808, Acc: 0.2963, Loss: 0.6985, Rumor_R: 0.5036, Rumor_P: 0.2335, Rumor_F: 0.3191, Non_Rumor_R: 0.1954, Non_Rumor_P: 0.4472, Non_Rumor_F1: 0.2719

Another question: in the paper, the Twitter dataset is described as:

Twitter (Boididou et al., 2018) contains 7334 rumors and 5599 non-rumors for training and 564 rumors and 427 non-rumors for testing.

These train/test counts do not seem to match the splits data[:8617] and data[8617:] of dataset/twitter/texts/twitter_final2.json. Did we use the wrong file, or have we misconfigured something? Could you please let us know?
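
For reference, the counts can be checked in a few lines (assuming twitter_final2.json loads as a flat list, as the slicing above implies):

import json

with open('dataset/twitter/texts/twitter_final2.json') as f:
    data = json.load(f)

# The paper reports 7334 + 5599 = 12933 training and 564 + 427 = 991 test samples.
print(len(data[:8617]), len(data[8617:]))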

less-and-less-bugs commented 2 weeks ago

This looks like exploding gradients, possibly caused by the hyperparameter settings or something else; you would need to print the gradients to check. Did you run into this problem on the Weibo and Sarcasm datasets? And are you using the latest code?
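
For anyone debugging the same issue, a minimal sketch of such a gradient check, to run right after loss.backward() (the model name is illustrative, not from this repo); if norms are indeed exploding, torch.nn.utils.clip_grad_norm_ before optimizer.step() is the standard mitigation:

import math
import torch

# Run right after loss.backward(), before optimizer.step(), to spot
# exploding or NaN gradients (`model` is any torch.nn.Module).
def report_grads(model: torch.nn.Module) -> None:
    total = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.norm().item()
        total += g ** 2
        if math.isnan(g) or g > 1e3:  # flag NaN or suspiciously large norms
            print(f"{name}: grad norm {g:.3e}")
    print(f"total grad norm: {math.sqrt(total):.3e}")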

Use the data in the dataset folder as the reference. The numbers reported in the paper may be the sizes of the original datasets.