About benign25k dataset evaluation

Fujiaoji commented 5 months ago

Hi Lin, thanks for sharing the dataset. I have a question about using it. How do you evaluate different models on the benign25k dataset? For example, for EMD, you only use the detection number, i.e. how many images in the benign dataset score higher than the threshold. This number should be considered the true negative count, right? However, in the benign25k dataset, few of the folders are in the target list, so they should not score higher than this threshold. In that case, should the samples lower than the threshold be considered true positives? And for the comparison, do you use the domain information for EMD, etc., or is it based only on similarity? Thanks.

lindsey98 commented 5 months ago

Hi Fujiao,

In Figure 11, we report the TPR (True Positive Rate) and FPR (False Positive Rate) for different thresholds of similarity. If a benign webpage/logo exceeds the threshold for any brand on the target list, it is reported as a false positive (benign but incorrectly reported as phishing). If the benign page does not match any of the brands on the target list, it is reported as a true negative.
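As a rough illustration of the counting rule described above, here is a minimal sketch. The function name `benign_fpr`, the brand names, and the similarity scores are all made up for the example and are not taken from the PhishIntention code or the paper's data. Since every page in the benign set is a negative, each page is either a false positive (its best similarity to some target brand exceeds the threshold) or a true negative.

```python
from typing import Dict, List


def benign_fpr(similarities: List[Dict[str, float]], threshold: float) -> float:
    """FPR on a benign-only set: share of pages whose best score exceeds the threshold.

    `similarities` holds one dict per benign page, mapping each target brand
    (hypothetical names here) to the similarity score against that brand.
    """
    if not similarities:
        raise ValueError("need at least one benign page")
    false_positives = sum(
        1
        for per_brand in similarities
        # A page is a false positive if ANY brand on the target list exceeds the threshold.
        if per_brand and max(per_brand.values()) > threshold
    )
    # All remaining pages are true negatives, so FPR = FP / (FP + TN) = FP / N.
    return false_positives / len(similarities)


# Hypothetical scores for three benign pages against two target brands.
benign_scores = [
    {"paypal": 0.91, "amazon": 0.40},
    {"paypal": 0.30, "amazon": 0.35},
    {"paypal": 0.55, "amazon": 0.72},
]

for t in (0.5, 0.8, 0.95):
    print(f"threshold={t}: FPR={benign_fpr(benign_scores, t):.3f}")
```

On a benign-only set there are no true positives to count; the TPR would come from running the same thresholding over a separate phishing set.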

We based our analysis solely on similarity; we did not further check for domain consistency. However, we observe that EMD often matches benign pages to incorrect targets (EMD is based on the screenshot color distribution). Therefore, introducing a domain check is unlikely to reduce the FPR.
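To see why a domain check may not help here, consider the following sketch. It assumes a common form of domain-consistency check, where a page is reported as phishing when it matches a brand but its domain is not one of that brand's legitimate domains. The mapping `BRAND_DOMAINS`, the function `flagged_with_domain_check`, and the example domains are hypothetical and not part of the EMD baseline or of PhishIntention.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from target brand to its legitimate domains.
BRAND_DOMAINS: Dict[str, Set[str]] = {
    "paypal": {"paypal.com"},
    "amazon": {"amazon.com"},
}


def flagged_with_domain_check(page_domain: str, matched_brand: Optional[str]) -> bool:
    """Return True if the page would be reported as phishing.

    `matched_brand` is the brand whose similarity exceeded the threshold,
    or None if no brand on the target list matched.
    """
    if matched_brand is None:
        return False  # no match: reported as benign
    # Matched a brand: report phishing only if the domain is inconsistent with it.
    return page_domain not in BRAND_DOMAINS[matched_brand]


# A benign page wrongly matched to an unrelated brand is still flagged,
# because its domain cannot be consistent with the wrong brand.
print(flagged_with_domain_check("example-shop.com", "paypal"))
# A genuine brand page matched to its own brand passes the check.
print(flagged_with_domain_check("paypal.com", "paypal"))
```

Under this assumption, the domain check only clears pages that were matched to the correct brand, so false positives caused by incorrect matches remain.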

Fujiaoji commented 4 months ago

Thanks for your reply. Yeah, relying only on the threshold will cause a high false positive rate (predicting benign as phishing). The thing is, I think the reference list of logos should consist of benign logos, so when comparing benign websites against the reference-list logos, the similarity will definitely be higher than the threshold, which will cause a lot of false positives. Am I misunderstanding something?

lindsey98 commented 4 months ago

Most of the benign25k pages are not in the reference list. Our reference list only includes 277 brands.

We observe that EMD often matches benign pages to irrelevant brands. The high FPR of EMD is due to its low logo-matching accuracy (it is based on optimal transport optimization), not because the benign pages are in the reference list.

Our PhishIntention solution, however, does not report a high FPR, because it does not match the logo to an incorrect target and it has a validation mechanism for credential-taking intention.

Fujiaoji commented 4 months ago

Yeah, I see, got it. Thanks for your patience and reply.

lindsey98 / PhishIntention, issue #26