lindsey98 / PhishIntention

PhishIntention: Phishing detection through webpage intention
MIT License

Question regarding the 9010 set #17

Open HCY123902 opened 1 year ago

HCY123902 commented 1 year ago

Sorry to bother you. I am currently trying to train a new CRP classifier that takes text input, and I would like to check whether I can use part of your training samples. Is the HTML available for each of the 9010 samples used to train your CRP classifier?

If possible, I would also like to check how the 9010 samples were taken from the Phishpedia dataset. I tried to use the domain name of each sample page, such as 12tv, to match it to a page in the original sets phish_sample_30k and benign_sample_30k. However, most of the CRP samples seem to have no exact domain match.
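
To make the matching attempt concrete, here is a minimal sketch of this kind of domain matching. It assumes each sample is identified by a folder name that is either a bare label (like 12tv) or a full host or URL. The helper names `domain_key` and `match_samples` are hypothetical, and the sample names in the usage lines are made up for illustration; they are not taken from the actual sets.

```python
import re


def domain_key(name: str) -> str:
    """Reduce a sample folder name or URL to a lowercase host for comparison."""
    name = name.strip().lower()
    name = re.sub(r"^[a-z][a-z0-9+.-]*://", "", name)  # drop a URL scheme, if any
    name = name.split("/", 1)[0]                       # drop any path
    name = name.split(":", 1)[0]                       # drop any port
    if name.startswith("www."):
        name = name[4:]
    return name


def first_label(host: str) -> str:
    """Return the part of the host before the first dot."""
    return host.split(".", 1)[0]


def match_samples(crp_names, original_names):
    """Match CRP sample names to original sample names by domain.

    Tries an exact host match first, then falls back to comparing only the
    first label. Returns (matched, unmatched), where matched maps each CRP
    name to the list of candidate original names.
    """
    by_host = {}
    by_label = {}
    for orig in original_names:
        host = domain_key(orig)
        by_host.setdefault(host, []).append(orig)
        by_label.setdefault(first_label(host), []).append(orig)

    matched, unmatched = {}, []
    for name in crp_names:
        host = domain_key(name)
        hits = by_host.get(host) or by_label.get(first_label(host))
        if hits:
            matched[name] = hits
        else:
            unmatched.append(name)
    return matched, unmatched


# Made-up names for illustration only.
crp = ["12tv", "360converter.com", "unknown-site"]
originals = ["www.360converter.com", "https://12tv.example/login", "example.org"]
matched, unmatched = match_samples(crp, originals)
```

The first-label fallback is loose and can return several candidates for one CRP sample, so any match found this way would still need a manual check against the screenshots.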

Among the sample pages that do have a domain match in the original Phishpedia dataset, the screenshot of the CRP sample differs from the screenshot of the original sample. One example is the domain name 360converter. Its screenshot in the 9010 set indicates that the sample is not a CRP.

However, its screenshot in the original set benign_sample_30k shows that the sample is a CRP.

Given this, can I ask how to match the 9010 samples to the original samples in the Phishpedia evaluation sets? I look forward to your reply.