lindsey98 / PhishIntention

PhishIntention: Phishing detection through webpage intention
MIT License

Unmatched Test Result Number #7

Open imethanlee opened 2 years ago

imethanlee commented 2 years ago

Hi,

I ran PhishIntention on the 25K benign webpage dataset, which contains 25400 benign webpages. However, the output results file contains results for only 21813 webpages. I ran it several times, but the number of results stayed the same. Is this expected, or might something have gone wrong?

P.S. The numbers match when I test the algorithm on the 25K CRP phishing webpage dataset (25403 input webpages, 25403 output results).

Thanks in advance.

lindsey98 commented 2 years ago

Hi, I think this is because some of the benign website folders do not contain an info.txt file. In that case, kindly replace https://github.com/lindsey98/PhishIntention/blob/main/phishintention/phishintention_main.py#L169-L172 with the following code:

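info_path = os.path.join(full_path, 'info.txt')
if not os.path.exists(screenshot_path):  # skip folders without a screenshot
    continue
try:
    # read the URL recorded in info.txt
    with open(info_path, encoding='ISO-8859-1') as f:
        url = f.read()
except OSError:  # info.txt is missing or unreadable
    url = 'https://www' + item  # fall back to a URL built from the folder name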
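
For reference, the fallback idea can be tried in isolation. The following is only a minimal sketch, not the code in the repository: `resolve_url` and its `fallback_prefix` parameter are hypothetical names introduced here, and the sketch additionally strips whitespace and treats an empty info.txt like a missing one, which the snippet above does not do. The prefix is left as a parameter because the right value depends on how the dataset names its folders.

```python
import os
import tempfile


def resolve_url(folder_path, folder_name, fallback_prefix="https://"):
    """Return the URL stored in info.txt, or a fallback built from the folder name.

    Hypothetical helper for illustration only; not part of PhishIntention.
    """
    info_path = os.path.join(folder_path, "info.txt")
    try:
        with open(info_path, encoding="ISO-8859-1") as f:
            url = f.read().strip()
    except OSError:
        # info.txt is missing or unreadable
        url = ""
    # Assumption of this sketch: an empty info.txt is handled like a missing one.
    return url or fallback_prefix + folder_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with_info = os.path.join(root, "a.example")
        without_info = os.path.join(root, "b.example")
        os.makedirs(with_info)
        os.makedirs(without_info)
        with open(os.path.join(with_info, "info.txt"), "w", encoding="ISO-8859-1") as f:
            f.write("https://a.example/login\n")
        print(resolve_url(with_info, "a.example"))       # URL read from info.txt
        print(resolve_url(without_info, "b.example"))    # URL built from the folder name
```

With this kind of fallback, a folder without info.txt still yields a result instead of being dropped, which is why the output count moves closer to the input count.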

lindsey98 commented 2 years ago

By the way, for the ROC curve we did not run the Step 4 (dynamic analysis) part. Here is the code we used: https://github.com/lindsey98/PhishIntention/blob/main/phishintention/src/pipeline_eval.py#L20

imethanlee commented 2 years ago

Hi, I ran 'run.py' from the newest version of the code on the benign_25k dataset. This time it generated 25184 results, which is slightly fewer than 25400. Is this expected?
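
One way to check whether skipped folders account for the remaining gap is to count them directly. The sketch below is only an illustration under stated assumptions: `audit_dataset` is a hypothetical helper, and the screenshot file name `shot.png` is an assumption about the dataset layout (only `info.txt` is named in this thread). It counts site folders that lack a screenshot, since those are skipped by the `continue` in the snippet above.

```python
import os
import tempfile


def audit_dataset(root, screenshot_name="shot.png", info_name="info.txt"):
    """Count site folders under root by which expected files they contain.

    Hypothetical helper; the default file names are assumptions.
    """
    counts = {"total": 0, "no_screenshot": 0, "no_info": 0}
    missing_screenshot = []
    for item in sorted(os.listdir(root)):
        full_path = os.path.join(root, item)
        if not os.path.isdir(full_path):
            continue  # ignore stray files at the top level
        counts["total"] += 1
        if not os.path.exists(os.path.join(full_path, screenshot_name)):
            counts["no_screenshot"] += 1
            missing_screenshot.append(item)
        if not os.path.exists(os.path.join(full_path, info_name)):
            counts["no_info"] += 1
    return counts, missing_screenshot


if __name__ == "__main__":
    # Build a tiny throwaway dataset: one complete folder, one without
    # info.txt, and one without a screenshot.
    with tempfile.TemporaryDirectory() as root:
        layout = {
            "a.example": ["shot.png", "info.txt"],
            "b.example": ["shot.png"],
            "c.example": ["info.txt"],
        }
        for name, files in layout.items():
            os.makedirs(os.path.join(root, name))
            for file_name in files:
                open(os.path.join(root, name, file_name), "w").close()
        counts, skipped = audit_dataset(root)
        print(counts)
        print("folders that would be skipped:", skipped)
```

If `total - no_screenshot` on the real dataset equals the number of output results, the difference is explained by folders without a screenshot.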

imethanlee commented 2 years ago

Problem solved.