s0md3v / Photon

Incredibly fast crawler designed for OSINT.

Option to skip crawling of URLs that match a regex pattern #40

Closed · connorskees closed this 6 years ago

connorskees commented 6 years ago

Added option to skip crawling of URLs that match a regex pattern, changed handling of seeds, and removed double spaces.

I am assuming this is what you meant in the ideas column; tell me if I'm way off, though :)
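For reference, a minimal sketch of what the regex-based exclusion check could look like. The `--exclude` flag is the one discussed later in this thread; the helper name `is_excluded` is hypothetical and not Photon's actual code.

```python
import re

def is_excluded(url, exclude_pattern):
    """Return True if the URL matches the user-supplied exclusion regex."""
    if not exclude_pattern:
        return False
    return re.search(exclude_pattern, url) is not None

# Example: skip everything under /tags/ and any .pdf links
pattern = r'/tags/|\.pdf$'
print(is_excluded('https://example.com/tags/python', pattern))   # True
print(is_excluded('https://example.com/questions/1', pattern))   # False
```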

s0md3v commented 6 years ago

Thanks for fixing all that stuff :heart:

I will merge the pull request as soon as I finish evaluating and testing the code.

Meanwhile, can you please mention the changes in the changes variable in photon.py?

s0md3v commented 6 years ago

The is_link() function accepts a string as an argument, i.e. a single URL, while your parse() function accepts a list of URLs. But what's the point of accepting a list when you only have a single URL available inside the is_link() function?

connorskees commented 6 years ago

There isn't really a point in having it accept a list of URLs. I originally wrote the function thinking it would be called on storage before the links in storage were crawled. Instead, I used it with is_link() and did not edit the function.
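To illustrate the mismatch being discussed: a per-URL predicate gets called from inside a loop, while a list-based function filters a whole batch in one call. The function bodies below are hypothetical illustrations of the two calling conventions, not Photon's actual implementations.

```python
import re

EXCLUDE = re.compile(r'/login|/logout')  # example exclusion pattern

def is_link(url):
    """Predicate style: called once per URL, e.g. from inside the crawl loop."""
    return EXCLUDE.search(url) is None

def parse(urls):
    """Batch style: takes a list and returns only the URLs worth crawling."""
    return [url for url in urls if EXCLUDE.search(url) is None]

urls = ['https://example.com/questions/1', 'https://example.com/login']
print([u for u in urls if is_link(u)])  # per-URL checks in a loop
print(parse(urls))                      # one pass over the whole list
```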

s0md3v commented 6 years ago

How about this? I guess this is more efficient :)

connorskees commented 6 years ago

Is there a reason to use a regex for finding anchor tags over bs4?

s0md3v commented 6 years ago

bs4 is slower.
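The contrast here is between scanning raw HTML with a regex and building a full parse tree with BeautifulSoup. A rough sketch of the two approaches, assuming a simplified href-matching regex rather than Photon's actual one:

```python
import re
from bs4 import BeautifulSoup  # only needed for the bs4 variant

html = '<a href="/questions/1">Q1</a> <a class="tag" href="/tags/python">python</a>'

# Regex approach: one pass over the raw HTML, no parse tree is built.
hrefs_re = re.findall(r'<a[^>]+href=["\']?([^"\'>\s]+)', html)

# bs4 approach: builds a full DOM first, which is slower on large pages.
soup = BeautifulSoup(html, 'html.parser')
hrefs_bs4 = [a.get('href') for a in soup.find_all('a', href=True)]

print(hrefs_re)   # ['/questions/1', '/tags/python']
print(hrefs_bs4)  # ['/questions/1', '/tags/python']
```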

s0md3v commented 6 years ago

It's not working as intended, can you please check what's wrong?

connorskees commented 6 years ago

It was adding URLs from zap. Now, running with --exclude .* gives:

URLs retrieved from robots.txt: 0
Level 1: 1 URLs
Progress: 1/1
Crawling 15 JavaScript files
Progress: 15/15

URLs: 1
Intel: 0
Files: 0
Endpoints: 0
Fuzzable URLs: 0
Custom strings: 0
JavaScript Files: 15
External References: 0

s0md3v commented 6 years ago

It doesn't seem to work :/

I have a 3-level-deep mirror of Stack Overflow on my localhost where I test Photon.

[screenshot: screenshot_2018-08-01_23-58-16]

s0md3v commented 6 years ago

It's confusing! It's working perfectly with the real Stack Overflow. Thanks for everything. I will merge it in 5.

s0md3v commented 6 years ago

Alright, I found the bug: it doesn't work if the website doesn't have a sitemap.xml or robots.txt.

[screenshot: screenshot_2018-08-02_00-04-34]

s0md3v commented 6 years ago

Checking every URL against the regex is a bad idea because it will reduce performance. Can't you implement it in extractor, where the links are returned, so we can quickly return only the ones which don't match?
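A sketch of what filtering at the extraction site could look like, so excluded links never enter the crawl queue in the first place. The function name and regex here are illustrative assumptions, not Photon's actual extractor.

```python
import re

def extract_links(page_html, exclude=None):
    """Pull hrefs out of a fetched page and drop excluded ones immediately,
    so they never reach the crawl queue or cost an HTTP request later."""
    links = re.findall(r'<a[^>]+href=["\']?([^"\'>\s]+)', page_html)
    if exclude:
        pattern = re.compile(exclude)
        links = [link for link in links if not pattern.search(link)]
    return links

page = '<a href="/questions/1">Q</a><a href="/users/42">U</a>'
print(extract_links(page, exclude=r'/users/'))  # ['/questions/1']
```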

s0md3v commented 6 years ago

It works perfectly now. I have one more change to make, but I will make it myself because I have bothered you enough already. Thanks for everything, bro :heart: You rock!

connorskees commented 6 years ago

Sorry, just wondering what about the exclusion feature was poorly implemented (for my own reference).

s0md3v commented 6 years ago

I suggested that checking URLs one by one is slow, and that's why I asked you to move it from is_link() to extractor, but you still put it in a place where the URLs will be checked one by one, i.e. in a for loop.

I didn't want to bother you anymore, and it would have been unfair not to merge your request after this much hard work, so I kept the refactor and merged it.

However, I have now realized the best place to exclude the URLs is this.

connorskees commented 6 years ago

I'm really interested now: how can you exclude the URLs matching the regex without checking them all one by one?

It's totally fine bothering me :) This is a great learning experience.

s0md3v commented 6 years ago

It's not really about matching them one by one, it's about the optimal location to do so. If you match them while our precious HTTP requests are being made, you will end up eating into their time. That's why I suggested another location in the previous comment.