s0md3v / Photon

Incredibly fast crawler designed for OSINT.
GNU General Public License v3.0
10.96k stars 1.49k forks

Multiple improvements #124

Closed oXis closed 5 years ago

oXis commented 5 years ago

The intels are searched for only inside the page's plain text, to avoid retrieving tokens that are garbage JavaScript code. Better regular expressions could allow searching inside JavaScript code for intels, though.
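As an illustration of the idea (this is not Photon's actual extraction code, and the regexes are simplified), matching intel such as e-mail addresses against stripped plain text rather than raw HTML avoids picking up look-alike tokens embedded in scripts:

```python
import re

# Simplified e-mail regex for illustration only.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

html = '<script>var a="x@y.z-not-real";</script><p>contact admin@example.com</p>'

# Strip <script> blocks and remaining tags, leaving only plain text.
plain = re.sub(r'<script.*?</script>|<[^>]+>', ' ', html, flags=re.S)

print(EMAIL.findall(plain))  # ['admin@example.com']
```

Running the same regex over the raw HTML would also match the fake address inside the script block.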

v1.3.0

Tested on

os: Linux Mint 19.1
python: 3.6.7
s0md3v commented 5 years ago

Hi there,

Thanks for the contribution. I am sorry that I caused some conflicts with your PR with my recent commits.

I was on a break, so this project was running purely on public contributions, but it looks like that ended up breaking it.

Now that I had the time, I made it stable again while dropping support for Python < 3.2 and removing Ninja mode.

If you can resolve the conflicts (there aren't many), that would be great. Otherwise I can do it myself, but it's better if you do it, since it's your code and you know it better.

Thanks again.

oXis commented 5 years ago

Hello :)

Conflicts are fixed. I reverted intels to bad_intels because I needed it for filtering bad matches (credit cards).

I recorded your changes in the changelog.

Do you want me to add more stuff?

About dropping Ninja mode: now that proxies are supported, you can use Photon with alpine-tor (https://github.com/zet4/alpine-tor). It's a rotating Tor proxy; maybe you can add a note in the docs. Also, intels can be extracted from Tor hidden services.

s0md3v commented 5 years ago

I would like to have your opinion on this. Check my implementation of proxies in my other project, XSStrike.

As Photon is a crawler, it makes a lot of sense to give it the ability to rotate proxies. We can do that right now by modifying XSStrike's implementation to accept both single proxies and a list.

What do you say?
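For illustration, a minimal sketch of what accepting both forms might look like (the helper name `parse_proxies` is hypothetical, not XSStrike's or Photon's actual code):

```python
import argparse

def parse_proxies(value):
    # Hypothetical helper: "ip:port" or "ip:port,ip:port,..." -> list of proxies.
    return [p.strip() for p in value.split(',') if p.strip()]

parser = argparse.ArgumentParser()
parser.add_argument('--proxy', type=parse_proxies, default=None,
                    help='single proxy or comma-separated list of proxies')

# A single proxy and a comma-separated list both yield a list internally.
args = parser.parse_args(['--proxy', '0.0.0.1:1337,0.0.0.2:1337'])
print(args.proxy)  # ['0.0.0.1:1337', '0.0.0.2:1337']
```

Normalizing to a list either way keeps the rotation code simple: one proxy is just a pool of size one.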

oXis commented 5 years ago

The --proxy argument can be changed to accept comma-separated proxies (like 0.0.0.1:1337,0.0.0.2:1337), but if the user wants to rotate through many proxies this technique becomes impractical.

So maybe the argument could also take a file with a list of proxies (one per line) and rotate through that list with random.choice. But we need to validate all the proxies first, because requests will not tell you that a proxy doesn't work.

I wrote some code to fetch proxies from the web.

import re
import urllib.request
import urllib.error

import requests

# Scrape a public proxy list page for ip:port pairs.
response = requests.get("https://free-proxy-list.net/anonymous-proxy.html")

regex = re.compile(
    r'<td>([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)</td><td>([0-9]+)</td>')

proxy_pool = []
for match in regex.findall(response.text):
    proxy_pool.append(f"{match[0]}:{match[1]}")

def is_good_proxy(pip):
    """Return True if the proxy answers a test request within 5 seconds."""
    try:
        proxy_handler = urllib.request.ProxyHandler({'http': pip})
        opener = urllib.request.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        # change the URL to test here
        req = urllib.request.Request('http://www.example.com')
        urllib.request.urlopen(req, timeout=5)
    except urllib.error.HTTPError:
        return False
    except Exception:
        return False
    return True

working_proxies = []
for proxy in proxy_pool:
    if is_good_proxy(proxy):
        working_proxies.append(proxy)

s0md3v commented 5 years ago

Thanks @oXis, just take a look at the code I reviewed and we can merge it.

oXis commented 5 years ago

I'll work on the rotating proxy on Monday.

oXis commented 5 years ago

Why is archive.org timing out? That's weird.

s0md3v commented 5 years ago

It isn't archive.org that was timing out. We are now using my personal website somdev.me for error-free and faster checking. Hang on.