Closed oXis closed 5 years ago
Hi there,
Thanks for the contribution. I am sorry that I caused some conflicts with your PR with my recent commits.
I was on a break, so this project was running purely on public contributions, but it looks like that ended up breaking the project.
Now that I've had the time, I have made it stable again while dropping support for Python < 3.2 and removing Ninja mode.
If you can resolve the conflicts (there aren't many), that would be great; otherwise I can do it myself (though it would be better if you did, since it's your code and you know it best).
Thanks again.
Hello :)
Conflicts are fixed. I reverted intels to bad_intels because I needed it for filtering bad matches (credit cards).
I recorded your changes in the changelog.
Do you want me to add more stuff?
About dropping Ninja mode: now that proxies are supported, you can use Photon with alpine-tor (https://github.com/zet4/alpine-tor). It's a rotating Tor proxy; maybe you can add a note in the docs. Also, intels can be extracted from Tor hidden services.
I would like to have your opinion on this. Check my implementation of proxies in my other project, XSStrike.
As Photon is a crawler, it makes a lot of sense to give it the ability to rotate proxies. We can do that right now by modifying XSStrike's implementation to accept both single proxies and a list.
What do you say?
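For the rotation itself, a minimal sketch could pick a proxy at random for each request. The pool contents and the random_proxy helper below are hypothetical illustrations, not Photon's or XSStrike's actual API:

```python
import random

# Hypothetical proxy pool; in practice this would come from the user.
proxy_pool = ["0.0.0.1:1337", "0.0.0.2:1337", "0.0.0.3:1337"]

def random_proxy(pool):
    """Pick a proxy at random for the next request (simple rotation)."""
    addr = random.choice(pool)
    # requests expects a scheme -> proxy-URL mapping
    return {"http": "http://" + addr, "https": "http://" + addr}

# Each request can then use a (possibly different) proxy:
# requests.get(url, proxies=random_proxy(proxy_pool), timeout=5)
```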
The --proxy argument can be changed to accept comma-separated proxies (like 0.0.0.1:1337,0.0.0.2:1337), but if the user wants to rotate through many proxies this technique becomes impractical.
So maybe the argument could also take a file with a list of proxies (one per line) and rotate through that list with random.choice. But we need to validate all the proxies first, because requests will not tell you that a proxy doesn't work.
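A minimal sketch of such an argument handler (load_proxies is a hypothetical name, not an existing Photon function): treat the value as a file path if one exists on disk, otherwise split it on commas:

```python
import os

def load_proxies(value):
    """Return a list of proxies from a --proxy-style value.

    Accepts either a comma-separated string of host:port entries or a
    path to a file with one proxy per line. Hypothetical helper, not
    Photon's actual implementation.
    """
    if os.path.isfile(value):
        with open(value) as f:
            return [line.strip() for line in f if line.strip()]
    return [p.strip() for p in value.split(",") if p.strip()]

# Rotation is then simply: random.choice(load_proxies(args.proxy))
```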
I wrote some code to fetch proxies from the web.
import re
import requests
import urllib.request
import urllib.error

# Scrape the free proxy list (IP and port are in adjacent table cells)
response = requests.get("https://free-proxy-list.net/anonymous-proxy.html")
regex = re.compile(
    r'<td>([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)</td><td>([0-9]+)</td>')

proxy_pool = []
for match in regex.findall(response.text):
    proxy_pool.append(f"{match[0]}:{match[1]}")

def is_good_proxy(pip):
    """Return True if a test request through the proxy succeeds."""
    try:
        proxy_handler = urllib.request.ProxyHandler({'http': pip})
        opener = urllib.request.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        # change the URL to test here
        req = urllib.request.Request('http://www.example.com')
        urllib.request.urlopen(req, timeout=5)
    except urllib.error.HTTPError:
        return False
    except Exception:
        return False
    return True

working_proxies = []
for proxy in proxy_pool:
    if is_good_proxy(proxy):
        working_proxies.append(proxy)
Thanks @oXis, just take a look at the code I reviewed and we can merge it.
I'll work on the rotating proxy on Monday.
Why is archive.org timing out? That's weird.
It isn't archive.org that was timing out. We are now using my personal website somdev.me for error-free and faster checking. Hang on.
The intels are searched for only inside the page's plain text, to avoid retrieving tokens or garbage JavaScript code. Better regular expressions could allow searching inside JavaScript code for intels, though.
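As a rough illustration of the plain-text-only approach (a sketch, not Photon's actual implementation), one can strip script/style blocks and tags before applying an intel regex, so matches buried in JavaScript source are ignored:

```python
import re

def plain_text(html):
    """Crudely reduce HTML to visible text: drop script/style blocks,
    then strip the remaining tags. Illustrative only."""
    html = re.sub(r'(?s)<(script|style)\b.*?</\1>', ' ', html)
    return re.sub(r'<[^>]+>', ' ', html)

# Simple e-mail "intel" pattern for the demonstration
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

page = '<p>contact: admin@example.com</p><script>var x = "bot@junk.js";</script>'
print(EMAIL.findall(plain_text(page)))  # -> ['admin@example.com']
```

The address inside the script block is never seen by the regex, which is the behaviour described above.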
v1.3.0
Added the -p, --proxy option (HTTP proxy only). Tested on