pgaref / HTTP_Request_Randomizer

Proxying Python Requests
http://pgaref.com/blog/python-proxy/
MIT License

Add more Proxy Providers #36

Open pgaref opened 7 years ago

pgaref commented 7 years ago

Possible Proxy Lists

Providers that require an API key:

Could also parse related forum thread - blackhatworld:

la55u commented 6 years ago

I may add a couple of these later if you provide some info on how to do it, what needs to be implemented etc.

pgaref commented 6 years ago

Hello @la55u

Every proxy parser currently extends the base UrlParser class, which represents any URL containing proxy information. If you check an implementation such as the SamairProxyParser class, it mainly overrides the parse_proxyList method, which does three things: 1) parses the page HTML, 2) retrieves the proxy information from the HTML, and 3) returns a list of proxy objects. The HTML parsing part is handled by BeautifulSoup, which should make it a bit easier.
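To make the three steps concrete, here is a minimal, self-contained sketch. UrlParserSketch, ExampleProxyParser, ProxyObject, and the regex-based extraction are simplified stand-ins (the real project uses its own UrlParser base class and BeautifulSoup), and the table layout is hypothetical:

```python
import re
from dataclasses import dataclass

@dataclass
class ProxyObject:
    """Simplified stand-in for the project's proxy object."""
    ip: str
    port: str

class UrlParserSketch:
    """Simplified stand-in for the project's UrlParser base class."""
    def __init__(self, url):
        self.url = url

    def parse_proxyList(self, html):
        raise NotImplementedError

class ExampleProxyParser(UrlParserSketch):
    def parse_proxyList(self, html):
        # 1) parse the page HTML, 2) extract proxy info, 3) return proxy objects.
        # Hypothetical layout: each proxy is a row like <tr><td>IP</td><td>port</td></tr>.
        proxies = []
        for ip, port in re.findall(r"<tr><td>([\d.]+)</td><td>(\d+)</td></tr>", html):
            proxies.append(ProxyObject(ip, port))
        return proxies

page = "<table><tr><td>10.0.0.1</td><td>8080</td></tr></table>"
proxies_found = ExampleProxyParser("http://example.com").parse_proxyList(page)
print(proxies_found)
```

A real implementation would fetch self.url and walk the table with BeautifulSoup instead of a regex, but the shape of the class is the same.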

If you want to support a new provider, for instance coolProxy, you would create a new class extending UrlParser. Then, by inspecting the HTML fields you need and using BeautifulSoup, you could retrieve the proxy information. You might also need to decode hidden information: for example, the IPs on that specific provider are encoded. You will need to do something like:

import base64, codecs

# codecs.getencoder returns an encoder function, and calling it
# returns a (text, length) tuple, so take the text before base64-decoding
rot13_text, _length = codecs.getencoder("rot-13")("IP_string")
base64.b64decode(rot13_text)

PS: Some of the existing providers have updated their websites, adding extra JavaScript or encodings to hide proxy information (that's why some existing providers currently fail). However, this does not mean there is no way around it :)

Let me know if this makes sense - I would be happy to help!

la55u commented 6 years ago

I don't really get this encoding. For example, the IPs listed on the coolProxy website are not the actual proxy IPs that we need? edit: oh wait, I think I get it; the IPs are not present in the HTML when we query it from Python! I'll add this site later this week.

pgaref commented 6 years ago

Hey @la55u

When you view the source code of the provider (Ctrl+U in Google Chrome) you will realise that every proxy is a row in an HTML table. Most of the information in that table can be traversed directly, but the IPs, for example, are 'text/javascript' elements, meaning that you need to do a bit more to decode them :)

In the provider above, for example, the first tag (td) in the first table row I found looks like:

Now if we rot13-encode the string above, codecs.getencoder( "rot-13" )("BGZhBGxhAv4kAGt="), we get "OTMuOTkuNi4xNTg=". Then, base64-decoding that, we get 93.99.6.158, which is the IP we were actually looking for!
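The decode chain above can be run end to end (the input value is the one from the comment):

```python
import base64
import codecs

encoded_td = "BGZhBGxhAv4kAGt="  # encoded value taken from the provider's td element

# Step 1: rot13-encode; getencoder returns an encoder whose result is (text, length)
rot13_text, _ = codecs.getencoder("rot-13")(encoded_td)

# Step 2: base64-decode the rot13 output to recover the IP
ip = base64.b64decode(rot13_text).decode("ascii")
print(ip)  # 93.99.6.158
```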

Does it make sense?