Closed nickdibari closed 7 years ago
Implemented a feature to pull only IPs and ports from the US. It still does not seem to return an audio file, as the widget thinks (or really, knows) we're sending automated requests
Changing the site used to http://freeproxylists.net
EDIT: freeproxylists didn't work. Ironically enough, you need to pass a captcha to access the site, so we can't use it for automation. Switching to coolproxys to hopefully get around this
EDIT 2: coolproxys did not work either: they use JavaScript to write the IP address into the table, which hides it from scrapers and makes my job that much harder. Final shot is to use spys.me, as they post a text file of proxies hourly
Good news: spys.me works really well. Using regexes to parse the returned text for the IPs and ports of US proxies. The list also updates every hour, which will help us keep a fresh proxy pool
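A rough sketch of that regex parsing, for illustration only. It assumes each proxy line looks like `IP:PORT CC-A[-S] [+|-]` (country code, anonymity letter, optional SSL flag, optional Google Passed marker); that layout, the name `parse_us_proxies`, and the sample addresses (documentation-range IPs) are all assumptions, not taken from the actual code.

```python
import re

# Assumed line layout: "IP:PORT CC-A[-S] [+|-]", e.g. "198.51.100.7:3128 US-N-S +".
# A trailing "+" is assumed to mean the proxy has the Google Passed attribute.
PROXY_LINE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d{1,5})[ \t]+"
    r"(?P<country>[A-Z]{2})-(?P<anon>[A-Z])(?P<ssl>-S)?!?"
    r"(?:[ \t]+(?P<google>[+-]))?",
    re.MULTILINE,
)


def parse_us_proxies(text):
    """Return (ip, port, google_passed) tuples for the US entries in `text`."""
    proxies = []
    for match in PROXY_LINE.finditer(text):
        if match.group("country") != "US":
            continue
        proxies.append(
            (match.group("ip"), int(match.group("port")), match.group("google") == "+")
        )
    return proxies


# Made-up sample text; lines that do not match the pattern are skipped.
SAMPLE = """\
Some header line that is not a proxy entry
198.51.100.7:3128 US-N-S +
192.0.2.44:8080 US-H -
203.0.113.9:8080 DE-A-S +
"""

us_proxies = parse_us_proxies(SAMPLE)
```

Keeping the Google Passed marker as a separate boolean, rather than filtering on it inside the regex, leaves the decision about whether to require it to the caller.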
Bad news: we are still being flagged as an automated request, so no challenge is returned. What could cause reCaptcha to refuse to return an audio challenge when using a proxy?
Upon further investigation, it looks like the root problem might be that the proxies we use have already been blacklisted by Google and are known to be compromised. It makes sense that Google would then block those IP addresses from reCaptcha use.
The first thing we should do is see if ANY of the proxies we are scraping work. If we can get one working, then we can hopefully loop over all the proxies in our list to find at least one that works
We should also consider expanding the criteria for which proxies to add to the pool. Right now it's selecting US proxies with the Google Passed attribute. Consider removing the second check on Google Passed, as it does not seem to be all that relevant anymore. The more proxies the better anyway
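One way the relaxed criteria could look, as a sketch only. The dictionary keys (`country`, `google_passed`) and the function name `select_proxies` are hypothetical, and the entries use documentation-range addresses rather than real proxies.

```python
def select_proxies(candidates, country="US", require_google_passed=False):
    """Keep proxies from `country`; optionally also require the Google Passed flag."""
    return [
        proxy
        for proxy in candidates
        if proxy["country"] == country
        and (proxy["google_passed"] or not require_google_passed)
    ]


# Made-up candidate entries for illustration.
candidates = [
    {"ip": "198.51.100.7", "port": 3128, "country": "US", "google_passed": True},
    {"ip": "192.0.2.44", "port": 8080, "country": "US", "google_passed": False},
    {"ip": "203.0.113.9", "port": 8080, "country": "DE", "google_passed": True},
]

strict_pool = select_proxies(candidates, require_google_passed=True)
relaxed_pool = select_proxies(candidates)
```

With the flag off, the second US entry is kept as well, which is the "more proxies" trade-off described above.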
Got one! Tested with the following proxy and was able to download a challenge audio file:
Server: 104.37.212.5
Port: 3128
So we know that some of these proxies can work; we might just need to try a few first
EDIT: Full success (download file->convert to text->pass correct answer->submit form) on the following proxy:
Server: 52.45.142.12
Port: 3128
Seems like this is working for now. We should implement a way to try multiple proxies, as it could be that some work and some don't
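Trying multiple proxies could be sketched roughly like this. The helper `first_working_proxy` and the injected `check` callable are hypothetical; in practice `check` would be whatever step actually attempts to fetch the audio challenge through the proxy, which is not shown here.

```python
def first_working_proxy(proxies, check):
    """Return the first proxy for which check(proxy) is truthy, or None."""
    for proxy in proxies:
        try:
            if check(proxy):
                return proxy
        except Exception:
            # Treat any error raised by the check as "this proxy did not work".
            continue
    return None


# Stand-in check for illustration: pretend only one made-up proxy works.
candidates = [("192.0.2.10", 3128), ("198.51.100.7", 3128), ("203.0.113.9", 8080)]
working = {("198.51.100.7", 3128)}
chosen = first_working_proxy(candidates, lambda proxy: proxy in working)
```

Passing the check in as a function keeps the loop independent of how a "working" proxy is defined, so the criteria can change without touching the retry logic.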
Fixed for now in #18. Further implementations can fine tune the proxy pool but this works!
Proxy retrieved from pool is not guaranteed to work: Seems to time out when loading reCaptcha
Currently no check for working reCaptcha: Implement a feature to check that reCaptcha doesn't time out
No system to choose proxy: Should keep track of which ones have been used before
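A minimal sketch of how these follow-ups might be approached, under stated assumptions. `responds_in_time` and `ProxyPool` are hypothetical names; the check only tests that some URL answers through the proxy using the standard library, so it is a stand-in for a real check that the reCaptcha widget itself loads.

```python
import http.client
import random
import urllib.request


def responds_in_time(ip, port, url, timeout=10.0):
    """Return True if `url` answers with HTTP 200 through the proxy.

    `timeout` bounds each blocking socket operation, not the total request time.
    """
    proxy_url = f"http://{ip}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, and HTTP errors via URLError.
        return False


class ProxyPool:
    """Hands out proxies at random and remembers which ones were already used."""

    def __init__(self, proxies):
        self._unused = list(proxies)
        self._used = []

    def take(self):
        """Return a random unused proxy and record it as used; None when empty."""
        if not self._unused:
            return None
        proxy = self._unused.pop(random.randrange(len(self._unused)))
        self._used.append(proxy)
        return proxy

    @property
    def used(self):
        """Copy of the proxies handed out so far, in the order they were taken."""
        return list(self._used)


# Made-up proxies for illustration; no network request is made here.
pool = ProxyPool([("198.51.100.7", 3128), ("192.0.2.44", 8080)])
first = pool.take()
second = pool.take()
third = pool.take()
```

A caller could combine the two by taking proxies from the pool until `responds_in_time` succeeds, or until `take()` returns `None` and the pool needs refreshing.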