singlerider / applicationbot

Scrapes Craigslist and Indeed

Craigslist IP Blocks #1

Open singlerider opened 9 years ago

singlerider commented 9 years ago

After a certain number of requests in a short timeframe, Craigslist automatically blocks access to the segments of the service being scraped. Craigslist also blocks known Tor connections by default. If either condition is met, this application gets back a "403" (Forbidden) error. The best approach will likely be a combination of two things: calling sleep(n) between requests, with n a randomized float between 5 and 10 seconds, and routing requests through external proxies. Once I have tested enough (allowing time between IP blocks), or have resolved this another way, I will close this issue with the best method.
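A minimal sketch of that combination, using only the Python standard library. This is not the code in this repository: the function names (`random_delay`, `build_opener`, `fetch_all`) and the proxy address in the usage note are hypothetical, and treating a 403 as "blocked" is simply the behavior described above.

```python
import random
import time
import urllib.error
import urllib.request


def random_delay(low=5.0, high=10.0):
    """Return a random float number of seconds between low and high."""
    return random.uniform(low, high)


def build_opener(proxy=None):
    """Build a urllib opener, optionally routed through one proxy URL."""
    handlers = []
    if proxy:
        # Assumption: the same proxy serves both http and https traffic.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)


def fetch_all(urls, proxies=None, sleep=time.sleep):
    """Fetch each URL in turn, sleeping a random 5-10 s between requests.

    Returns a dict mapping each URL to its body bytes, or to None when the
    server answered 403 (treated here as "this IP or proxy is blocked").
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(random_delay())  # pause before every request except the first
        # Pick a proxy at random per request when a list is supplied.
        proxy = random.choice(proxies) if proxies else None
        opener = build_opener(proxy)
        try:
            with opener.open(url, timeout=30) as resp:
                results[url] = resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise  # anything other than a block is a real error
            results[url] = None
    return results
```

Usage would look something like `fetch_all(urls, proxies=["http://127.0.0.1:8080"])`, where the proxy address is a placeholder. The `sleep` parameter exists only so the delay can be swapped out when testing.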

singlerider commented 9 years ago

Craigslist uses CAPTCHA checkboxes to block bot scraping.