samwize / python-email-crawler

Search on Google, and crawls for emails related to the result

TypeError: expected string or buffer #3

Open ghost opened 10 years ago

ghost commented 10 years ago

Sometimes it runs, sometimes it doesn't.

[14:22:38] INFO::email_crawler - Crawling http://www.google.com.au/search?q=electrician&start=0
[14:22:39] ERROR::email_crawler - Exception at url: http://www.google.com.au/search?q=electrician&start=0
HTTP Error 503: Service Unavailable
[14:22:39] ERROR::email_crawler - EXCEPTION: expected string or buffer

pmuens commented 10 years ago

+1 Same here!

rkshakya commented 9 years ago

+1 Same here. Could you please suggest a fix for this? Thank you

DonatoNapoli commented 9 years ago

+1 Same problem

dcondrey commented 8 years ago

python email_crawler.py "intext:gmail filetype:csv"
[10:14:12] INFO::email_crawler - ----------------------------------------
[10:14:12] INFO::email_crawler - Keywords to Google for: intext:gmail filetype:csv
[10:14:12] INFO::email_crawler - ----------------------------------------
[10:14:12] INFO::email_crawler - Crawling http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=0
[10:14:14] INFO::email_crawler - Crawling http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=10
...
[10:14:59] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=390
HTTP Error 503: Service Unavailable
[10:14:59] ERROR::email_crawler - EXCEPTION: expected string or buffer 
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module> 
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer
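
The traceback points at `google_url_regex.findall(data)` on line 65. "expected string or buffer" is the message Python 2's `re` module gives when it is handed something that is not a string, such as `None`. Since the log line just before it reports an HTTP 503, it seems likely that the fetch returned nothing and `data` was `None`. Below is a minimal sketch of a guard, assuming that is the cause. The pattern is a hypothetical stand-in, because the crawler's real `google_url_regex` is not shown in this thread, and `extract_result_urls` is an illustrative name.

```python
import re

# Hypothetical stand-in pattern; the real google_url_regex from
# email_crawler.py is not shown in this thread.
google_url_regex = re.compile(r'url\?q=(.*?)&sa=')


def extract_result_urls(data):
    """Return the result URLs found in a search page.

    Returns an empty list when the fetch failed and data is None or empty,
    instead of passing a non-string to the regex and raising TypeError.
    """
    if not data:
        return []
    return google_url_regex.findall(data)
```

With a guard like this, a 503 on one page would yield no URLs for that page rather than ending the whole run with a traceback.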

hamdi-islam commented 8 years ago

same problem

dcondrey commented 8 years ago

This issue should be resolved with this merge https://github.com/samwize/python-email-crawler/pull/7

thomaslc66 commented 8 years ago

The issue is still not resolved. I get the same error with the latest version cloned from git on my Linux machine.

mrkkr commented 7 years ago

I still have a problem with "TypeError: expected string or buffer". Can anyone help?

vizieral commented 7 years ago

Have the same issue as well

kevingatera commented 7 years ago

Here is a solution to your problem:

  1. Open the file email_crawler.py (If you are using the terminal use nano email_crawler.py to edit the file)
  2. Go to the 24th line saying MAX_SEARCH_RESULTS = 500 and then change it to MAX_SEARCH_RESULTS = 100

The reason is that the script crawls up to 500 Google search results, so Google treats the requests as spam, as if a spam-like script were scraping the internet through its search engine, and responds accordingly.
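
To see how much the cap changes the load on Google, here is a rough sketch that lists the search URLs the script would request for a given cap. It assumes 10 results per page, which matches the `start=0`, `start=10` steps in the logs earlier in this thread. `search_page_urls` and `RESULTS_PER_PAGE` are illustrative names, not taken from email_crawler.py, and the sketch is written for Python 3.

```python
from urllib.parse import urlencode

# Assumption: 10 results per page, inferred from the start=0, start=10
# steps visible in the logs posted in this thread.
RESULTS_PER_PAGE = 10


def search_page_urls(query, max_search_results):
    """Build one Google search URL per results page, up to the given cap."""
    return [
        "http://www.google.com/search?" + urlencode({"q": query, "start": start})
        for start in range(0, max_search_results, RESULTS_PER_PAGE)
    ]


if __name__ == "__main__":
    for cap in (500, 100):
        print(cap, "->", len(search_page_urls("electrician", cap)), "requests")
```

Under that assumption a cap of 500 means 50 requests in quick succession, while a cap of 100 means 10.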

charlieporth1 commented 6 years ago

I've got it too, and what @kevingatera suggested didn't work. It happens before it even gets the second page done, so it's not the script being blocked. The exact error I get is:

:~/python-email-crawler$ python email_crawler.py "ios developers"
[19:05:06] INFO::email_crawler - ----------------------------------------
[19:05:06] INFO::email_crawler - Keywords to Google for: ios developers
[19:05:06] INFO::email_crawler - ----------------------------------------
[19:05:06] INFO::email_crawler - Crawling http://www.google.com/search?q=ios+developers&start=0
[19:05:06] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=ios+developers&start=0
HTTP Error 503: Service Unavailable
[19:05:06] ERROR::email_crawler - EXCEPTION: expected string or buffer
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module>
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer

kevingatera commented 6 years ago

@charlieporth1 What's happening is that Google blocks your IP almost as soon as they get your request. Using another computer/IP will work.

charlieporth1 commented 6 years ago

@kevingatera Turns out I was using torify, and that didn't help. You should include IP rotation similar to what's in here. I would help you if I knew more about Python.
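
A minimal sketch of what IP rotation could look like, assuming Python 3's `urllib.request` (the crawler itself appears to target Python 2, judging by the error message) and a list of HTTP proxies you are allowed to use. The proxy addresses are placeholders from the documentation-only 192.0.2.0/24 range, and `proxy_cycle` and `fetch_via_next_proxy` are hypothetical helpers, not part of email_crawler.py.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (documentation-only range); these are
# assumptions and must be replaced with proxies you actually control.
PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]


def proxy_cycle(proxies):
    """Yield (proxy, opener) pairs forever, in round-robin order."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


def fetch_via_next_proxy(url, rotation, timeout=10):
    """Fetch url through the next proxy in the rotation and return the body as text."""
    _proxy, opener = next(rotation)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    rotation = proxy_cycle(PROXIES)
    # Show the order in which proxies would be used, without any network access.
    for _ in range(3):
        print(next(rotation)[0])
```

Each request goes out through a different proxy in turn, so no single IP sends every query. Whether that is enough to avoid the 503 responses is not something this thread confirms.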