jaanli / food2vec

:hamburger:
MIT License
221 stars 48 forks

scraping allrecipes website response errors #13

Open schnapi opened 7 years ago

schnapi commented 7 years ago

I would like to know why I am getting a lot of errors like this when I try to scrape allrecipes.com.

Thanks!

2017-10-27 13:31:38 [allrecipes] DEBUG: No item received for http://allrecipes.com/recipe/16348/baked-pork-chops-i/
2017-10-27 13:31:38 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/16348/baked-pork-chops-i/> (referer: http://allrecipes.com/recipes/?page=2)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 33, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str
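The `TypeError` above suggests that `data` is sometimes a JSON list rather than a dict, so indexing it with the string `'items'` fails. A minimal defensive sketch (a hypothetical helper, not code from the repository's spider) that normalizes both shapes before the `len(data['items'])` check:

```python
def extract_items(data):
    """Return the 'items' list from scraped JSON data, tolerating
    both a top-level dict and a top-level list of objects.

    Hypothetical helper: the actual JSON layout allrecipes.com
    returned in 2017 is assumed, not verified.
    """
    if isinstance(data, list):
        # Some responses appear to be a bare list of objects; pick
        # the first one that actually carries an 'items' key.
        for entry in data:
            if isinstance(entry, dict) and 'items' in entry:
                return entry['items']
        return []
    return data.get('items', [])
```

With this in place, `parse_item` could call `items = extract_items(data)` and test `len(items) == 0` without crashing on list-shaped responses.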
schnapi commented 7 years ago

2017-10-27 13:36:31 [scrapy.extensions.logstats] INFO: Crawled 382 pages (at 86 pages/min), scraped 0 items (at 0 items/min)
2017-10-27 13:36:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=33> (referer: None)
2017-10-27 13:36:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=34> (referer: None)
2017-10-27 13:36:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=35> (referer: None)
2017-10-27 13:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=36> (referer: None)
2017-10-27 13:36:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=37> (referer: None)
2017-10-27 13:36:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=38> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=40> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=39> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=41> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=42> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=43> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=45> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=44> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=48> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=46> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=47> (referer: None)
2017-10-27 13:36:58 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 31, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str
schnapi commented 7 years ago

Do you still have the allrecipes file? Also, the allrecipes website has blocked my IP. Do you have any suggestions for handling this problem? Thank you!

jaanli commented 7 years ago

Thanks @schnapi -- cc'ing @brandonmburroughs here too in case he's interested (he wrote a great scraper for it).

Let me know if the allrecipes file here works for you:

https://github.com/altosaar/food2vec/tree/master/dat

There are also preprocessing scripts here: https://github.com/altosaar/food2vec/blob/master/src/process_scraped_data.py

aayushworkiitr commented 6 years ago

Facing a similar issue here. I wrote a scraper for allrecipes, and initially I got data from the website, but they have probably blacklisted my IP. Does anyone know a good workaround?
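One common mitigation for IP blocks like this is simply crawling more slowly. A sketch of polite throttling settings for a Scrapy project's `settings.py` (the option names are standard Scrapy settings; the specific values are assumptions, not tuned against allrecipes.com):

```python
# Throttle the crawl so the target site is less likely to
# rate-limit or ban the client IP.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,                 # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,   # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,          # back off when responses slow down
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "RETRY_HTTP_CODES": [429, 503],        # retry rate-limit/unavailable responses
}
```

In a real project these would be module-level names in `settings.py` (or passed via a spider's `custom_settings` attribute) rather than a dict; at 86 pages/min, as in the log above, a ban is unsurprising, and a 2-second delay cuts that to roughly 30 pages/min.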