prncc / steam-scraper

A pair of spiders for scraping product data and reviews from Steam.
https://intoli.com/blog/steam-scraper/

Question about stopping condition for reviews #5

Closed: zhenzuo2 closed this issue 6 years ago

zhenzuo2 commented 6 years ago

Thank you so much for sharing this.

I am confused about when this scraper should stop scraping. Some products have millions of reviews, and sometimes I need to run the code several times before I get all of them, because the scraper may stop for some reason before it has scraped every review. I am not sure why this happens; maybe a network issue? Currently, I get the full set of reviews only about once every three or four runs.

Another question: is it possible to scrape only the new reviews, building on the current reviews.jl? New reviews keep arriving in real time. Or do I have to scrape the whole thing from the beginning every time I want to update the database?

Thanks.

zhenzuo2 commented 6 years ago

I added some logs:

{'downloader/request_count': 99063,
 'downloader/request_method_count/GET': 99063,
 'downloader/response_bytes': 487440227,
 'downloader/response_count': 99063,
 'downloader/response_status_count/200': 99039,
 'downloader/response_status_count/500': 3,
 'downloader/response_status_count/503': 21,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 9, 21, 56, 24, 552460),
 'httpcache/firsthand': 99047,
 'httpcache/hit': 16,
 'httpcache/miss': 99047,
 'httpcache/store': 99047,
 'httperror/response_ignored_count': 8,
 'httperror/response_ignored_status_count/500': 1,
 'httperror/response_ignored_status_count/503': 7,
 'item_scraped_count': 988580,
 'log_count/DEBUG': 1087653,
 'log_count/INFO': 303,
 'memusage/max': 85041152,
 'memusage/startup': 52342784,
 'request_depth_max': 5685,
 'response_received_count': 99047,
 'retry/count': 16,
 'retry/max_reached': 8,
 'retry/reason_count/500 Internal Server Error': 2,
 'retry/reason_count/503 Service Unavailable': 14,
 'scheduler/dequeued': 99062,
 'scheduler/dequeued/disk': 99062,
 'scheduler/enqueued': 99062,
 'scheduler/enqueued/disk': 99062,
 'start_time': datetime.datetime(2018, 2, 9, 17, 9, 15, 689499)}
2018-02-09 21:56:24 [scrapy.core.engine] INFO: Spider closed (finished)

prncc commented 6 years ago

Reviews are scraped by looking for a hidden form in a page of reviews and then submitting the form to get the next page. If your scraping job stops before collecting all available reviews, it's likely because Steam doesn't serve one of the requested review pages mid-way through. Perhaps that's when the 503 errors arise?
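
To illustrate why one missing page ends the whole crawl, here is a minimal standard-library sketch of form-based pagination. It is not the spider's actual code (that uses Scrapy). The URL `https://example.com/reviews` and the field names `cursor` and `p` are made up for the example. The point is only that a response without the form yields no next request, so the chain of pages stops there.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFormParser(HTMLParser):
    """Collect the action URL and input fields of the first <form> in a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def next_page_url(html):
    """Return the URL of the next page, or None when the page has no form."""
    parser = HiddenFormParser()
    parser.feed(html)
    if parser.action is None:
        return None  # no form found: nothing left to request, so the crawl ends
    return parser.action + "?" + urlencode(parser.fields)


# Hypothetical review page: the form and its field names are assumptions.
SAMPLE_PAGE = """
<div class="review">Great game</div>
<form action="https://example.com/reviews" method="GET">
  <input type="hidden" name="cursor" value="abc123">
  <input type="hidden" name="p" value="2">
</form>
"""

# Hypothetical error page, standing in for a response that carries no form.
ERROR_PAGE = "<html><body>Service Unavailable</body></html>"

print(next_page_url(SAMPLE_PAGE))
print(next_page_url(ERROR_PAGE))
```

Because each page is the only source of the link to the next one, a single failed page leaves the spider with nothing to follow, and it finishes "normally" even though reviews remain.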

Have you tried lowering the scraping rate, or enabling scrapy's RetryMiddleware?
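
For reference, a settings.py fragment along these lines is one way to try both. The setting names are standard Scrapy settings, but the specific values are assumptions to tune, not the project's configuration. Note that the stats above already show 'retry/count': 16 and 'retry/max_reached': 8, which suggests retries were active but gave up on 8 requests, so raising the retry budget may matter more than just enabling the middleware.

```python
# settings.py (sketch): slow the crawl down and retry failed pages more often.
# All values below are illustrative assumptions.

# Fixed minimum delay, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Let AutoThrottle adapt the delay to the server's response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per site

# RetryMiddleware: retry failed responses more times than Scrapy's default of 2.
RETRY_ENABLED = True
RETRY_TIMES = 10
# Status codes to retry, including the 500 and 503 responses seen in the stats.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
```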

prncc commented 6 years ago

I just pushed a commit that removes AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0 from the settings, so the spider now falls back to Scrapy's default value of 1.0. Does that help? You can scrape a single game's reviews by running the spider with something similar to:

scrapy crawl reviews -a steam_id=289070

zhenzuo2 commented 6 years ago

It works. Thanks a lot.