prncc / steam-scraper

A pair of spiders for scraping product data and reviews from Steam.
https://intoli.com/blog/steam-scraper/

Incomplete data in Reviews.jl and occasional ding sound #9

Closed · RamiJabor closed this issue 6 years ago

RamiJabor commented 6 years ago

I'm getting runs of 10-50 rows of incomplete data at seemingly random places throughout the .jl file while checking on it during review scraping. Each run is consistent: a contiguous block of rows, all for a single appid. I also keep getting a Windows 'ding' notification sound every 3-5 minutes, but I'm not sure whether that's connected to the error.

I don't get the problem when review scraping only a couple of games, but crawling many games gives me this, so it might be that I'm getting incomplete responses due to too much fetching. I really don't know.

The missing fields are usually "recommended", "text", and "date".

Examples, with "BAD"/"GOOD" labels added by me:

GOOD{"product_id": "24010", "page": 114, "page_order": 1, "text": "All TS2014 users will receive a small automatic update via Steam Tonight following our change of name from RailSimulator.com to Dovetail Games. The update will replace the previous RailSimulator.com logos and copyright notices on the user interface (UI) and elsewhere within the software with Dovetail Games logos and copyright notices. The End User License Agreement (EULA) wording will also be updated to reflect our change of name from RailSimulator.com to Dovetail Games. We are also taking this opportunity to provide a small fix to prevent the track diagram flickering experienced at certain points on the Holiday Express add-on. This update will not change the functionality or operation of Train Simulator routes, locos, scenarios or any add-on content, and we are not making any other changes to the EULA in this update. We would like to assure all Train Simulator users that this is a very small update which is not intended to change the operation or content of Train Simulator in any way other than the changes described above.", "user_id": 170181655, "early_access": false}
GOOD{"product_id": "24010", "page": 114, "page_order": 2, "text": "The commuter BR111 electric locomotive seen across Germany since the 1970s is now available for Train Simulator, with accompanying DBbzf Control Car and double decker passenger coaches.\nIn the early 1970s, Deutsche Bahn’s demand for electric locomotives for passenger trains saw the development of the BR111, a successor to the Class 110 and built to accommodate faster speeds on passenger services.  A total of 227 models were built in the class between 1970 and 1982, with the first locomotive delivered in December 1974.\nMany of the class were put into service on S-Bahn services, although with ageing locomotive stock serving Intercity routes, some were fitted for Intercity services and operated across Germany. Each of the locomotive’s four axles were fitted with engines, supplying 4,990hp (3,720 kW) and a top speed of 160km/h (99mph).\nThe BR111 often operated in conjunction with a rear control car, giving push-pull capabilities on commuter services due to its ZWS remote control, operated from the locomotive. A third generation DBbzf control car is included with the BR111 for Train Simulator, alongside double-decker DBz and DAbz coaches.\nThe BR111 for Train Simulator is available in two Deutsche Bahn liveries – Orient Red and Traffic Red - and features a DBbzf control car in mint turquoise and traffic red liveries, realistic wheel slip and sanding effects, SIFA driver vigilance device, PZB train protection system and double-decker coaches.\nThe locomotive is also Quick Drive compatible, giving you the freedom to drive the DB BR111 on any Quick Drive enabled route for Train Simulator, such as those available through Steam.\nAlso included are six scenarios for the Hamburg-Hanover route:\nMore scenarios are available on Steam Workshop online and in-game. Train Simulator’s Steam Workshop scenarios are free and easy to download, adding many more hours of exciting gameplay\nKey Features\n•\tBR111 in Deutsche Bahn Traffic Red and Orient Red liveries\n•\tDBbzf Control Car in mint turquoise and traffic red liveries\n•\tPZB and SIFA systems\n•\tDouble-decker passenger coaches\n•\tQuick Drive compatible\n•\tScenarios for the Hamburg-Hanover route\n•\tDownload size: 423mb\nGet it now on Steam - http://store.steampowered.com/app/222598/", "user_id": 170181655, "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 3, "user_id": 170181655, "username": "InterCity560", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 4, "user_id": 170181655, "username": "raphaël-2903", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 5, "user_id": 170181655, "username": "Dash 7 Studios", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 6, "user_id": 170181655, "username": "raphaël-2903", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 7, "user_id": 170181655, "username": "AddictiveBiscuit", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 8, "user_id": 170181655, "username": "Dash 7 Studios", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 9, "user_id": 170181655, "username": "ljbreci", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 10, "user_id": 170181655, "username": "Whitemead", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 11, "user_id": 170181655, "username": "mwaldham131", "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 1, "recommended": false, "date": "2016-04-23", "text": "My Grandfather smoked his whole life. I was about 10 years old when my mother said to him, 'If you ever want to see your grandchildren graduate, you have to stop immediately.'. Tears welled up in his eyes when he realized what exactly was at stake. He gave it up immediately. Three years later he died of lung cancer. It was really sad and destroyed me. My mother said to me- 'Don't ever smoke. Please don't put your family through what your Grandfather put us through.\" I agreed. At 28, I have never touched a cigarette. I must say, I feel a very slight sense of regret for never having done it, because this game gave me cancer anyway.", "hours": 6.7, "user_id": 92964214, "username": "Monkey D. Luffy 💦", "products": 305, "found_funny": 4, "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 12, "user_id": 170181655, "username": "Banter420", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 13, "user_id": 170181655, "username": "Neal08Ni/NealLIVE", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 14, "user_id": 170181655, "username": "StevenJam", "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 3, "recommended": true, "date": "2016-04-17", "text": "one word: epic and tricky", "hours": 0.7, "user_id": 92964214, "username": "babyboi", "products": 32, "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 4, "recommended": true, "date": "2016-04-15", "text": "If you do not like this game..,,you suck. What a hidden gem, I got it on sale and played it a bit like I do most of the games I buy and then go back to them later and play a bit more. But this game is quite addictive, and I really dont see anything wrong with the graphics. Nice enviroments, somewhat ugly animals but I can live with that. The human characters are a bit goofy but there are so many good things about this game that any not up to snuff eye candy is easily overlooked. Very large map and a good bit of content makes this a sleep killer.", "hours": 61.2, "user_id": 92964214, "username": "BrookenG", "products": 177, "early_access": false}

I don't know what the problem could be. I haven't found any similar problems/issues while looking for solutions.

prncc commented 6 years ago

Did you have a chance to check whether the data is actually missing from the HTML for these reviews?

RamiJabor commented 6 years ago

Not sure, but I saw that it was giving incomplete info in the DEBUG output in the console. Where can I see the HTML I get from the responses? I did check these reviews in the browser and in individual scrapy runs.
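One standard way to see exactly what HTML came back is Scrapy's interactive shell (stock Scrapy tooling, independent of this project); for example, with one of the response URLs from the DEBUG log further down this thread:

scrapy shell 'https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1'
>>> response.text    # the raw HTML Scrapy received for this request
>>> view(response)   # open the response in a browser
>>> response.xpath('//form[contains(@id, "MoreContentForm")]').get()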

After 15 hours I had approximately 430 thousand reviews, but after 23 hours only 512 thousand, so it seems like it slowed down the longer it ran.

I closed the spider after 23 hours of scraping:

2018-04-23 15:35 [scrapy.extensions.feedexport] INFO: Stored jl feed (512038 items) in: testreviews.jl
2018-04-23 15:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 77937165,
 'downloader/request_count': 59933,
 'downloader/request_method_count/GET': 59933,
 'downloader/response_bytes': 246292536,
 'downloader/response_count': 59933,
 'downloader/response_status_count/200': 59454,
 'downloader/response_status_count/302': 473,
 'downloader/response_status_count/504': 6,
 'dupefilter/filtered': 101,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 4, 23, 13, 35, 20, 146443),
 'httpcache/firsthand': 59927,
 'httpcache/hit': 6,
 'httpcache/miss': 59927,
 'httpcache/store': 59454,
 'httpcache/uncacheable': 473,
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/504': 2,
 'item_scraped_count': 512038,
 'log_count/DEBUG': 574891,
 'log_count/ERROR': 2,
 'log_count/INFO': 1265,
 'request_depth_max': 2749,
 'response_received_count': 59456,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/504 Gateway Time-out': 4,
 'scheduler/dequeued': 59930,
 'scheduler/dequeued/disk': 59930,
 'scheduler/enqueued': 59948,
 'scheduler/enqueued/disk': 59948,
 'spider_exceptions/IndexError': 2,
 'start_time': datetime.datetime(2018, 4, 22, 16, 40, 42, 767437)}
2018-04-23 15:35:20 [scrapy.core.engine] INFO: Spider closed (shutdown)

257,513 rows were successful; the rest failed, missing "recommended". So approximately half of the rows failed, with fewer failed rows at the beginning of the file and more later on.
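For reference, that split can be reproduced with a few lines of Python over the feed file (a minimal sketch; the missing "recommended" key is the failure criterion described above):

import json

# Count complete vs. incomplete rows in the JSON-lines feed.
good = bad = 0
with open('testreviews.jl') as f:
    for line in f:
        if 'recommended' in json.loads(line):
            good += 1
        else:
            bad += 1
print(good, 'complete rows,', bad, 'incomplete rows')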

I tried changing USER_AGENT in the settings to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36' after seeing it here: https://stackoverflow.com/questions/33851754/scrapy-misses-some-html-elements
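For reference, that override goes into the Scrapy project's settings.py (USER_AGENT is a standard Scrapy setting; exactly where this project defines its defaults is an assumption):

# settings.py
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/46.0.2490.80 Safari/537.36')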

I started getting these error messages instead, and then ONLY reviews with missing info, so only failed responses.

2018-04-23 16:09:01 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1243, in run
    self.mainLoop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 122, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 149, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 71, in next_request
    request = self._dqpop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 106, in _dqpop
    d = self.dqs.pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\pqueue.py", line 43, in pop
    m = q.pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\squeues.py", line 19, in pop
    s = super(SerializableQueue, self).pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\queue.py", line 162, in pop
    self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
builtins.OSError: [Errno 22] Invalid argument

I changed it back to 'Steam Scraper' and tested with test_urls.txt, and now I'm getting the same error above over and over again until the scrape aborts. Still testing around; I might reinstall the env and test again.

RamiJabor commented 6 years ago

It seems like some file has been corrupted. I found a similar problem in https://github.com/scrapy/scrapy/issues/845. I'm going to try restoring/resetting the whole project.
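The scheduler/dequeued/disk stats above suggest the crawl was started with a JOBDIR, and the traceback fails while popping the on-disk request queue, which matches a corrupted job directory. Rather than resetting the whole project, it may be enough to delete that directory so Scrapy rebuilds the queue on the next run (the path below is hypothetical; use whatever JOBDIR the crawl was actually started with):

rm -rf output/reviews_job_dir   # hypothetical JOBDIR path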

RamiJabor commented 6 years ago

OK, so it seems there's no problem with the actual scraper; it's more a problem with the HTML pages. I managed to capture an HTML page for the missing/damaged (let's just call them BAD) reviews during a console/debug run. The reason I'm getting multiple BAD reviews in a row is that they all originate from the same HTML page, just with different page orders. Example:

2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 5,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Chaoz Designz'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 6,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Aynat | twitch.tv/aynatanya <3'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 7,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Evilagician'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 8,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Evilagician'}

It goes on to page_order 25, where it stops; other examples vary in how many page_orders they run for. I haven't found out why yet. I checked the HTML, and sure enough it's missing all the important review info because it's not a review page at all. It's a news/announcement page or something similar, with no reviews, but maybe containing users that own the game or have commented on something, so the scraper picks up names, user ids, and early_access from the HTML, plus all the other info it gets without scraping the HTML.

So the problem is that the scraper is scraping non-review pages.

I don't know why yet and have no idea how to solve it. Any tips/thoughts?

EDIT: I noticed that it's always links starting with https://steamcommunity.com/app/304050/homecontent/?announcementsoffset when it's supposed to be https://steamcommunity.com/app/304050/homecontent/?userreviewsoffset

EDIT2: PS: I made the scraper faster by changing all the URLs in urls.txt from HTTP to HTTPS, since every link was being redirected. I will look into implementing this in the code later.
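One way to do that rewrite in code instead of editing urls.txt by hand is to normalize the scheme when the start requests are generated. A minimal sketch, assuming the spider reads its start URLs from a file via a url_file argument (that attribute name is hypothetical, not necessarily what review_spider.py uses):

import scrapy

class ReviewSpider(scrapy.Spider):
    name = 'reviews'

    def start_requests(self):
        # self.url_file is hypothetical; the real spider may read URLs differently.
        with open(self.url_file) as f:
            for line in f:
                url = line.strip()
                if url.startswith('http://'):
                    # Rewrite to HTTPS up front so each request avoids a 302 redirect.
                    url = 'https://' + url[len('http://'):]
                yield scrapy.Request(url, callback=self.parse)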

EDIT3: I'm guessing the bug lies in review_spider.py, somewhere between lines 104 and 131. Specifically line 114: form = response.xpath('//form[contains(@id, "MoreContentForm")]')

EDIT4: Line 114, form = response.xpath('//form[contains(@id, "MoreContentForm")]'), extracts the following:

<form method="GET" id="MoreContentForm1" name="MoreContentForm1" action="https://steamcommunity.com/app/9010/homecontent/">
<input type="hidden" name="userreviewsoffset" value="10"><input type="hidden" name="p" value="2"><input type="hidden" name="workshopitemspage" value="2"><input type="hidden" name="readytouseitemspage" value="2"><input type="hidden" name="mtxitemspage" value="2"><input type="hidden" name="itemspage" value="2"><input type="hidden" name="screenshotspage" value="2"><input type="hidden" name="videospage" value="2"><input type="hidden" name="artpage" value="2"><input type="hidden" name="allguidepage" value="2"><input type="hidden" name="webguidepage" value="2"><input type="hidden" name="integratedguidepage" value="2"><input type="hidden" name="discussionspage" value="2"><input type="hidden" name="numperpage" value="10"><input type="hidden" name="browsefilter" value="mostrecent">

Here name="userreviewsoffset" is sometimes "announcementsoffset" instead. Does that mean there are no more reviews? Is there a way to skip announcement/news pages and make sure it's a "userreviewsoffset" page? Is there a way to ignore these pages and just move on to the next page if there are still reviews to be scraped? Questions! I will keep looking to see if I can fix this.
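One possible guard is to follow MoreContentForm only when it actually paginates reviews, i.e. when it carries a userreviewsoffset input, and to stop when it carries announcementsoffset instead. A minimal sketch, assuming a callback shaped roughly like the code around line 114 (the method name and surrounding logic are assumptions, not the repository's actual implementation):

import scrapy

# Method on the review spider.
def parse(self, response):
    form = response.xpath('//form[contains(@id, "MoreContentForm")]')
    # Announcement pages expose an "announcementsoffset" input instead of
    # "userreviewsoffset"; only paginate when the form really covers reviews.
    if form.xpath('.//input[@name="userreviewsoffset"]'):
        yield scrapy.FormRequest.from_response(
            response,
            formxpath='//form[contains(@id, "MoreContentForm")]',
            callback=self.parse,
        )
    # Otherwise this is a news/announcement page: drop it and stop paginating.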

prncc commented 6 years ago

@RamiJabor I just successfully scraped all of the reviews for the game you are getting bad data from with the following command:

scrapy crawl reviews -a steam_id=304050 -o 304050.jl --loglevel INFO

I was getting about 60-65 pages per minute (~600-700 reviews per minute), which is roughly in line with what you reported for your first 15 hours.

It looks like you've either hit Steam's servers too much, encountered a glitch of some kind, or the behavior you're reporting is not present in the US store.

RamiJabor commented 6 years ago

Yes, I did the same. I tested the games I was getting bad reviews from individually and got good results. I tested increasing the number of URLs in the urls.txt file, and at 9 or 10 URLs it starts to give bad data from announcement pages. I've tried setting CONCURRENT_REQUESTS=1; still the same. I don't think it has to do with the Steam servers, but frankly I have no idea what it could be anymore. I would appreciate it if anyone could try scraping from a text file with over 15 URLs, so I know whether I'm the only one having this problem.

prncc commented 6 years ago

@RamiJabor Since this issue isn't related to the scraper code, I'm going to close it. In the meantime, consider slowing down the scraper from the IP you're using.
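For slowing down, the standard Scrapy throttling settings in settings.py are usually enough (these are all stock Scrapy settings; the values are illustrative, not this project's defaults):

# settings.py
DOWNLOAD_DELAY = 1.0                 # minimum delay between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down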