RamiJabor closed this issue 6 years ago
Did you have a chance to check if these reviews definitely have the missing data in the HTML?
Not sure, but I saw that it was giving incomplete info in DEBUG in the console. Where can I see the HTML I get from the responses? I did check these reviews in the browser and in individual Scrapy runs.
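(For anyone wondering where to see the received HTML: `scrapy shell <url>` followed by `view(response)` opens the fetched page in a browser. To capture bodies during a full run, a small hypothetical helper that dumps each response to disk could look like this; all names here are illustrative, not part of the project:)

```python
from pathlib import Path

def save_response_body(url: str, body: bytes, out_dir: str = "debug_html") -> str:
    """Write a response body to disk and return the file path, so the
    HTML actually received can be inspected after the run."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Build a filesystem-safe file name from the URL.
    safe_name = "".join(c if c.isalnum() else "_" for c in url)[:100]
    path = Path(out_dir) / (safe_name + ".html")
    path.write_bytes(body)
    return str(path)
```

Calling this from the spider's parse method (e.g. `save_response_body(response.url, response.body)`) would leave one file per page for offline inspection.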
After 15 hours I had approx. 430 thousand reviews, but after 23 hours only 512 thousand, so it seems to slow down the longer it runs.
Closed the spider after 23 hours of scraping
2018-04-23 15:35 [scrapy.extensions.feedexport] INFO: Stored jl feed (512038 items) in: testreviews.jl
2018-04-23 15:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 77937165,
'downloader/request_count': 59933,
'downloader/request_method_count/GET': 59933,
'downloader/response_bytes': 246292536,
'downloader/response_count': 59933,
'downloader/response_status_count/200': 59454,
'downloader/response_status_count/302': 473,
'downloader/response_status_count/504': 6,
'dupefilter/filtered': 101,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2018, 4, 23, 13, 35, 20, 146443),
'httpcache/firsthand': 59927,
'httpcache/hit': 6,
'httpcache/miss': 59927,
'httpcache/store': 59454,
'httpcache/uncacheable': 473,
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/504': 2,
'item_scraped_count': 512038,
'log_count/DEBUG': 574891,
'log_count/ERROR': 2,
'log_count/INFO': 1265,
'request_depth_max': 2749,
'response_received_count': 59456,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/504 Gateway Time-out': 4,
'scheduler/dequeued': 59930,
'scheduler/dequeued/disk': 59930,
'scheduler/enqueued': 59948,
'scheduler/enqueued/disk': 59948,
'spider_exceptions/IndexError': 2,
'start_time': datetime.datetime(2018, 4, 22, 16, 40, 42, 767437)}
2018-04-23 15:35:20 [scrapy.core.engine] INFO: Spider closed (shutdown)
257,513 rows successful; the rest failed, missing "recommended". So approximately half of the rows failed, with fewer failed rows at the beginning and more later in the file.
Tried changing USER_AGENT in settings to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36' after seeing it suggested here: https://stackoverflow.com/questions/33851754/scrapy-misses-some-html-elements
Started getting these error messages instead, and then ONLY reviews with missing info, i.e. only failed responses.
2018-04-23 16:09:01 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
self.crawler_process.start()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 122, in _next_request
if not self._next_request_from_scheduler(spider):
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 149, in _next_request_from_scheduler
request = slot.scheduler.next_request()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 71, in next_request
request = self._dqpop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 106, in _dqpop
d = self.dqs.pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\pqueue.py", line 43, in pop
m = q.pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\squeues.py", line 19, in pop
s = super(SerializableQueue, self).pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\queue.py", line 162, in pop
self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
builtins.OSError: [Errno 22] Invalid argument
Changed it back to 'Steam Scraper' and tested with test_urls.txt, and now I'm getting the same error above over and over until the scrape aborts. Still testing around; I might reinstall the env and try again.
Seems like some file has been corrupted. Found a similar problem in https://github.com/scrapy/scrapy/issues/845. Going to try restoring/resetting the whole project.
OK, so it seems there's no problem with the actual scraper; it's more a problem with the HTML pages. Managed to capture the HTML for the missing/damaged (let's just call them BAD) reviews during a console/debug run. The reason I'm getting multiple BAD reviews in a row is that they all originate from the same HTML page; they just get a different page order. Example:
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 5,
'product_id': '304050',
'user_id': 123377508,
'username': 'Chaoz Designz'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 6,
'product_id': '304050',
'user_id': 123377508,
'username': 'Aynat | twitch.tv/aynatanya <3'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 7,
'product_id': '304050',
'user_id': 123377508,
'username': 'Evilagician'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 8,
'product_id': '304050',
'user_id': 123377508,
'username': 'Evilagician'}
And it goes on to page_order 25, where it stops. Other examples vary in how many page_orders they run for; haven't found out why yet. I checked the HTML and, sure enough, it's missing all the important review info because it's not a review page at all. It's some news/announcement page or something similar, without any reviews, but it contains users who maybe own the game or have commented something, so the scraper picks up names, user ids, and early_access from the HTML, plus all the other info it gets without HTML scraping.
So the problem is that the scraper is scraping non-review pages. I don't know why yet and have no idea how to solve it. Any tips/thoughts?
EDIT: Noticed that it's always links starting with
https://steamcommunity.com/app/304050/homecontent/?announcementsoffset
when they are supposed to start with
https://steamcommunity.com/app/304050/homecontent/?userreviewsoffset
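Given that pattern, one option would be to filter responses by their query string before parsing them as review pages. A minimal sketch (assumption: the presence of announcementsoffset in the URL reliably marks a news/announcement page; this helper is hypothetical, not existing project code):

```python
from urllib.parse import urlparse, parse_qs

def is_review_page(url: str) -> bool:
    """Heuristic: treat a homecontent URL as a review page only if its
    query string carries userreviewsoffset and no announcementsoffset."""
    params = parse_qs(urlparse(url).query)
    return "userreviewsoffset" in params and "announcementsoffset" not in params
```

In the spider's parse callback, a response failing this check could simply be skipped instead of having review fields extracted from it.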
EDIT 2 (PS): Made the scraper faster by changing all the URLs in urls.txt from HTTP to HTTPS, since it was redirecting for every link. Will look into implementing this in the code later.
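The HTTP-to-HTTPS rewrite described above can be done once over the urls.txt lines; a minimal sketch (the file name comes from this thread, the function name is hypothetical):

```python
def upgrade_urls(lines):
    """Return the given URL lines with a leading http:// upgraded to
    https://, avoiding one 302 redirect per request."""
    return [
        line.replace("http://", "https://", 1) if line.startswith("http://") else line
        for line in lines
    ]
```

Applied to the file, this would be read-lines, `upgrade_urls`, write-lines; already-https lines pass through unchanged.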
EDIT 3: I'm guessing the bug/problem lies in review_spider.py, somewhere between lines 104 and 131; specifically line 114: form = response.xpath('//form[contains(@id, "MoreContentForm")]')
EDIT 4: Line 114, form = response.xpath('//form[contains(@id, "MoreContentForm")]'), extracts the following:
<form method="GET" id="MoreContentForm1" name="MoreContentForm1" action="https://steamcommunity.com/app/9010/homecontent/">
<input type="hidden" name="userreviewsoffset" value="10"><input type="hidden" name="p" value="2"><input type="hidden" name="workshopitemspage" value="2"><input type="hidden" name="readytouseitemspage" value="2"><input type="hidden" name="mtxitemspage" value="2"><input type="hidden" name="itemspage" value="2"><input type="hidden" name="screenshotspage" value="2"><input type="hidden" name="videospage" value="2"><input type="hidden" name="artpage" value="2"><input type="hidden" name="allguidepage" value="2"><input type="hidden" name="webguidepage" value="2"><input type="hidden" name="integratedguidepage" value="2"><input type="hidden" name="discussionspage" value="2"><input type="hidden" name="numperpage" value="10"><input type="hidden" name="browsefilter" value="mostrecent">
Where name="userreviewsoffset" is sometimes "announcementsoffset". Does that mean there are no more reviews? Is there a way to skip announcement/news pages and make sure it's a "userreviewsoffset" page, i.e. to ignore these pages and just move on to the next page if there are still reviews to be scraped? Questions! I'll keep looking to see if I can fix this.
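One possible guard, based on the form HTML pasted above: collect the hidden input names from the MoreContentForm and only follow it when it actually paginates reviews. This is a hypothetical addition, not existing code from review_spider.py:

```python
def form_paginates_reviews(input_names):
    """A MoreContentForm leads to more reviews only when its hidden inputs
    include userreviewsoffset; announcementsoffset marks a news page."""
    names = set(input_names)
    return "userreviewsoffset" in names and "announcementsoffset" not in names
```

In the spider this might be fed by something like `names = form.xpath('.//input/@name').extract()`, skipping the follow-up request whenever the check fails.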
@RamiJabor I just successfully scraped all of the reviews for the game you are getting bad data from with the following command:
scrapy crawl reviews -a steam_id=304050 -o 304050.jl --loglevel INFO
I was getting about 60-65 pages per minute (~600-700 reviews per minute), which is roughly in line with what you reported in your first 15 hours.
It looks like you've either hit Steam's server too much, have encountered a glitch of some kind, or the behavior you're reporting is not present in the US store.
Yes, I did the same. Tested the games I was getting bad reviews from individually and got good results. I tested increasing the number of URLs in the urls.txt file, and at 9 or 10 URLs it starts to return bad data from announcement pages. I've tried setting CONCURRENT_REQUESTS=1; still the same. I don't think it has to do with the Steam server, but frankly I have no idea what it could be anymore. I'd appreciate it if anyone could try scraping from a text file with over 15 URLs so I know whether I'm the only one having this problem.
@RamiJabor Since this issue isn't related to the scraper code, I'm going to close it. In the meantime, consider slowing down the scraper from the IP you're using.
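For reference, slowing down a Scrapy crawl is usually done in settings.py with the standard throttling settings (the values below are illustrative guesses, not the project's actual configuration):

```python
# Hypothetical additions to settings.py to reduce request rate per IP.
DOWNLOAD_DELAY = 1.0                # minimum seconds between requests to a domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1.0      # initial delay before feedback kicks in
AUTOTHROTTLE_MAX_DELAY = 30.0       # upper bound when the server is slow
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # fewer simultaneous requests per domain
```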
I'm getting multiple rows (10-50) of incomplete data at seemingly random places in the .jl file while checking on it during review scraping. It seems to be consistent over a single stretch of rows, with a single appid in that stretch. I also keep getting a Windows 'ding' notification sound every 3-5 minutes, but I'm not sure whether that's connected to the error.
I don't get the problem when review-scraping only a couple of games, but crawling many gives me this, so it might be that I'm getting incomplete responses due to too much fetching. Really don't know.
Missing data is usually "recommended", "text", and "date".
Examples with added "BAD", "GOOD":
I don't know what the problem could be. I haven't found any similar problems/issues while looking for solutions.