RamiJabor opened 6 years ago
I think there's another issue with my products_all.jl, since it's missing n_reviews too, which is preventing me from moving on to review scraping. The example output doesn't match what I get in my products_all.jl:
{"url": "http://store.steampowered.com/app/800200/Witching_Tower_VR/", "reviews_url": "http://steamcommunity.com/app/800200/reviews/?browsefilter=mostrecent&p=1", "id": "800200", "title": "Witching Tower VR", "genres": ["Action", "Adventure", "Indie"], "developer": "Daily Magic Productions", "publisher": "Daily Magic Productions", "release_date": "Summer 2018", "app_name": "Witching Tower VR", "specs": ["Single-player"], "tags": ["Action", "Adventure", "Indie", "VR", "Violent", "Puzzle", "Atmospheric"], "early_access": false}
I'm getting this error when trying to run split_review_urls.py:
(env) C:\Users\Rami\steam-scraper\scripts>py split_review_urls.py --scraped-products C:\Users\Rami\steam-scraper\output/products_all.jl --output-dir C:\Users\Rami\steam-scraper\output
Traceback (most recent call last):
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "split_review_urls.py", line 71, in <module>
main()
File "split_review_urls.py", line 48, in main
blx_has_reviews = df['n_reviews'] > 0
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\internals.py", line 3843, in get
loc = self.items.get_loc(item)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'
I think something might be going wrong at line 63 in ProductSpider (product_spider.py) and at line 100 in items.py, since n_reviews isn't being created at all.
I haven't tested yet, but it's probably because 'reviews' does not appear in the text of the product's page. So you can change this line:
https://github.com/prncc/steam-scraper/blob/master/steam/spiders/product_spider.py#L63
so it reads
loader.add_css('n_reviews', '.responsive_hidden', re=r'\(([\d,]+)\)')
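To sanity-check that selector change offline, the regex can be tried against a snippet of markup on its own. The HTML below is a simplified stand-in for Steam's review-summary element, not the actual page source:

```python
import re

# Simplified stand-in for Steam's review-count markup (assumed shape).
html = '<span class="responsive_hidden">(1,234)</span>'

matches = re.findall(r'\(([\d,]+)\)', html)
print(matches)  # ['1,234']

# Strip the thousands separator before converting to an int.
n_reviews = int(matches[0].replace(',', ''))
print(n_reviews)  # 1234
```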
I made that change after seeing that Steam's HTML had removed "reviews" from the review-count field. It still didn't work. I tested it in the shell and it works there, so I don't think the problem is in the selector. I even tried it with this:
n_reviews = response.css('.responsive_hidden').re('\(([\d,]+)\)')
n_reviews = [int(r.replace(',', '')) for r in n_reviews]
n_reviews = max(n_reviews)
loader.add_value('n_reviews', n_reviews)
And I changed items.py line 100 to n_reviews = scrapy.Field(). Still didn't work... I'm really stumped and don't know what to try next. Somehow n_reviews isn't showing up at all. I would suspect a formatting/parsing problem if it showed up with empty/NaN values, but it isn't even created.
Thank you for the response. I'll try poking around some more
So a quick look at, for example, this one: http://store.steampowered.com/app/848270/Sky_Conqueror/
results in this being scraped:
{
"url": "http://store.steampowered.com/app/848270/Sky_Conqueror/",
"reviews_url": "http://steamcommunity.com/app/848270/reviews/?browsefilter=mostrecent&p=1",
"id": "848270",
"title": "Sky Conqueror",
"genres": ["Action", "Adventure", "Casual", "Indie"],
"developer": "Poseidon's kiss",
"publisher": "Poseidon's kiss",
"release_date": "2018-05-03",
"app_name": "Sky Conqueror",
"specs": ["Single-player", "Steam Achievements"],
"tags": ["Casual", "Action", "Adventure", "Indie"],
"early_access": false
}
which doesn't have an n_reviews, because it has "No user reviews". You can either add a catch-all that puts 0 when no entry is found, or modify the split script to not require n_reviews, using get('n_reviews', 0) so it doesn't raise the KeyError.
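The get('n_reviews', 0) idea can be sketched against the .jl format directly. This is a minimal illustration with hypothetical inline rows, assuming each line of products_all.jl is a plain JSON object:

```python
import json

# Hypothetical .jl lines: one product with a review count, one without
# (like the Sky Conqueror example above).
rows = [
    '{"id": "800200", "n_reviews": 12}',
    '{"id": "848270"}',
]

# dict.get supplies a default, so rows missing the n_reviews key
# count as 0 instead of raising a KeyError.
counts = [json.loads(line).get('n_reviews', 0) for line in rows]
print(counts)  # [12, 0]
```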
Interesting. Also found this one where the style is also different:
http://store.steampowered.com/app/831810/Bane_of_Asphodel/
8 user reviews
To be honest, the best thing to do seems to be to modify the code to access the microdata in the HTML:
<meta itemprop="reviewCount" content="3">
But that's beyond my scrapy/XPath skills. You could modify the regex to something like:
\(([\d,]+)\)|([\d,]+)\suser\sreviews
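That alternation can be checked quickly against both review-count styles seen in this thread; whichever alternative matches, the count ends up in one of the two groups and the other is None:

```python
import re

# Matches either "(1,234)" or "8 user reviews" style counts.
pattern = re.compile(r'\(([\d,]+)\)|([\d,]+)\suser\sreviews')

for text in ['(1,234)', '8 user reviews']:
    m = pattern.search(text)
    # Exactly one of the two groups captured, depending on the alternative.
    count = m.group(1) or m.group(2)
    print(count)
```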
Yeah, when there are too few user reviews to set a sentiment, Steam just posts the total number of user reviews instead, so the scraper gets "# user reviews" in the sentiment field.
response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()[0]
I went with this:
n_reviews = response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()
n_reviews = '0' if len(n_reviews) == 0 else n_reviews[0]
loader.add_value('n_reviews', n_reviews)
Not the most elegant, but seems to be working.
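The same microdata lookup can be reproduced outside Scrapy for a quick test. This sketch uses only the standard library's html.parser (an assumption for illustration; the spider itself uses the XPath shown above) and applies the same '0' fallback when the tag is absent:

```python
from html.parser import HTMLParser

class ReviewCountParser(HTMLParser):
    """Collects the content attribute of <meta itemprop="reviewCount"> tags."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'meta' and d.get('itemprop') == 'reviewCount':
            self.counts.append(d.get('content'))

# Simplified stand-in for the product page head (assumed shape).
html = '<head><meta itemprop="reviewCount" content="3"></head>'
parser = ReviewCountParser()
parser.feed(html)

# Same fallback as above: default to '0' when the tag is missing.
n_reviews = parser.counts[0] if parser.counts else '0'
print(n_reviews)  # 3
```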
I'm a bloody fool... I had no idea I have to restart/recreate the virtualenv every time I make a change. None of the changes I was making had any effect because I was still in the same virtualenv. I even deleted the Python files and the scraper was still working the same...
Sorry, I've been on a wild goose chase all day. It works now! Thanks for the help!
I looked into the problem with missing Steam user ids, and it seems to have to do with Steam's profile URLs not being consistent. Some profile URLs have the SteamID64 in them, like https://steamcommunity.com/profiles/76561198384621512/, and some just have the name, so the review spider can't pick up the SteamID64 from the profile URL: https://steamcommunity.com/id/RollyPollyDwarfHeads/
The SteamID64 can't be scraped in these cases, but the SteamID3 is stated in
<div class="apphub_friend_block" data-miniprofile="424355784">
so I'm thinking the SteamID3 might be better to scrape.
Replaced line 28 in review_spider.py with
review.xpath('.//div[@class="apphub_friend_block"]/@data-miniprofile').extract()[0]
(note the leading dot, which keeps the XPath relative to each review block instead of searching the whole page). And it works fine!
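For a quick offline check of that attribute extraction, a regex can stand in for the XPath. The snippet below is an assumed, simplified fragment of the review block markup, not the full community-hub HTML:

```python
import re

# Assumed shape of a single review block on the community hub page.
block = '<div class="apphub_friend_block" data-miniprofile="424355784">'

# Pull the SteamID3 out of the data-miniprofile attribute.
m = re.search(r'data-miniprofile="(\d+)"', block)
print(m.group(1))  # 424355784
```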
Thanks for all the help, jonny, and once again thanks for a great Steam Scraper!
One last extra thing: the urls = shuffle(urls) line in split_review_urls makes urls None, because random.shuffle shuffles in place and returns None. I removed the assignment. I don't see a necessary reason for it to be there.
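This is easy to reproduce: random.shuffle mutates its argument and always returns None, so assigning its result back to the variable throws the list away:

```python
import random

urls = ['u1', 'u2', 'u3']
result = random.shuffle(urls)  # shuffles in place...
print(result)        # None -- the return value is not the shuffled list
print(sorted(urls))  # ['u1', 'u2', 'u3'] -- the list itself keeps all items
```

The fix is to call random.shuffle(urls) on its own line, without the assignment.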
@RamiJabor and @jonnybazookatone If one of you wants to make a PR to integrate some of these fixes, that'd be great. Let me know either way?
I might do it in 1-2 weeks. I'm really busy with thesis work at the moment and hoping I can cram the Steam data into the report.
Also, I have a question about split_review_urls.py. I tried it with "--pieces 1" and it didn't allow it, with
step = int(math.ceil(float(n)/args.pieces))
Is there a particular reason you made it so?
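For what it's worth, the ceil arithmetic quoted above does handle a single piece on its own: with pieces=1 the step equals n, so one slice covers every URL. A sketch of just that chunking step (not the full script, which may fail elsewhere):

```python
import math

n = 10       # number of URLs (hypothetical)
pieces = 1

# Same step computation as in split_review_urls.py.
step = int(math.ceil(float(n) / pieces))
chunks = [list(range(n))[i:i + step] for i in range(0, n, step)]
print(step)         # 10
print(len(chunks))  # 1 -- everything lands in a single chunk
```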
I wanted all the URLs in one file so I could run one long continuous scrape of all the reviews, so I just copy-pasted all of them into one txt file.
Hello! Thanks for this great scraper! I tried it with the test_urls and approximately half of the reviews were missing "user_id" completely. Does this have something to do with Steam, the scraper, or my settings?