prncc / steam-scraper

A pair of spiders for scraping product data and reviews from Steam.
https://intoli.com/blog/steam-scraper/
77 stars · 39 forks

Missing User_ids and n_reviews #8

Open RamiJabor opened 6 years ago

RamiJabor commented 6 years ago

Hello! Thanks for this great scraper! I tried it with the test_urls and approximately half of the reviews were missing "user_id" completely. Is this caused by Steam, the scraper, or my settings?

RamiJabor commented 6 years ago

I think there's another issue with my products_all.jl: it's missing n_reviews too, which is preventing me from moving on to review scraping. The example output doesn't match what I get in my products_all.jl:

{"url": "http://store.steampowered.com/app/800200/Witching_Tower_VR/", "reviews_url": "http://steamcommunity.com/app/800200/reviews/?browsefilter=mostrecent&p=1", "id": "800200", "title": "Witching Tower VR", "genres": ["Action", "Adventure", "Indie"], "developer": "Daily Magic Productions", "publisher": "Daily Magic Productions", "release_date": "Summer 2018", "app_name": "Witching Tower VR", "specs": ["Single-player"], "tags": ["Action", "Adventure", "Indie", "VR", "Violent", "Puzzle", "Atmospheric"], "early_access": false}

RamiJabor commented 6 years ago

I'm getting this error when trying to run split_review_urls.py.

(env) C:\Users\Rami\steam-scraper\scripts>py split_review_urls.py --scraped-products C:\Users\Rami\steam-scraper\output/products_all.jl --output-dir C:\Users\Rami\steam-scraper\output
Traceback (most recent call last):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "split_review_urls.py", line 71, in <module>
    main()
  File "split_review_urls.py", line 48, in main
    blx_has_reviews = df['n_reviews'] > 0
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

I think something might be going wrong at line 63 of ProductSpider in product_spider.py and at line 100 of items.py, since n_reviews isn't being created at all.

jonnybazookatone commented 6 years ago

I haven't tested yet, but it's probably because 'reviews' does not appear in the text of the product's page. So you can change this line:

https://github.com/prncc/steam-scraper/blob/master/steam/spiders/product_spider.py#L63

so it reads

loader.add_css('n_reviews', '.responsive_hidden', re='\(([\d,]+)\)')
RamiJabor commented 6 years ago

I did, after I saw that Steam's HTML had removed "reviews" from the number-of-reviews field. It still didn't work. I tested it out in the shell, where it does work, so I don't think the problem is there. I even tried it with this:

    n_reviews = response.css('.responsive_hidden').re('\(([\d,]+)\)')
    n_reviews = [int(r.replace(',', '')) for r in n_reviews]
    n_reviews = max(n_reviews)
    loader.add_value('n_reviews', n_reviews)

And I changed line 100 of items.py to n_reviews = scrapy.Field(). Still didn't work... I'm really stumped and don't know what to try next. Somehow n_reviews isn't showing up at all. I would suspect a formatting/parsing problem if it showed up with empty/NaN values, but it's not even created.

Thank you for the response. I'll try poking around some more

jonnybazookatone commented 6 years ago

So a quick look at for example this one: http://store.steampowered.com/app/848270/Sky_Conqueror/

results in this being scraped:

{
  "url": "http://store.steampowered.com/app/848270/Sky_Conqueror/", 
  "reviews_url": "http://steamcommunity.com/app/848270/reviews/?browsefilter=mostrecent&p=1", 
  "id": "848270", 
  "title": "Sky Conqueror", 
  "genres": ["Action", "Adventure", "Casual", "Indie"], 
  "developer": "Poseidon's kiss", 
  "publisher": "Poseidon's kiss", 
  "release_date": "2018-05-03", 
  "app_name": "Sky Conqueror", 
  "specs": ["Single-player", "Steam Achievements"], 
  "tags": ["Casual", "Action", "Adventure", "Indie"], 
  "early_access": false
}

which doesn't have an n_reviews field because the page shows "No user reviews". You can either add a catch-all that stores 0 when no entry is found, or modify the split script to not require n_reviews, e.g. by using get('n_reviews', 0), so it doesn't raise the KeyError.
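The get('n_reviews', 0) option can be sketched like this (a minimal, stdlib-only illustration; the record contents and the review count are made up):

```python
import json

# Two made-up product records in the .jl style shown above: one without
# the n_reviews field at all (like Sky Conqueror) and one with a count.
lines = [
    '{"id": "848270", "title": "Sky Conqueror"}',
    '{"id": "800200", "title": "Witching Tower VR", "n_reviews": 3}',
]
products = [json.loads(line) for line in lines]

# get('n_reviews', 0) treats a missing field as zero reviews, so products
# without reviews are simply filtered out instead of raising a KeyError.
with_reviews = [p for p in products if p.get('n_reviews', 0) > 0]
```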

jonnybazookatone commented 6 years ago

Interesting. Also found this one where the style is also different:

http://store.steampowered.com/app/831810/Bane_of_Asphodel/

8 user reviews

jonnybazookatone commented 6 years ago

To be honest, the best approach seems to be modifying the code to read the microdata in the HTML.

<meta itemprop="reviewCount" content="3">

But that's beyond my scrapy/XPath skills. You could modify the regex to something like:

\(([\d,]+)\)|([\d,]+)\suser\sreviews
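A quick sanity check that this combined pattern handles both styles seen on Steam pages (the count helper is just for illustration):

```python
import re

# First alternative matches "(1,234)" from the summary box; the second
# matches "8 user reviews" when there is no sentiment score.
pattern = re.compile(r'\(([\d,]+)\)|([\d,]+)\suser\sreviews')

def count(text):
    """Illustrative helper: extract a review count, or 0 if none is found."""
    m = pattern.search(text)
    if not m:
        return 0
    return int((m.group(1) or m.group(2)).replace(',', ''))
```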
RamiJabor commented 6 years ago

Yeah, when there are too few user reviews for Steam to assign a sentiment, it just shows the total number of user reviews instead, so the scraper gets "# user reviews" in the sentiment field.

jonnybazookatone commented 6 years ago
response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()[0]
jonnybazookatone commented 6 years ago

I went with this:

n_reviews = response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()
n_reviews = '0' if len(n_reviews) == 0 else n_reviews[0]
loader.add_value('n_reviews', n_reviews)

Not the most elegant, but it seems to be working.

RamiJabor commented 6 years ago

I'm a bloody fool... I had no idea I had to recreate the virtualenv every time I make a change. None of the changes I was making had any effect because I was still in the same virtualenv. I even deleted the Python files and the scraper kept working the same...

Sorry, I've been on a wild goose chase all day. It works now! Thanks for the help!

RamiJabor commented 6 years ago

I looked into the problem with the missing Steam user ids, and it seems to be caused by Steam's profile URLs not being consistent. Some profile URLs contain the SteamID64, like https://steamcommunity.com/profiles/76561198384621512/, while others use just a custom name, so the review spider can't pick up the SteamID64 from the profile URL: https://steamcommunity.com/id/RollyPollyDwarfHeads/

The SteamID64 can't be scraped in these cases, but the SteamID3 appears in <div class="apphub_friend_block" data-miniprofile="424355784">, so I'm thinking the SteamID3 might be better to scrape.

I replaced line 28 in review_spider.py with review.xpath('//div[@class="apphub_friend_block"]/@data-miniprofile').extract()[0] and it works fine! Thanks for all the help, jonny, and once again thanks for a great steam-scraper!

One last extra thing: the urls = shuffle(urls) line in split_review_urls makes urls = None, because shuffle() shuffles the list in place and returns None. I removed it; I don't see a necessary reason for it to be there.
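For what it's worth, Python's random.shuffle() works in place and returns None, which is exactly why the reassignment wiped out urls. A small illustration (the URLs are made up):

```python
import random

urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']

# random.shuffle() shuffles in place and returns None, so the assignment
# `urls = shuffle(urls)` throws the list away:
result = random.shuffle(urls)

# To keep a shuffled list, either call shuffle without reassigning...
random.shuffle(urls)
# ...or take a shuffled copy with random.sample():
shuffled = random.sample(urls, k=len(urls))
```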

prncc commented 6 years ago

@RamiJabor and @jonnybazookatone If one of you wants to make a PR to integrate some of these fixes, that'd be great. Let me know either way?

RamiJabor commented 6 years ago

I might do it in 1-2 weeks. I'm really busy with thesis work at the moment and hoping I can cram the Steam data into the report.

Also, I have a question about split_review_urls.py. I tried it with "--pieces 1" and it didn't work, even though step = int(math.ceil(float(n)/args.pieces)) should handle that case. Is there a particular reason you made it that way? I wanted all the URLs in one file so I could run one long continuous scrape of all the reviews, so I just copy-pasted all of them into one txt file.
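For reference, a ceil-based split of this sort should produce a single chunk when pieces is 1, assuming the script slices the URL list in strides of step; a minimal sketch with made-up URLs:

```python
import math

# Made-up URL list; with pieces=1 the stride equals the list length,
# so the slicing loop yields exactly one chunk containing everything.
urls = [f'url_{i}' for i in range(10)]
pieces = 1
step = int(math.ceil(float(len(urls)) / pieces))
chunks = [urls[i:i + step] for i in range(0, len(urls), step)]
```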