prncc / steam-scraper

A pair of spiders for scraping product data and reviews from Steam.
https://intoli.com/blog/steam-scraper/

Decoding unicode escaped characters #6

ArtyCooL closed this issue 6 years ago

ArtyCooL commented 6 years ago

Hi! Thank you for this useful Scrapy spider! I should say up front that I'm a newbie with Python. I have an issue with reviews in other languages (Russian, Chinese, Japanese, etc.). In my output file (reviews.jl) these reviews show up as \u0441\u043b\u0438\u0448\u043a\u043e\u043c and so on (decoded, that reads "слишком"). Is there a workaround for this? Any chance of changing the script so that the review text is exported correctly, without Unicode-escaped characters?

Right now I'm using a Notepad++ plugin called HTMLPad to decode the JS escapes. It works, but it can't decode a large amount of text at once (26000 reviews, for example), so I have to select 100-200 lines at a time and decode them manually, which is a real pain in the ass for 26000 reviews...

prncc commented 6 years ago

Thanks for reporting. This is Scrapy's deliberate default: its JSON feed exports write non-ASCII characters as \uXXXX escape sequences. I pushed a commit that disables that behavior by adding

FEED_EXPORT_ENCODING = 'utf-8'

to settings.py.
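The setting only affects new crawls. For a reviews.jl that was already exported with escapes, re-scraping should not be necessary: the escapes are valid JSON, so parsing each line and re-serializing it is enough. Below is a minimal sketch using only the standard library. The function name `unescape_json_lines` and the output filename are made up for illustration and are not part of this repo.

```python
import json


def unescape_json_lines(src_path, dst_path):
    """Rewrite a JSON Lines file so non-ASCII text is stored as plain UTF-8.

    Hypothetical helper, not part of steam-scraper.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # \uXXXX escapes are decoded here
            # ensure_ascii=False keeps non-ASCII characters as-is on output
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `unescape_json_lines("reviews.jl", "reviews_utf8.jl")` would process the file one line at a time, so 26000 reviews should not be a problem for memory.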