t3nsor / quora-backup

Python scripts to download Quora answers and convert them into a more portable form
GNU General Public License v2.0
125 stars 72 forks

Crawler fails for people with non-ascii names #1

Closed eivindorama closed 9 years ago

eivindorama commented 9 years ago

When trying to run the crawler as directed, I get the following error:

    [DEBUG] Loading input file content.json
    Traceback (most recent call last):
      File "./crawler.py", line 97, in <module>
        answers = json.load(input_file)
      File "/usr/lib/python3.4/json/__init__.py", line 268, in load
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
      File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python3.4/json/decoder.py", line 359, in raw_decode
        obj, end = self.scan_once(s, idx)
    ValueError: Invalid control character at: line 1 column 133 (char 132)

Looking at the actual JSON file, it seems likely to me that the problem is that I'm registered on Quora as "Eivind Kjørstad", and the ø gets encoded in the URLs in a way the crawler does not approve of. Specifically, the start of my JSON file looks like this:

[["https://www.quora.com/What-can-I-do-to-prevent-myself-from-getting-into-a-cycle-of-mediocrity/answer/Eivind-Kj%C3%B8rstad","Added 1h ago"],["https://www.quora.com/Laws-in-India/If-I-get-a-signed-document-from-my-wife-and-her-parents-saying-that-No-dowry-has-been-given-to-the-groom-and-his-family-in-this-marriage-and-other-necessary-details-can-my-wife-or-her-family-members-still-file-a-case-against-me-under-the-Dowry-Act/answer/Eivind-Kj%C3%B8rstad","Added 2h ago"],["https://www.quora.com/Why-do-so-many-men-on-Quora-use-the-word-females-to-refer-to-human-women/answer/Eivind-Kj%C3%B8rstad","Added Wed"]

At a guess, the crawler disapproves of the "%C3%B8" part in my name. If I run sed on the JSON to strip out those characters, the crawler runs, but of course the URLs it then tries to fetch are incorrect and nothing is fetched.

t3nsor commented 9 years ago

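Nope, it looks like the problem is that your text editor has word wrap turned on, which inserts newlines in the middle of the JSON string literals. A raw newline is a control character, and JSON does not allow one inside a string, hence the error. Turn off word wrap and it should work.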
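
A minimal sketch of that diagnosis, for anyone who hits the same error. It reuses the first entry of the excerpt quoted above. The helper `try_parse` and the simulated wrap point are assumptions made for illustration, not part of crawler.py. It checks two things: the percent-encoded name parses fine, and a raw newline inside a string literal does not.

```python
import json
from urllib.parse import unquote

# First entry of the content.json excerpt quoted in this issue,
# closed with "]]" so that it is a complete JSON document.
FIRST_ENTRY = (
    '[["https://www.quora.com/What-can-I-do-to-prevent-myself-from-getting-'
    'into-a-cycle-of-mediocrity/answer/Eivind-Kj%C3%B8rstad","Added 1h ago"]]'
)


def try_parse(text):
    """Hypothetical helper: return (parsed, None) or (None, error)."""
    try:
        return json.loads(text), None
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        return None, err


# 1. "%C3%B8" is plain ASCII inside the string, so the intact text parses.
answers, error = try_parse(FIRST_ENTRY)
decoded_name = unquote(answers[0][0].rsplit("/", 1)[-1])

# 2. Simulate an editor hard-wrapping the line at a space (assumed wrap
#    point): a raw newline now sits inside the "Added 1h ago" literal.
wrapped = FIRST_ENTRY.replace("Added 1h", "Added\n1h")
newline_offset = wrapped.index("\n")
_, wrapped_error = try_parse(wrapped)

print("intact file parses:", error is None)
print("decoded name:", decoded_name)
print("newline offset:", newline_offset)
print("wrapped file error:", wrapped_error)
```

In this reconstruction the newline falls inside the "Added 1h ago" string, not inside the URL, which would fit the word-wrap explanation. The exact wrap point in the original file is not known, so treat the offset as illustrative.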

xorshed738 commented 1 year ago

Hi