When trying to run the crawler as directed, I get the following error:
[DEBUG] Loading input file content.json
Traceback (most recent call last):
  File "./crawler.py", line 97, in <module>
    answers = json.load(input_file)
  File "/usr/lib/python3.4/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.4/json/decoder.py", line 359, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 133 (char 132)
Looking at the actual JSON file, it seems likely to me that the problem is that I'm registered on Quora as "Eivind Kjørstad", and the ø gets encoded in the URLs in a way the crawler does not approve of. Specifically, the start of my JSON file looks like this:
[["https://www.quora.com/What-can-I-do-to-prevent-myself-from-getting-into-a-cycle-of-mediocrity/answer/Eivind-Kj%C3%B8rstad","Added 1h ago"],["https://www.quora.com/Laws-in-India/If-I-get-a-signed-document-from-my-wife-and-her-parents-saying-that-No-dowry-has-been-given-to-the-groom-and-his-family-in-this-marriage-and-other-necessary-details-can-my-wife-or-her-family-members-still-file-a-case-against-me-under-the-Dowry-Act/answer/Eivind-Kj%C3%B8rstad","Added 2h ago"],["https://www.quora.com/Why-do-so-many-men-on-Quora-use-the-word-females-to-refer-to-human-women/answer/Eivind-Kj%C3%B8rstad","Added Wed"]
At a guess, the crawler disapproves of the "%C3%B8" part in my name. If I run sed on the JSON to take out those characters, the crawler runs, but of course the URLs it then tries to fetch are incorrect and nothing is fetched.
Nope, it looks like the problem is that your text editor has word wrap turned on, which inserts newlines in the middle of the JSON string literals. Turn off word wrap and it should work.
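The word-wrap explanation can be checked in isolation. The following is a minimal sketch, not the crawler's own code: the one-entry sample and the helper `strip_line_breaks` are made up for illustration. It suggests that a percent-encoded URL parses without trouble, that a raw newline inside a string literal produces the same kind of `ValueError`, and that removing the line breaks makes the data loadable again.

```python
import json


def strip_line_breaks(text):
    """Remove raw CR/LF characters (hypothetical helper, not part of the crawler).

    Line breaks between JSON tokens are insignificant whitespace, and raw
    line breaks inside string literals are invalid, so dropping all of them
    does not change a file that was valid before it was wrapped.
    """
    return text.replace("\r", "").replace("\n", "")


# Hypothetical one-entry sample in the same shape as content.json.
url = ("https://www.quora.com/Why-do-so-many-men-on-Quora-use-the-word-"
       "females-to-refer-to-human-women/answer/Eivind-Kj%C3%B8rstad")
original = json.dumps([[url, "Added Wed"]])

# Percent-encoded characters such as %C3%B8 are plain ASCII and parse fine.
parsed = json.loads(original)

# Simulate a hard wrap: a raw newline lands inside the URL string literal.
cut = original.index("females")
wrapped = original[:cut] + "\n" + original[cut:]

try:
    json.loads(wrapped)
    message = None
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    message = str(exc)

print(message)

# Stripping the line breaks restores the original data.
repaired = json.loads(strip_line_breaks(wrapped))
print(repaired == parsed)
```

To repair a real file, the same helper could be applied to the text read from content.json before it is passed to `json.loads`. This assumes the editor only inserted line breaks. If it replaced spaces with line breaks instead, labels such as "Added 1h ago" would lose a space, although the URLs would be unaffected since they contain no spaces.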