Open: iancoleman opened this issue 7 years ago
Another error to catch when parsing yearlySummary, as shown in this traceback:
Removing outdated yearly summary: /home/user/cia_data/country_html/yearly_summaries/https%3A%2F%2Fweb.archive.org%2F__wb%2Fcalendarcaptures%3Furl%3Dhttps%253A%252F%252Fwww.cia.gov%252Flibrary%252Fpublications%252Fthe-world-factbook%252Fgeos%252Fau.html%26selected_year%3D2017
Fetching https://web.archive.org/__wb/calendarcaptures?url=https%3A%2F%2Fwww.cia.gov%2Flibrary%2Fpublications%2Fthe-world-factbook%2Fgeos%2Fau.html&selected_year=2017
Traceback (most recent call last):
File "fetch.py", line 266, in <module>
getPage(countryPage)
File "fetch.py", line 59, in getPage
data = json.loads(yearlySummaryContent)
File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
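The fix at `fetch.py` line 59 could look something like the sketch below, assuming `yearlySummaryContent` holds the raw response body (the helper name `parse_yearly_summary` is hypothetical, not from fetch.py). When archive.org returns an empty body or an HTML error page instead of JSON, `json.loads` raises `JSONDecodeError`, so we catch it and signal the caller to discard the cached file and refetch:

```python
import json

def parse_yearly_summary(yearly_summary_content):
    # Hypothetical helper: archive.org sometimes returns an empty body or an
    # HTML error page instead of JSON, which makes json.loads raise
    # JSONDecodeError ("Expecting value: line 1 column 1 (char 0)").
    try:
        return json.loads(yearly_summary_content)
    except json.JSONDecodeError:
        # Treat unparseable content the same as a missing summary so the
        # caller can remove the cached file and fetch it again.
        return None
```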
Some pages require blacklisting due to their content (or lack of it).
Investigate these incidents and ensure the blacklisting functionality of the scraper (i.e. fetch.py) is working correctly.
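For reference, a minimal sketch of what the blacklisting check might look like, assuming a plain-text blacklist file with one URL per line (the file name and helper names here are hypothetical, not necessarily what fetch.py uses):

```python
def load_blacklist(path="blacklist.txt"):
    # Read the blacklist file, one URL per line; a missing file
    # just means nothing is blacklisted yet.
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def is_blacklisted(url, blacklist):
    # Skip fetching/parsing pages known to have bad or missing content.
    return url in blacklist
```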
Sometimes archive.org returns an error status (e.g. 500) or an error page (e.g. one containing the text Connection Failure).
Detect these events, pause, then retry the fetch until it succeeds. Use some sort of backoff between retries so archive.org doesn't get hit too frequently.
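The retry-with-backoff behaviour could be sketched like this, assuming `fetch` is any callable returning a `(status, body)` pair (that signature is an assumption for illustration, not fetch.py's actual API). It retries on error statuses or on bodies containing "Connection Failure", doubling the pause between attempts:

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0):
    # `fetch` is a hypothetical callable returning (status, body).
    # Retry on an error status (e.g. 500) or an error page containing
    # "Connection Failure", with exponential backoff between attempts
    # so archive.org doesn't get hit too frequently.
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status < 400 and "Connection Failure" not in body:
            return body
        time.sleep(delay)
        delay *= 2  # double the pause before the next attempt
    raise RuntimeError("giving up after %d failed attempts" % max_attempts)
```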