iancoleman / cia_world_factbook_api

Converts the CIA World Factbook into a JSON data structure
MIT License

Improve blacklisting and repeat fetching #5

Open iancoleman opened 7 years ago

iancoleman commented 7 years ago

Some pages require blacklisting due to their content (or lack of content).

Investigate these cases and ensure the blacklisting functionality of the scraper (i.e. fetch.py) is working correctly.

Sometimes archive.org returns an error status (e.g. 500) or an error page (e.g. one containing the text "Connection Failure").

Detect these events and pause, then repeat the fetch until it works as desired. Have some sort of backoff on the repeat so archive.org doesn't get hit too frequently.
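
A minimal sketch of what the retry described above could look like, assuming the page is fetched with the standard library's urllib. The names fetch_once, fetch_with_backoff and ERROR_MARKERS are hypothetical and not taken from fetch.py, and the delay values are placeholders rather than limits recommended by archive.org.

```python
import time
import urllib.error
import urllib.request

# Text that marks an error page; "Connection Failure" is the example from this issue.
ERROR_MARKERS = ("Connection Failure",)


def fetch_once(url, timeout=30):
    """Fetch url once; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_backoff(url, fetch=fetch_once, sleep=time.sleep,
                       max_attempts=6, base_delay=5.0, max_delay=300.0):
    """Call fetch(url) until it returns a body without an error marker.

    The pause doubles after every failed attempt, capped at max_delay.
    All defaults here are assumptions, not values from fetch.py.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch(url)
            if not any(marker in body for marker in ERROR_MARKERS):
                return body
            reason = "error page"
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                raise  # client errors (e.g. 404) will not fix themselves
            reason = "HTTP %d" % exc.code
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            reason = str(exc)
        if attempt == max_attempts:
            break
        print("Fetch failed (%s), retrying in %.0fs" % (reason, delay))
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise RuntimeError("Giving up on %s after %d attempts" % (url, max_attempts))
```

Passing fetch and sleep as parameters is only there so the loop can be exercised without touching the network; the scraper could just call fetch_with_backoff(url).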

iancoleman commented 7 years ago

Another error to catch: json.loads raises JSONDecodeError when the fetched yearlySummary content is empty or is not JSON, as in the traceback below.

Removing outdated yearly summary: /home/user/cia_data/country_html/yearly_summaries/https%3A%2F%2Fweb.archive.org%2F__wb%2Fcalendarcaptures%3Furl%3Dhttps%253A%252F%252Fwww.cia.gov%252Flibrary%252Fpublications%252Fthe-world-factbook%252Fgeos%252Fau.html%26selected_year%3D2017
Fetching https://web.archive.org/__wb/calendarcaptures?url=https%3A%2F%2Fwww.cia.gov%2Flibrary%2Fpublications%2Fthe-world-factbook%2Fgeos%2Fau.html&selected_year=2017
Traceback (most recent call last):
  File "fetch.py", line 266, in <module>
    getPage(countryPage)
  File "fetch.py", line 59, in getPage
    data = json.loads(yearlySummaryContent)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
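
One way to catch this, sketched under the assumption that getPage can treat a missing summary as "fetch again later". The helper name parse_yearly_summary is hypothetical and is not part of fetch.py; only json.loads and the yearly summary content come from the traceback above.

```python
import json


def parse_yearly_summary(content):
    """Return the decoded yearly summary, or None if content is not valid JSON.

    "Expecting value: line 1 column 1 (char 0)" means the body was empty or
    did not start with a JSON value, for example an HTML error page.
    """
    try:
        return json.loads(content)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        print("Yearly summary is not valid JSON (%s); starts with %r"
              % (exc, content[:80]))
        return None
```

A caller could then check for None and delete the cached file or refetch, instead of letting the exception end the whole run.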