primap-community / unfccc_di_api

Python wrapper around the Flexible Query API of the UNFCCC.
https://unfccc-di-api.readthedocs.io

JSONDecodeError when using unfccc-di-api #74

Closed JGuetschow closed 1 year ago

JGuetschow commented 1 year ago

Description

Just importing and initializing the reader throws a JSONDecodeError from requests.

What I Did

Just import and initialize the reader

import unfccc_di_api
reader = unfccc_di_api.UNFCCCApiReader()

The result is a JSONDecodeError in requests/models.py:971 with the message JSONDecodeError: Expecting value: line 1 column 1 (char 0).

Full error output

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File <repo_path>/venv/lib/python3.10/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
    970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File /usr/lib/python3.10/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File /usr/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 """Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File /usr/lib/python3.10/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
Cell In[1], line 2
      1 import unfccc_di_api
----> 2 reader = unfccc_di_api.UNFCCCApiReader()

File <repo_path>/venv/lib/python3.10/site-packages/unfccc_di_api/unfccc_di_api.py:101, in UNFCCCApiReader.__init__(self, base_url)
     94 def __init__(self, *, base_url: str = "https://di.unfccc.int/api/"):
     95     """
     96     Parameters
     97     ----------
     98     base_url : str
     99         Location of the UNFCCC api.
    100     """
--> 101     self.annex_one_reader = UNFCCCSingleCategoryApiReader(
    102         party_category="annexOne", base_url=base_url
    103     )
    104     self.non_annex_one_reader = UNFCCCSingleCategoryApiReader(
    105         party_category="nonAnnexOne", base_url=base_url
    106     )
    108     self.parties = pd.concat(
    109         [self.annex_one_reader.parties, self.non_annex_one_reader.parties]
    110     ).sort_index()

File <repo_path>/venv/lib/python3.10/site-packages/unfccc_di_api/unfccc_di_api.py:221, in UNFCCCSingleCategoryApiReader.__init__(self, party_category, base_url)
    211 """
    212 Parameters
    213 ----------
   (...)
    217    Location of the UNFCCC api.
    218 """
    219 self.base_url = base_url
--> 221 parties_raw = self._get(f"parties/{party_category}")
    222 parties_entries = []
    223 for entry in parties_raw:

File <repo_path>/venv/lib/python3.10/site-packages/unfccc_di_api/unfccc_di_api.py:604, in UNFCCCSingleCategoryApiReader._get(self, component)
    602 resp = requests.get(self.base_url + component)
    603 resp.raise_for_status()
--> 604 return resp.json()

File <repo_path>/venv/lib/python3.10/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
JGuetschow commented 1 year ago

The error can be reproduced with requests only:

import requests
test = requests.get("https://di.unfccc.int/api/years/single")
test.json()

results in

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The browser view (Firefox) of the JSON returned by the API looks fine; however, the data obtained by requests is HTML, not JSON:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=14-17360989-0%202NNN%20RT%281685546429205%2083%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U18&incident_id=878000640024271345-87107726436339726&edet=12&cinfo=04000000&rpinfo=0&cts=0UQlKPkwfAwyoDgZtI7EynFBOZ1OJ78nPC1l38fn6ivu2LqRtkfY4GU2KZybus%2bV&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 878000640024271345-87107726436339726</iframe></body></html>

It says 'Request unsuccessful' but doesn't state why.

JGuetschow commented 1 year ago

The DI API now seems to have the same anti-robot protection as the rest of the website. With the following code I get valid JSON:

import time

import requests
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

url = "https://di.unfccc.int"

# set options for headless mode
profile_path = ".firefox"
options = Options()
#options.add_argument('-headless')

# create profile for headless mode and automatic downloading
options.set_preference('profile', profile_path)
options.set_preference('browser.download.folderList', 2)

# set up selenium driver
driver = Firefox(options=options)
# visit the main data page once to create cookies
driver.get(url)

# wait a bit for the website to load before we get the cookies
time.sleep(20)

# get the session id cookie
cookies_selenium = driver.get_cookies()
cookies = {}
for cookie in cookies_selenium:
    cookies[cookie['name']] = cookie['value']

#r = requests.get(url, stream=True, cookies=cookies)
test = requests.get("https://di.unfccc.int/api/years/single", cookies=cookies)

In the downloader scripts of the UNFCCC_non-annexI_data repository the cookie method does not work perfectly, so I check whether the result is the requested page / file or an error page and reset the cookies in case of an error page. Not sure if this is necessary here as well.
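The check-and-retry procedure described above could be sketched roughly as follows. This is a hypothetical helper, not code from the repository: `is_error_page` heuristically recognizes the Incapsula block page (which is HTML and mentions an Incapsula incident ID), and `get_json_with_retry` resets cookies via a caller-supplied `get_cookies()` callable (e.g. wrapping the selenium snippet above) whenever the block page comes back:

```python
import requests

ERROR_MARKER = "Incapsula"  # the block page embeds an Incapsula incident notice


def is_error_page(text: str) -> bool:
    """Heuristically detect the anti-bot HTML page returned instead of JSON."""
    return text.lstrip().startswith("<") and ERROR_MARKER in text


def get_json_with_retry(url, get_cookies, max_tries=3):
    """Fetch ``url``, refreshing cookies via ``get_cookies()`` whenever the
    anti-bot error page is returned instead of the expected JSON."""
    cookies = get_cookies()
    for _ in range(max_tries):
        resp = requests.get(url, cookies=cookies)
        if not is_error_page(resp.text):
            resp.raise_for_status()
            return resp.json()
        # got the block page: reset cookies and try again
        cookies = get_cookies()
    raise RuntimeError(f"still blocked after {max_tries} tries: {url}")
```

The marker string and retry count are guesses; the real scripts may use a different heuristic for detecting the error page.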

mikapfl commented 1 year ago

Pretty annoying that they are doing this stuff now. Requiring selenium also means the API package will be much less useful because you need a fully set up browser and everything to use it. I'm not quite sure how to proceed, will think about it.

JGuetschow commented 1 year ago

Maybe there is a different way than using selenium. I just tested the code that I use for the PDF downloading.

mikapfl commented 1 year ago

I don't think there is a way to use it without a full browser, at least not reliably. But of course, that is super annoying for us and other end users. Maybe we have to set up a cronjob somewhere to download this daily or on some other schedule, and offer it behind a sensible API ourselves? Man, them hiding this information is annoying.

mikapfl commented 1 year ago

I added a warning to the README.

mikapfl commented 1 year ago

On what timeline do you need some fix which works for us at least temporarily?

JGuetschow commented 1 year ago

I think I've downloaded the data pretty recently, so it's not that urgent. But I have to rewrite my code a bit because when generating the full dataset for all countries I currently check for updates. But I'll just add an option to use a specific time code for the individual country data (maybe it's already implemented)

mikapfl commented 1 year ago

Interestingly, we don't have any failures in our CI, see e.g. https://github.com/pik-primap/unfccc_di_api/actions/runs/5174540686/jobs/9320975109 . This seems to indicate that when running on github CI runners, the anti-bot measures are not in place (Microsoft getting special rights or something). One solution around this whole situation could therefore be to set up a (weekly?) github actions run which downloads all the data and uploads them somewhere else where we can then access them using a sane API. Stupid hack, but maybe more future-proof than other ideas.

danielhuppmann commented 1 year ago

FWIW, the pyam-tests are sometimes failing because of that error, see https://github.com/IAMconsortium/pyam/actions/runs/5165261947 - I’m considering removing the tests until the issue is fixed

danielhuppmann commented 1 year ago

Interestingly, though, I have only seen macOS tests fail so far - I thought that was because they are the slowest, but maybe Linux/Windows servers are indeed exempt.

JGuetschow commented 1 year ago

I use Linux and for me it fails.

mikapfl commented 1 year ago

FWIW, the pyam-tests are sometimes failing because of that error, see https://github.com/IAMconsortium/pyam/actions/runs/5165261947 - I’m considering removing the tests until the issue is fixed

You should definitely disable/remove/xfail any tests which depend on unfccc_di_api at the moment; we don't have a timeline for when this will be fixed.

mikapfl commented 1 year ago

With the new release 4.0.0, we have a work-around for this issue, in the form of the new ZenodoReader which reads from our data package instead of reading directly from the API.

At the moment, using the old UNFCCCApiReader class requires running the scripts on Azure (or other places which aren't blocked by the UNFCCC API) or using a hack with a cookie extracted from a running web browser. However, both methods could stop working at any point. Probably, getting the data out of the API will forever be a manual or otherwise bespoke process, and most users should just use the ZenodoReader. The downside is of course that it can only return the data from the latest version of the data package on Zenodo. We have updated this data package approximately twice a year in the past.

@danielhuppmann You probably want to depend on unfccc-di-api >= 4.0.0 and use the ZenodoReader class. If you still want to support filtering by gases, you'll have to do the filtering in pyam or provide a patch. We only used that as a performance enhancement with the API, but with Zenodo, it is faster to retrieve all gases and then filter in pandas.
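The "retrieve all gases, then filter in pandas" approach might look like this. The DataFrame below is a toy stand-in for a reader query result; the real result has more columns, and a `gas` column is assumed here for illustration:

```python
import pandas as pd

# Toy stand-in for a query result; the real frame has more columns.
df = pd.DataFrame(
    {
        "party": ["DEU", "DEU", "FRA"],
        "gas": ["CH4", "CO2", "CH4"],
        "numberValue": [1.0, 2.0, 3.0],
    }
)

# Retrieve everything first, then filter by gas in pandas:
ch4 = df[df["gas"].isin(["CH4"])]
```

Filtering a local DataFrame this way avoids extra requests against the data package, which is why doing it after retrieval is the faster route with Zenodo.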

joAschauer commented 11 months ago

Hi there, I recently had success downloading NIR and CRF files from the UNFCCC with requests by altering the User-Agent header in requests.get() calls:

import requests
from fake_useragent import UserAgent

with requests.get(
    download_url, stream=True, headers={"User-Agent": UserAgent().random}
) as r:
    ...

I used UserAgent().random from https://pypi.org/project/fake-useragent/

Have you tried this already for the api @mikapfl?

JGuetschow commented 11 months ago

I do something similar for downloading BUR / NC / CRF submissions in the UNFCCC_non-AnnexI_data repository. There, I connect to the server using selenium and then reuse the cookies for requests. That works most of the time, but not always, so I also check whether I got the data or just the error page and repeat the process in the latter case. Does your approach work every time?

mikapfl commented 11 months ago

Hi @joAschauer that actually makes it possible to run our test suite again on github (and probably generally run things on Azure), but unfortunately, it is not enough to run this in other non-Microsoft networks. But being able to run it on Azure again is super useful and definitely easier than the Selenium route, thanks!

See it working in action in this PR: https://github.com/pik-primap/unfccc_di_api/pull/88

joAschauer commented 11 months ago

Cool, happy to hear that 🙂

Does your approach work every time?

@JGuetschow, my approach also does not work every single time, and I had to apply the same retry procedure as you described. @mikapfl maybe this is the reason why tests fail locally in #88. Did you try to run the tests several times?
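The retry procedure with a fresh random User-Agent per attempt could be sketched as below. This is a hypothetical helper, not code from either repository: the small `USER_AGENTS` pool stands in for `fake_useragent`'s `UserAgent().random`, and `fetch` is injected so the loop can be exercised without network access:

```python
import random

# Small stand-in pool for fake_useragent's UserAgent().random.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]


def fetch_with_retries(fetch, url, max_tries=5):
    """Call ``fetch(url, user_agent)`` with a fresh random User-Agent on each
    attempt until the response no longer looks like the Incapsula block page."""
    for _ in range(max_tries):
        body = fetch(url, random.choice(USER_AGENTS))
        if "Incapsula" not in body:
            return body
        # blocked: retry with a (possibly) different User-Agent
    raise RuntimeError(f"blocked after {max_tries} attempts: {url}")
```

In practice `fetch` would wrap `requests.get` with the chosen header, and the block-page check would match whatever error page the server actually returns.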

mikapfl commented 11 months ago

Yeah, the local tests fail every time.

joAschauer commented 4 months ago

Here is a proof of concept for collecting the latest data from the UNFCCC website via GitHub Actions: https://github.com/joAschauer/unfccc_di_github_action_download

mikapfl commented 4 months ago

@joAschauer Nice, thanks! I'll have a look to see if we can automate some of the manual stuff we are doing currently to produce data packages for zenodo.

mikapfl commented 4 months ago

I added a new issue #109 to track this.