The error can be reproduced with requests only:
import requests
test = requests.get("https://di.unfccc.int/api/years/single")
test.json()
results in
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The browser view (Firefox) of the JSON returned by the API looks fine; however, the data obtained by requests is HTML code, not JSON:
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=14-17360989-0%202NNN%20RT%281685546429205%2083%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U18&incident_id=878000640024271345-87107726436339726&edet=12&cinfo=04000000&rpinfo=0&cts=0UQlKPkwfAwyoDgZtI7EynFBOZ1OJ78nPC1l38fn6ivu2LqRtkfY4GU2KZybus%2bV&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 878000640024271345-87107726436339726</iframe></body></html>
It says "Request unsuccessful" but doesn't state why.
The DI API now seems to have the same anti-robot protection as the rest of the website. With the following code I get valid JSON:
import time
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
url = "https://di.unfccc.int"
# set options for headless mode
profile_path = ".firefox"
options = Options()
#options.add_argument('-headless')
# create profile for headless mode and automatic downloading
options.set_preference('profile', profile_path)
options.set_preference('browser.download.folderList', 2)
# set up selenium driver
driver = Firefox(options=options)
# visit the main data page once to create cookies
driver.get(url)
# wait a bit for the website to load before we get the cookies
time.sleep(20)
# get the session id cookie
cookies_selenium = driver.get_cookies()
cookies = {}
for cookie in cookies_selenium:
cookies[cookie['name']] = cookie['value']
import requests
#r = requests.get(url, stream=True, cookies=cookies)
test = requests.get("https://di.unfccc.int/api/years/single", cookies=cookies)
In the downloader scripts of the UNFCCC_non-AnnexI_data repository, the cookie method does not work perfectly, so I check whether the result is the requested page / file or an error page, and reset the cookies in case of an error page. Not sure if this is necessary here as well; a sketch of that check-and-retry logic is below.
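A minimal sketch of that check-and-retry logic, assuming a refresh_cookies callable that wraps the Selenium procedure above (the function and its names are illustrative, not code from the repository):

import requests

def get_json_with_retry(url, cookies, refresh_cookies, max_tries=3):
    # if we got the Incapsula HTML error page instead of JSON,
    # fetch fresh cookies and try again
    for _ in range(max_tries):
        response = requests.get(url, cookies=cookies)
        if not response.text.lstrip().lower().startswith("<html"):
            return response.json()
        cookies = refresh_cookies()  # e.g. re-run the Selenium steps above
    raise RuntimeError(f"Still got the error page after {max_tries} tries")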
Pretty annoying that they are doing this stuff now. Requiring Selenium also means the API package will be much less useful, because you need a fully set up browser and everything to use it. I'm not quite sure how to proceed; will think about it.
Maybe there is a different way than using Selenium. I just tested the code that I use for the PDF downloading.
I don't think there is a way to use it without a full browser, at least not reliably. But of course, that is super annoying for us and other end users. Maybe we have to set up a cronjob somewhere to download this daily or on some other schedule, and offer it behind a sensible API ourselves? Man, them hiding this information is annoying.
On what timeline do you need some fix which works for us at least temporarily?
I think I've downloaded the data pretty recently, so it's not that urgent. But I have to rewrite my code a bit, because when generating the full dataset for all countries I currently check for updates. I'll just add an option to use a specific time code for the individual country data (maybe it's already implemented).
Interestingly, we don't have any failures in our CI, see e.g. https://github.com/pik-primap/unfccc_di_api/actions/runs/5174540686/jobs/9320975109 . This seems to indicate that when running on github CI runners, the anti-bot measures are not in place (Microsoft getting special rights or something). One solution around this whole situation could therefore be to set up a (weekly?) github actions run which downloads all the data and uploads them somewhere else where we can then access them using a sane API. Stupid hack, but maybe more future-proof than other ideas.
FWIW, the pyam-tests are sometimes failing because of that error, see https://github.com/IAMconsortium/pyam/actions/runs/5165261947 - I’m considering removing the tests until the issue is fixed
Interestingly, though, I have only seen macOS tests fail so far - I thought that was because they are slowest, but maybe Linux/Windows servers are indeed exempt.
I use Linux, and for me it fails.
> FWIW, the pyam-tests are sometimes failing because of that error, see https://github.com/IAMconsortium/pyam/actions/runs/5165261947 - I’m considering removing the tests until the issue is fixed
You should definitely disable/remove/xfail any tests which depend on unfccc_di_api at the moment; we don't have a timeline for when this will be fixed.
With the new release 4.0.0, we have a workaround for this issue in the form of the new ZenodoReader, which reads from our data package instead of reading directly from the API.
At the moment, using the old UNFCCCApiReader class requires running the scripts on Azure (or other places which aren't blocked by the UNFCCC API) or using a hack with a cookie extracted from a running web browser. However, both methods could stop working at any point. Probably, getting the data out of the API will forever be a manual or otherwise bespoke process, and most users should just use the ZenodoReader. The downside is of course that it can only return the data from the latest version of the data package on zenodo. We have updated this data package approximately twice a year in the past.
@danielhuppmann You probably want to depend on unfccc-di-api >= 4.0.0 and use the ZenodoReader class. If you still want to support filtering by gases, you'll have to do the filtering in pyam or provide a patch. We only used that as a performance enhancement with the API, but with zenodo, it is faster to retrieve all gases and then filter in pandas.
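For illustration, a rough sketch of that usage (the "gas" column name and the exact gas labels are assumptions about the returned dataframe):

from unfccc_di_api import ZenodoReader

reader = ZenodoReader()  # reads from the latest data package on zenodo
df = reader.query(party_code="DEU")
# filter by gas in pandas instead of in the API call
df_ch4 = df[df["gas"] == "CH4"]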
Hi there, I recently had success downloading NIR and CRF files from UNFCCC with requests by altering the User-Agent header in requests.get() calls:

import requests
from fake_useragent import UserAgent

with requests.get(
    download_url, stream=True, headers={"User-Agent": UserAgent().random}
) as r:
    ...
I used UserAgent().random from https://pypi.org/project/fake-useragent/. Have you tried this already for the API @mikapfl?
I do something similar for downloading BUR / NC / CRF submissions in the UNFCCC_non-AnnexI_data repository. There, I connect to the server using Selenium and then reuse the cookies for requests. That works most of the time, but not always, so I also need to check whether I got the data or just the error page, and repeat the process in the latter case. Does your approach work every time?
Hi @joAschauer, that actually makes it possible to run our test suite on GitHub again (and probably to generally run things on Azure), but unfortunately, it is not enough to run this in other, non-Microsoft networks. Still, being able to run it on Azure again is super useful and definitely easier than the Selenium route, thanks!
See it working in action in this PR: https://github.com/pik-primap/unfccc_di_api/pull/88
Cool, happy to hear that 🙂
> Does your approach work every time?
@JGuetschow, my approach also does not work every single time, and I had to apply the same procedure with retries that you described. @mikapfl maybe this is the reason why the tests fail locally in #88. Did you try running the tests several times?
Yeah, the local tests fail every time.
Here is a proof of concept to collect the latest data from the UNFCCC website via github actions: https://github.com/joAschauer/unfccc_di_github_action_download
@joAschauer Nice, thanks! I'll have a look to see if we can automate some of the manual stuff we are doing currently to produce data packages for zenodo.
I added a new issue #109 to track this.
Description
Just importing and initializing the reader throws a JSONDecodeError from requests.

What I Did
Just import and initialize the reader. The result is a JSONDecodeError in requests/models.py:971 with the message JSONDecodeError: Expecting value: line 1 column 1 (char 0).

Full error output