mesowx / MesoPy

A Python wrapper for the MesoWest environmental data API
MIT License

json.loads fails with >= 2 year request #21

Closed dnowacki-usgs closed 7 years ago

dnowacki-usgs commented 8 years ago

It seems that the MesoWest API does not return valid JSON data for long timeseries requests (apparently those longer than 2 years). Currently MesoPy does not catch the JSON error and returns the somewhat opaque ValueError: No JSON object could be decoded. A more user-friendly error message that suggests shortening the time range would be nice. I'm happy to submit a pull request implementing this change if this sounds useful.

joeyoun9 commented 8 years ago

Thanks for reporting this. With longer timeseries requests, you likely hit our duration limit, and we are aware that some of our error messages do not come out as proper JSON. Without an example query, I can't be sure, but in any case, if you submitted a pull request to more gracefully catch erroneous JSON output, we would definitely consider integrating it.

Thanks,

Joe

dnowacki-usgs commented 8 years ago

Thanks for the reply. Here are two example queries showing the 2 year threshold.

This one works (1 Jan 2013 00:01–1 Jan 2015 00:00, i.e. one minute less than two years of data)

from MesoPy import Meso
m = Meso(token='my token')
ts = m.timeseries(stid='kwal', start='201301010001', end='201501010000', units='METRIC')

This one fails (1 Jan 2013 00:00–1 Jan 2015 00:00, i.e. exactly two years)

from MesoPy import Meso
m = Meso(token='my token')
ts = m.timeseries(stid='kwal', start='201301010000', end='201501010000', units='METRIC')

and results in the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-89-24027e4e9447> in <module>()
----> 1 ts = m.timeseries(stid='astm2', start='201301010000', end='201501010000', units='METRIC')

/Users/dnowacki/anaconda/lib/python2.7/site-packages/MesoPy.pyc in timeseries(self, start, end, **kwargs)
    482         kwargs['token'] = self.token
    483 
--> 484         return self._get_response('stations/timeseries', kwargs)
    485 
    486     def climatology(self, startclim, endclim, **kwargs):

/Users/dnowacki/anaconda/lib/python2.7/site-packages/MesoPy.pyc in _get_response(self, endpoint, request_dict)
    160         except urllib.error.URLError:
    161             raise MesoPyError(http_error)
--> 162         return self._checkresponse(json.loads(resp.decode('utf-8')))
    163 
    164     def _check_geo_param(self, arg_list):

/Users/dnowacki/anaconda/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    337             parse_int is None and parse_float is None and
    338             parse_constant is None and object_pairs_hook is None and not kw):
--> 339         return _default_decoder.decode(s)
    340     if cls is None:
    341         cls = JSONDecoder

/Users/dnowacki/anaconda/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    362 
    363         """
--> 364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    365         end = _w(s, end).end()
    366         if end != len(s):

/Users/dnowacki/anaconda/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
--> 382             raise ValueError("No JSON object could be decoded")
    383         return obj, end

ValueError: No JSON object could be decoded

I have a working fix (an additional try:, except ValueError: in _get_response()) that I'll make into a PR.
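A minimal sketch of what such a catch might look like, written as a standalone function rather than as the actual patch. The name decode_response is hypothetical, the MesoPyError class here is only a stand-in for MesoPy's own exception, and the message wording is an assumption:

```python
import json


class MesoPyError(Exception):
    """Stand-in for MesoPy's own exception class (illustration only)."""


def decode_response(raw):
    """Decode a raw API response body, raising a readable error on bad JSON."""
    try:
        return json.loads(raw.decode('utf-8'))
    except ValueError:
        # json raises ValueError (JSONDecodeError in Python 3) on non-JSON input.
        raise MesoPyError(
            'The API did not return valid JSON. This can happen with very '
            'long time ranges; try a shorter start/end window.'
        )
```

With this in place, a caller would see the suggestion to shorten the time range instead of the bare "No JSON object could be decoded".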

NicWayand commented 7 years ago

@dnowacki-usgs Any chance you put this into a PR (or branch) you can share?

I have been using a workaround but would be nice to get the fix!

dnowacki-usgs commented 7 years ago

@NicWayand I created a PR; it doesn't really fix the issue, but it does catch the error. I wonder if the 2 year limit is a hard limit imposed by the MesoWest API.

NicWayand commented 7 years ago

Ah I see. Well the catch is appreciated! Guess that is what @joeyoun9 meant by their "duration limit"? I'll just stick with repeat calls then I guess.

johnhorel commented 7 years ago

Nic-

An even better option may be to bypass MesoPy and rely on the broader functionality now offered through the api services of synopticlabs. If there are key features in MesoPy that are not there, then let us know. We've been kicking around deprecating MesoPy as it has been pretty much overtaken by the api capabilities.

Regards

John

NicWayand commented 7 years ago

Hi John, thanks for the suggestion. Calling the API directly through my browser lets me grab multiple years. But, isn't this what MesoPy is doing in python, wrapping the API url calls with urllib? I don't understand where the limit on number of years in MesoPy is coming from. I would like to continue to work in python (end goal is to download X stations over Y extent and convert to a netcdf file), but don't want to reinvent the wheel (MesoPy). Thanks!

johnhorel commented 7 years ago

Yep, I suspect the only thing MesoPy is really providing you is the urllib wrapper. I'll let one of the python gurus here comment further. I think when MesoPy was developed we did throttle it, as the api server at the time was constrained during development. Those limitations are not imposed by the api and you should be able to do what you want to do very efficiently.

Regards

john

NicWayand commented 7 years ago

Ok thanks John. Here is a test case (using requests rather than urllib) that works for about one year but not for multiple years. Note that the multi-year request works when pasted into my browser!

import requests
url_one  = 'https://api.mesowest.net/v2/stations/timeseries?token=TOKENHERE&stid=kslc%20&start=201401010000&end=201506020000&vars=wind_speed'
url_all  = 'https://api.mesowest.net/v2/stations/timeseries?token=TOKENHERE&stid=kslc%20&start=199701010000&end=201506020000&vars=wind_speed'
# This works
oneYear = requests.get(url_one).json()
print(oneYear.keys())
# This doesn't
allYears = requests.get(url_all).json()
# Spits out: JSONDecodeError: Expecting value: line 1 column 1 (char 0)

johnhorel commented 7 years ago

Interesting, but here it is directly and I get all years in my browser

https://api.mesowest.net/v2/stations/timeseries?token=demotoken&stid=kslc%20&start=199701010000&end=201506020000&vars=wind_speed

If you haven't done so already (you probably have), install jsonview in a chrome browser to see it all.

We've had some internal discussions about improving the python/api linkages in the docs.

john

adamabernathy commented 7 years ago

Sorry for getting in this late, I've been lounging around Martha's Vineyard for the last week taking in all the local seafood.

The issue here is not anyone's code or MesoPy; it's that when the API returns a very large timeseries response, the response gets converted to CSV so that applications can buffer and parse it in chunks. By its nature, a single JSON document can't easily be streamed or chunked, and extremely large JSON payloads carry serious overhead to parse.

This code example demonstrates the usual pattern I use to pull data from the Mesonet API, and it includes handling to catch a CSV response. If you wanted to automate converting the CSV to whatever format you'd like to store the data in, I would add that code here:

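import json

def parse_response(response):  # hypothetical wrapper; the original is a fragment of a larger function
    try:
        payload = json.loads(response.read())
    except ValueError:
        # Not valid JSON, more than likely a CSV response. Add CSV handler here.
        print('JSON decode error. More than likely a CSV response.')
        return None
    return payload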

or you could break apart the fetch and parsing process of this function into two separate functions to handle this case.
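As a rough sketch of that split, assuming Python 3 and assuming that any body which fails to decode as JSON is the CSV fallback. The function names fetch and parse are made up for illustration, and the real CSV layout (header or comment lines, column order) is not described in this thread, so the rows are returned unprocessed:

```python
import csv
import io
import json
from urllib.request import urlopen


def fetch(url):
    """Download the response body and return it as text."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


def parse(body):
    """Return ('json', dict) or ('csv', list_of_rows) depending on the body."""
    try:
        return 'json', json.loads(body)
    except ValueError:
        # Assumption: anything that is not JSON is the CSV fallback.
        rows = list(csv.reader(io.StringIO(body)))
        return 'csv', rows
```

A caller would then do something like kind, data = parse(fetch(url)) and branch on kind, which keeps the network code free of format handling.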

I hope this helps a little bit.

NicWayand commented 7 years ago

Hi @adamabernathy, sounds like the best option for large station/time downloads is to save to csv first, then load into xarray. Below is a simple bash script using curl with your API. This works fine for my purposes, although it would be nice to have the download and save option in mesopy (is this what you are suggesting?).

#!/bin/bash

# Bash Downloader for mesowest API
# file {stations.txt} must be in run dir, containing a list of station ids to download.
# Replace with your token

base_url='https://api.mesowest.net/v2/stations/timeseries?token=demotoken&stid='
data_folder='test'
var='wind_speed'
d_start='201701010000'
d_end='201708180000'

mkdir -p $data_folder

for i in $(cat stations.txt); do
    curl "${base_url}${i}%20&start=${d_start}&end=${d_end}&vars=${var}" > "${data_folder}/${i}.txt"
done

echo "Finished"

adamabernathy commented 7 years ago

@NicWayand, the rollover to CSV is a special case and, for the purpose of MesoPy, should be considered an invalid response. What I was referring to earlier is that if you were to get a CSV response back from the API, an appropriate solution would be to create a routine that breaks the request up into smaller time segments and then appends them to the NetCDF (or disk) in an iterative process.
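One way such a routine could split a request into smaller time segments is sketched below. The helper name split_range and the 365-day default are assumptions for illustration; the thread does not state where the API's actual size threshold lies. Timestamps use the YYYYmmddHHMM form seen in the requests above:

```python
from datetime import datetime, timedelta

FMT = '%Y%m%d%H%M'  # timestamp format used in the requests in this thread


def split_range(start, end, max_days=365):
    """Yield (start, end) string pairs, each spanning at most max_days."""
    t0 = datetime.strptime(start, FMT)
    t_end = datetime.strptime(end, FMT)
    step = timedelta(days=max_days)
    while t0 < t_end:
        t1 = min(t0 + step, t_end)
        yield t0.strftime(FMT), t1.strftime(FMT)
        t0 = t1
```

Each pair can be passed as the start and end of a separate timeseries request. Adjacent segments share a boundary timestamp, so an observation falling exactly on a boundary may show up twice and might need de-duplicating.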

From an engineering standpoint reading extremely large blocks of data into memory can be dangerous. If you wanted to persist this data to a NetCDF or HDF file, you can easily append the data, rather than loading the entire dataset into memory before the write process. This keeps the overall overhead pretty low.
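The append-as-you-go idea can be illustrated with a plain CSV file on disk, used here as a stand-in for NetCDF or HDF so the sketch needs only the standard library. The helper name append_rows is hypothetical:

```python
import csv


def append_rows(path, rows):
    """Append rows to a CSV file on disk without reading existing contents."""
    with open(path, 'a', newline='') as handle:
        csv.writer(handle).writerows(rows)
```

Calling this once per downloaded time segment means only one segment's rows are held in memory at a time.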

For this ticket, I'm going to close it out; commit fd1b006e58d67e866845ea31ee4e2e2bab9e9d09 addresses the json.loads failure.