jdemaeyer / brightsky

JSON API for DWD's open weather data.
https://brightsky.dev/
MIT License
287 stars · 18 forks

Support for higher frequency precipitation data #132

Closed · ptoews closed this 1 year ago

ptoews commented 1 year ago

Are there any plans to support providing higher frequency data, e.g. 1-minute intervals from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/ ? The directory structure seems different, but the file content structure looks similar.

jdemaeyer commented 1 year ago

Hi again @ptoews!

I don't think we'll be supporting this dataset through the JSON API: it doesn't fit very well into our current hourly-data structure, and it would pretty much explode the size of our current production database from ~25 GB to somewhere in the vicinity of a terabyte.

However, as you note, the structure of the data is quite similar, so you can re-use the parsing components from brightsky.parsers to parse these files locally, e.g. like this:

# dwd_parsing.py

import datetime

from brightsky.parsers import ObservationsParser
from dateutil.tz import tzutc

class MinutelyPrecipitationParser(ObservationsParser):

    # Map Bright Sky field names to DWD column names
    elements = {
        'precipitation': 'RS_01',
    }

    def parse_station_id(self, zf):
        # Station metadata isn't needed when parsing locally
        return None

    def parse_lat_lon_history(self, zf, dwd_station_id):
        # Likewise, skip the station's lat/lon history
        return {}

    def parse_reader(self, filename, reader, lat_lon_history):
        for row in reader:
            # MESS_DATUM holds the measurement time as YYYYMMDDHHMM (UTC)
            timestamp = datetime.datetime.strptime(
                row['MESS_DATUM'], '%Y%m%d%H%M').replace(tzinfo=tzutc())
            yield {
                'timestamp': timestamp,
                **self.parse_elements(row, None, None, None),
            }

def parse_1min(url):
    # Download the zip file, parse all records, then remove the download
    parser = MinutelyPrecipitationParser(url=url)
    parser.download()
    records = list(parser.parse())
    parser.cleanup()
    return records

used like:

In [1]: from dwd_parsing import parse_1min

In [2]: url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/1minutenwerte_nieder_00902_akt.zip'

In [3]: parse_1min(url)[0]
Out[3]:
{'observation_type': 'historical',
 'dwd_station_id': None,
 'wmo_station_id': None,
 'timestamp': datetime.datetime(2022, 1, 1, 0, 0, tzinfo=tzutc()),
 'precipitation': 0.0}

I'm keeping this ticket open nonetheless to gauge if there's a lot of community interest for retrieving this data through the API.

ptoews commented 1 year ago

Hi @jdemaeyer, thank you for the detailed example, that's already helpful!

To be honest, I wasn't even aware that all the data goes through your databases. I've read through the README and some of the code now; am I understanding correctly that the purpose is to have an index of station locations, so you can find the nearest ones for a given query? But then why store the weather data as well? For efficiency, i.e. to reduce the load on the DWD servers?

Since you're understandably not going to integrate this kind of data into the databases for now, what I think would help me is a tool that takes a location and a time and returns the weather data from DWD. The tool would need an index to find the closest station, and would then simply build a DWD file URL from the precipitation interval, station id, and time. The file at that URL would then be parsed as you described.

Am I missing something? Do you think this would be something that would be useful as part of brightsky?

jdemaeyer commented 1 year ago

[...] am I understanding correctly that the purpose is to have an index of station locations, to find the nearest ones for a given query? But then why store the weather data as well, for efficiency/to reduce DWD API load?

Both of these, although the station index only plays a minor role compared to the performance of receiving weather records once we know which stations to look at.

Transparently (re-)loading and (re-)parsing the data from the DWD server for every request would be a massive waste of resources. A typical weather record contains data from nine different files on the DWD server, each of which is somewhere between 50 and 100 kilobytes (because each file holds multi-year measurements for a different parameter for that station). Bright Sky currently receives a little shy of a thousand requests per minute. So if we didn't store the data in our own database, we'd be requesting around a terabyte per day from the DWD server (roughly 100 Mbit/s). And that's before we even consider the majestic amount of CPU power required to parse the same files over and over again, particularly if we want to keep replying within 12 ms on average like we currently do.
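A quick back-of-envelope check of those numbers, using midpoint assumptions of 75 kB per file and exactly 1,000 requests per minute:

```python
# Rough traffic estimate if every request re-fetched raw DWD files
files_per_request = 9          # parameters spread across nine files
bytes_per_file = 75_000        # midpoint of the 50-100 kB range
requests_per_minute = 1_000    # approximate current load

bytes_per_day = files_per_request * bytes_per_file * requests_per_minute * 60 * 24
print(f"{bytes_per_day / 1e12:.2f} TB/day")              # 0.97 TB/day
print(f"{bytes_per_day * 8 / 86_400 / 1e6:.0f} Mbit/s")  # 90 Mbit/s
```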

Am I missing something? Do you think this would be something that would be useful as part of brightsky?

While the approach you outline should work (though it would be very inefficient for the reasons above), I don't think it'll land in Bright Sky, particularly because it violates fair-use principles: we would be building a service that consumes more and more of someone else's resources (in this case the DWD's storage and bandwidth) as it grows.

The /sources endpoint allows querying Bright Sky's lat/lon-to-station-id mapping without retrieving any weather records; maybe that can help you? From there you could easily build the URL to the 1-minute precipitation data and parse it as in my post above.
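As a rough sketch of that workflow (hedged: the lat/lon query parameters and the dwd_station_id response field reflect my understanding of the public /sources endpoint, and the zip filename pattern is inferred from the single example URL above, so both may need adjusting against the actual API and directory listing):

```python
import json
import urllib.request


def nearest_dwd_station_id(lat, lon):
    # Ask Bright Sky's /sources endpoint for stations near the given point
    # and return the DWD station id of the first source that has one.
    url = f"https://api.brightsky.dev/sources?lat={lat}&lon={lon}"
    with urllib.request.urlopen(url) as resp:
        sources = json.load(resp)["sources"]
    for source in sources:
        if source.get("dwd_station_id"):
            return source["dwd_station_id"]
    return None


def minutely_precipitation_url(dwd_station_id):
    # Build the "recent" 1-minute precipitation URL following the naming
    # pattern seen above: 1minutenwerte_nieder_<station>_akt.zip
    base = ("https://opendata.dwd.de/climate_environment/CDC/"
            "observations_germany/climate/1_minute/precipitation/recent/")
    return f"{base}1minutenwerte_nieder_{dwd_station_id}_akt.zip"
```

The resulting URL could then be fed straight into the parse_1min() helper from the earlier comment.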

ptoews commented 1 year ago

That makes sense, that's a lot of data. The sources endpoint is very helpful; I'll use that for sure. Thanks!

jdemaeyer commented 1 year ago

(Closing in favour of #148, which contains more alternatives)