Increase granularity to 10 minutes

philippdzm commented 1 year ago

Hi, thanks for the great work.

As the DWD provides 10-minute data, I was wondering whether the Bright Sky API can be told to return 10-minute data where available.

For example, the response to <domain>/weather?date=2023-05-14&last_date=2023-05-15&tz=UTC&units=dwd&dwd_station_id=<id-1>,<id-2> returns hourly data. Concretely: could it return 10-minute data instead?

jdemaeyer commented 1 year ago

Hi @philippdzm, thanks for the kind words!

That is currently not possible directly in Bright Sky and probably won't be added in the near future. It would dramatically increase the size of our database, and it would require a lot of deliberation on how to absorb non-hourly data into our current data structure and how (or whether) to mix in the parameters that are only available hourly.

However, a lot of this hassle could be mitigated by adding a separate endpoint (and database table) for the 10-minute data, and only storing the past e.g. 7 days. Would that work for your use case?
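
To make the retention idea concrete, here is a rough sketch of pruning records to a rolling window. It is not how Bright Sky stores data; the `prune_old_records` helper and the record shape (a dict with a `timestamp` key) are assumptions for illustration only.

```python
import datetime

# Assumption: the "e.g. 7 days" window mentioned above
RETENTION = datetime.timedelta(days=7)


def prune_old_records(records, now, retention=RETENTION):
    """Keep only records whose timestamp lies within the retention window."""
    cutoff = now - retention
    return [r for r in records if r["timestamp"] >= cutoff]


now = datetime.datetime(2023, 5, 15, tzinfo=datetime.timezone.utc)
records = [
    {"timestamp": now - datetime.timedelta(days=8), "precipitation_10": 0.0},
    {"timestamp": now - datetime.timedelta(minutes=10), "precipitation_10": 0.2},
]
kept = prune_old_records(records, now)
print(len(kept))  # 1
```

In a real database this would more likely be a periodic delete on the timestamp column of the separate table.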

Here are two alternatives that might work depending on your use case without having to wait for me to implement a new endpoint:

1. Use rainfall radar data (if all you need is precipitation)

The new radar endpoint (#144) provides hyperlocal (1 km² grid cell size) precipitation data in 5-minute intervals, including 5-minute forecasts for the next two hours. Selecting by lat/lon is currently not possible (but will come soon); for now, you'll have to find the nearest pixel to your station in this giant array and provide a corresponding bbox.
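
The nearest-pixel lookup could be sketched roughly as follows. This assumes you have loaded the pixel coordinates into a nested list where `grid[row][col]` is a `(lat, lon)` tuple; that layout, the helper names, and the `(top, left, bottom, right)` order of the box are all assumptions here, so check them against the radar endpoint's documentation. The toy grid uses invented coordinates.

```python
import math


def nearest_pixel(lat, lon, grid):
    """Return (row, col) of the grid cell whose centre is closest to (lat, lon).

    `grid[row][col]` is assumed to hold a (lat, lon) tuple per pixel centre.
    """
    cos_lat = math.cos(math.radians(lat))
    best, best_dist = None, math.inf
    for row, cells in enumerate(grid):
        for col, (cell_lat, cell_lon) in enumerate(cells):
            # Equirectangular approximation: good enough to rank nearby cells
            dist = (cell_lat - lat) ** 2 + ((cell_lon - lon) * cos_lat) ** 2
            if dist < best_dist:
                best, best_dist = (row, col), dist
    return best


def bbox_around(row, col, margin=0):
    """Build a pixel box as (top, left, bottom, right); the order is assumed.

    Only the lower bound is clamped; clamp against the grid size as needed.
    """
    return (max(row - margin, 0), max(col - margin, 0), row + margin, col + margin)


# Toy 3x3 grid with invented coordinates, for illustration only
toy_grid = [
    [(52.0 - 0.01 * r, 7.0 + 0.015 * c) for c in range(3)]
    for r in range(3)
]

row, col = nearest_pixel(51.991, 7.029, toy_grid)
print(row, col)                  # 1 2
print(bbox_around(row, col, 1))  # (0, 1, 2, 3)
```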

2. Manually perform parsing in Python

Bright Sky's parsing core lives in the dwdparse package, and you can subclass its parsers for the ten-minute-data, e.g. for precipitation data files:

import datetime
import re

from dwdparse.parsers import ObservationsParser

FILENAME = '10minutenwerte_nieder_01766_akt.zip'

class TenMinutePrecipitationParser(ObservationsParser):

    elements = {
        'precipitation_10': 'RWS_10',
    }

    def parse_station_id(self, zf, **extra):
        for filename in zf.namelist():
            if (m := re.search(r'_(\d+)\.txt', filename)):
                return m.group(1)

    def parse_lat_lon_history(self, zf, dwd_station_id, **extra):
        """Not available in 10-minute-files"""
        return {}

    def parse_reader(self, filename, reader, lat_lon_history):
        for row in reader:
            timestamp = datetime.datetime.strptime(
                row['MESS_DATUM'],
                '%Y%m%d%H%M',
            ).replace(
                tzinfo=datetime.timezone.utc,
            )
            yield {
                'source': f'Observations:Recent:{filename}',
                'timestamp': timestamp,
                **self.parse_elements(row, None, None, None),
            }

p = TenMinutePrecipitationParser()
for record in p.parse(FILENAME):
    print(record)
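
To see the two parsing steps from the class above in isolation, here is a small standalone sketch. The member filename is hypothetical (only the trailing `_<digits>.txt` matters to the regex), and the `MESS_DATUM` value is an example string in the `%Y%m%d%H%M` format used above.

```python
import datetime
import re

# Hypothetical member name inside the ZIP archive
member_name = 'example_20230514_01766.txt'
station_id = re.search(r'_(\d+)\.txt', member_name).group(1)
print(station_id)  # 01766

# MESS_DATUM is parsed as a naive datetime and then marked as UTC
timestamp = datetime.datetime.strptime(
    '202305140010',
    '%Y%m%d%H%M',
).replace(
    tzinfo=datetime.timezone.utc,
)
print(timestamp.isoformat())  # 2023-05-14T00:10:00+00:00
```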

—

Would any of these work for you?

(Previous discussion: https://github.com/jdemaeyer/brightsky/issues/132)

philippdzm commented 1 year ago

Hi @jdemaeyer

> it would dramatically increase the size of our database

I agree, it should be optional if it gets added.

> However, a lot of this hassle could be mitigated by adding a separate endpoint (and database table) for the 10-minute data, and only storing the past e.g. 7 days. Would that work for your use case?

Yes, for my use case this would be perfect. 7 days is enough (it could be made configurable).

Side-notes to this:

- As weather data is time-series data, have you considered InfluxDB or TimescaleDB?
- Something else that would reduce the size: someone self-hosting your solution via the infrastructure repository probably doesn't need the data for all available stations. Maybe one could mask the relevant stations and that way avoid downloading the 90% of data that is irrelevant.
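
A minimal sketch of what I mean by masking stations: filter the list of archive names before downloading anything. It assumes the archive names carry the station ID as the digits right before `_akt.zip`, like the example filename earlier in this thread; `select_station_files` and the second filename are made up for illustration.

```python
import re

# Assumption: names look like '10minutenwerte_nieder_01766_akt.zip'
STATION_ID_RE = re.compile(r"_(\d+)_akt\.zip$")


def select_station_files(filenames, wanted_ids):
    """Return only the archive names that belong to the wanted station IDs."""
    # Normalise IDs to five digits so '1766' and '01766' compare equal
    wanted = {sid.zfill(5) for sid in wanted_ids}
    selected = []
    for name in filenames:
        m = STATION_ID_RE.search(name)
        if m and m.group(1).zfill(5) in wanted:
            selected.append(name)
    return selected


available = [
    "10minutenwerte_nieder_01766_akt.zip",
    "10minutenwerte_nieder_00044_akt.zip",  # hypothetical second station
]
print(select_station_files(available, ["1766"]))
# ['10minutenwerte_nieder_01766_akt.zip']
```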

> Manually perform parsing in Python

Thank you for the provided code. Nicely done subclassing for the 10-minute data! I'll look at it next week.

For now, I was able to rig up a solution that avoids saving the file to disk: it processes the download right away and extracts the data I need (air temperature):

from io import BytesIO, StringIO
import zipfile

import pandas
import requests


def fetch_data(url_to_file):
    # Fetch the archive into memory and fail early on HTTP errors
    r = requests.get(url_to_file)
    r.raise_for_status()

    # Open the ZIP archive directly from the response bytes
    z = zipfile.ZipFile(BytesIO(r.content))

    # Assuming there's a single txt file
    txt_file_name = z.namelist()[0]
    txt_file_content = StringIO(z.open(txt_file_name).read().decode('utf-8'))

    # Parse the text file using pandas
    data = pandas.read_csv(txt_file_content, sep=';')
    data.MESS_DATUM = pandas.to_datetime(data.MESS_DATUM, format='%Y%m%d%H%M', utc=True)
    data = data.set_index('MESS_DATUM')

    # Return air temperature only
    return data.TT_10
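
For trying the in-memory approach without a network call or pandas, here is a standard-library-only sketch of the same idea. `parse_ten_minute_zip` and `make_sample_zip` are hypothetical helpers, and the sample rows are invented values in the same semicolon-separated layout the function above expects.

```python
import csv
import datetime
import io
import zipfile


def parse_ten_minute_zip(zip_bytes, column="TT_10"):
    """Return {timestamp: value} for one column of the first file in the ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]  # assumes a single text file, as above
        text = zf.read(name).decode("utf-8")
    series = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        # Strip padding whitespace from header names and values
        row = {k.strip(): v.strip() for k, v in row.items()}
        ts = datetime.datetime.strptime(
            row["MESS_DATUM"], "%Y%m%d%H%M"
        ).replace(tzinfo=datetime.timezone.utc)
        series[ts] = float(row[column])
    return series


def make_sample_zip():
    """Build a tiny archive in memory; the values are invented."""
    content = (
        "STATIONS_ID;MESS_DATUM;TT_10;eor\n"
        "1766;202305140000;11.2;eor\n"
        "1766;202305140010;11.0;eor\n"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("sample_01766.txt", content)
    return buf.getvalue()


series = parse_ten_minute_zip(make_sample_zip())
print(len(series))  # 2
```

Keeping the parsing separate from the HTTP request also makes it easy to test against a saved archive.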

jdemaeyer / brightsky

Increase granularity to 10 minutes #148
