jdemaeyer / brightsky

JSON API for DWD's open weather data.
https://brightsky.dev/
MIT License
287 stars 18 forks source link

Improve fill-up mechanism for stations without full parameter set #129

Closed MatthiasDod closed 2 years ago

MatthiasDod commented 2 years ago

Hi, first of all thanks for providing the API. It makes life much easier!

But there seems to be a bug in the API. It is returning "null" even though there are values in the DWD source data files. I happens not permanently but e.g. 30. and 31.07 in the morning hours.

For this API call https://api.brightsky.dev/weather?date=2022-07-28T00:00:00&last_date=2022-07-30T00:00:00&dwd_station_id=04931&units=dwd launched on morning 30.07. you can find the data sources and JSON respons in the attached files.

Same happens today 31.07 from 3am to ~11:30am. Now (11:50am) API works fine again.

response_json.txt 10738-BEOB.csv

Thx, for your help.

jdemaeyer commented 2 years ago

Hi @MatthiasDod, thanks for the report!

I believe what happens here is the following (this is merely some technical documentation for myself, you can skip this if you're not interested in that ;)):

  1. The DWD publishes yesterday's historical weather for all stations around 10 am, creating a long backlog of to-be-parsed files for us. In particular, different weather parameters are published in separate files. This leads to partially-filled rows of historical weather records in the morning hours.
  2. Bright Sky will prefer historical weather records over current ones (and prefer those over forecast ones), as these have undergone more QA steps by the DWD. In other words, when querying during the "large parsing backlog" phase between ~10am and ~11am (the actual window is probably larger, as you point out above), Bright Sky will select a partially filled historical row as primary weather record source for the response.
  3. The above two points are well-known and accomodated for in Bright Sky, which will fill up the missing parameters in a partially-filled historical weather record with values from a similar record (usually the current record for the same location and timestamp).
  4. For performance reasons, the fill-up mechanism will try to fill all missing fields (i.e. almost all fields during the "long parsing backlog" window) from a single weather record.
  5. In this particular case, the current records for the Stuttgart-Echterdingen station never contain the cloud_cover, visibility, and condition fields (as well as the wind_gust_direction field, but that field is ignored in the fill-up mechanism).
  6. Through the combination of (4) and (5), the current records for Stuttgart-Echterdingen are never selected as fallback for the historical records of the same station. Given that there are no other available fallbacks available (because we filtered for that specifc station ID in the query and its forecast records expire after three hours), the partially-filled historical row is returned as-is, with many fields containing null.

This probably also affects many other stations where the DWD does not record all weather parameters. It seems that we can't get around softening up the performance-optimizing "must contain all missing fields" restraint in the fill-up mechanism, either by

  1. simply querying all remaining source IDs if the list of source IDs is small, or
  2. inserting a (or multiple) "at least one of the missing fields must be not null" (partial) fill-up before the current "all of the missing fields must be not null" fill-up.

There is also the option of completely refactoring the fill-up query or even the initial query to include database-side filling up using Postgres' window functions, although I vaguely remember playing around with that during the initial development and giving up on it for performance reasons.

I will try to prepare an implementation of option (2) and measure its impact on performance within the next days. :)

jdemaeyer commented 2 years ago

1efd721

A possible solution for this is now live. The performance penalty for performing a "some fields may still be missing" query first is about 25 % (for those queries that do need filling up). Given that our current p50 for the /weather endpoint is around 12 ms (p95 around 28 ms) that should be tolerable.

MatthiasDod commented 2 years ago

Thanks for the fix! I've testet it several days and it works now as expected.