Open isedwards opened 3 years ago
The problem with specifying a table with one entry for every expected occurrence of a periodic observation is the difficulty of coping with missing values. The CliDE obs_daily structure seems to follow CLICOM (and earlier Met Office CliMaster and probably CliData) in defining one entity for each occurrence expected in the sequence (i.e. 31 daily values in a month). If some of those values are accumulations over a longer period, as would happen at weekends in a 5-day manual station, then the missing values must be estimates, and the original observations are difficult to use. When extracting (say) a monthly total from a weekdays-only station, using the 31 daily values would result in an estimated total. Using the weekend accumulations would result in a measured total unless one of the weekend periods also spanned a weekend. This reduces the number of estimated values in a year from every month to about 4 months each year (in 2020 just February and March for rain and max temperature; Feb, March, May and June for min temperature).
I have long argued that the Climsoft Element definitions should not include words like "Daily" because this is unnecessary. Use of the Period attribute means that the TMPMAX element definition can be used for daily values, multiple days such as weekends, or for the 12-hourly values in the synoptic code.
Just to add another piece of
information, after defining a variable in MCH, e.g., 'watertemp',
the administrator must click on a button to create the MySQL
tables for the new variable. These tables are named after the
variable, prepending two letters to indicate different time
frequencies as, in our example, dmwatertemp (monthly), ddwatertemp
(daily) and dtwatertemp (sub-daily). These would be the most
useful tables from my point of view, although MCH creates also
another type of daily tables plus tables for weekly and 10-days
values.
Both in MCH and the Spanish (AEMET) database, monthly dates are assigned to the first day of the month. In our Noverber 2020 example: 2020-11-01
In MIDAS and Climsoft, all periodic observations are assigned to the time of observation. For monthly observations, (such as the traditional monthly raingauge which used a large cylindrical can to collect the rain and was read with a measuring stick), this would be read (in UK practice) at (nominal) 0900 UTC on the first of each month. Therefore the reading in the database for 1 January 0900 UTC would have the monthly total for December. This means that "casting back" the reading to the applicable month has to be done in the extraction query process. This adds complexity to the extraction query, but significantly reduces the chance of errors during the recording and storing processes, especially for paper forms. It also means that the measurement can be recorded even if the date of reading is not exact (for example 1 January is a Public Holiday in the UK, so the actual reading may be done on 2 January), and an estimated value can be added without hiding the original reading.
Both in MCH and the Spanish (AEMET) database, monthly dates are assigned to the first day of the month. In our Noverber 2020 example: 2020-11-01
@jaggh - Do you mean monthly dates or monthly data?
Hi, I meant monthly dates. I was not talking about to which date
are data assigned; I was explaining that monthly data of, e.g.,
February 2020, had assigned the date 2020-02-01 in the monthly
data table. (In a daily data table, it would indicate that the
data are for February the first, but in the monthly table it just
indicate data for February).
Jose
@jaggh -Thanks,it is clear now.
@Steve-Palmer drew my attention to this repo. I work at the UK Met Office and am a user of MIDAS.
I'm not familiar with previous discussions you may have had, so apologies if some of this is not relevant. Some initial thoughts:
Storing values against the observation time is definitely best from a traceability and accuracy point of view - it reflects what actually happened.
In the daily rainfall table in MIDAS there is a 'day count' column. Accumulations have a day count > 1. Could you generalise this approach to variables with other periods? e.g. an hour count for hourly data or a month count for monthly data. This would give maximum flexibility and a uniform approach.
Careful thought would need to be given to gauges that are read monthly. If the reading was made a day late then you would want this to be accurately reflected in the database e.g. day count = 32 rather than month count = 1.
Storing the data against observation time and combining this with a count column does require the user to understand the intricacies of the reporting process though e.g. in the UK the max and min temps for a given calendar day come from different 24-hour periods. To try and solve this (and reduce mistakes) we implemented a second database schema that applies business rules to the raw observations. It returns the 'best available value' for a given variable and calendar date. Could you take this approach?
Creating monthly totals when there are accumulations is tricky. To some degree you can only work with what you've got i.e. if an accumulation spans the month end then it's impossible to obtain a genuine monthly total. The value should be stored (or returned, if this is happening on-the-fly) with a non-zero missing day count. My feeling is it would be wasted effort to try and come up with business rules to do anything more sophisticated. It should be up to the data owners to reapportion the accumulation into daily values as part of a QC process and then flag the entries accordingly.
In Climsoft, the period over which an observation was made can be recorded in the
period
field of the observationfinal table.In practice the element code is used to determine both the element/parameter being measured and also the period of the observation (with the observation time occuring at the end of the observation period). For example, element code 2 TMPMAX is defined as the daily maximum temperature.
The observation date for a monthly observation is recorded at the end of the observation period. Therefore is should not be necessarily to store partial dates (e.g. 2020-11). In Climsoft metadata tables, fields like station opening/closing dates accept text input since the full date may not be known.
Default time to nominal time for observation. 0600 UTC (Kenya), 0900 UTC (UK)
Examples for monthly aggregation for 2020-11
2020-12-01T0600
(time depends on met service obs time) - casting back 0600 2nd day of the month and finish with 0600 1st following month2020-11-30:T0600
Open and closed date ranges.
In CliDE the period is determined by the table that is used to store the observation, e.g. obs_daily.
Related comment from @Steve-Palmer