opencdms-project / project-standards-team

👥🗎 This repository is the main collaborative space for the OpenCDMS Data Model Standard Working Group (DMS-WG)

Long vs wide #10

Open isedwards opened 4 years ago

isedwards commented 4 years ago

MCH and Climsoft store observation data in long format with one element per row (additionally, in MCH, each element has its own table).

CliDE stores observation data in wide format with fewer rows and one element per column (e.g. obs_daily)

Long format has performance issues when retrieving multiple elements for a given station/time.

Wide format leads to sparse tables where stations do not record all elements, and to complications when new elements need to be added.
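To make the contrast concrete, here is a minimal SQL sketch of the two layouts (table and column names are illustrative only, not taken from Climsoft, MCH or CliDE):

```sql
-- Long (narrow) layout: one element per row
CREATE TABLE observation_long (
    station_id  TEXT      NOT NULL,
    obs_time    TIMESTAMP NOT NULL,
    element_id  TEXT      NOT NULL,   -- e.g. 'tmax', 'tmin', 'rain'
    obs_value   NUMERIC,
    PRIMARY KEY (station_id, obs_time, element_id)
);

-- Wide layout: one element per column
CREATE TABLE observation_wide (
    station_id  TEXT      NOT NULL,
    obs_time    TIMESTAMP NOT NULL,
    tmax        NUMERIC,   -- columns stay NULL where a station never reports the element
    tmin        NUMERIC,
    rain        NUMERIC,
    PRIMARY KEY (station_id, obs_time)
);
```

Fetching several elements for one station/time touches several rows (or needs a pivot) in the long layout but a single row in the wide one, while adding a new element is just a new `element_id` value in the long layout but an `ALTER TABLE ... ADD COLUMN` in the wide one.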

Both approaches have implications for API design.

MCH and Climsoft will be slow to return multiple elements for many station/times.

CliDE would have to add many more columns if it accommodated additional elements, perhaps including hydrological or atmospheric ones.

See also: Wide and narrow data and tidy data

martinschweitzer commented 4 years ago

My thoughts:

  1. With modern CPUs and the cost of storage, I don't think that either the speed of retrieval or the space taken should be the major issue (I am assuming a database with less than 1TB of data - and for all of the countries that we support, there is much, much less than that).

  2. I agree that if the element list needs to be very flexible, then a vertical design may be more suitable than a horizontal design.

  3. For many queries where we are selecting across a row (e.g. wind speed and direction), it may be simpler to work with a horizontal design.

  4. CliDE has a somewhat hybrid design: the subdaily data use a horizontal design, but cloud cover has its own 'subtable'.

  5. If we choose a vertical design, we can generally emulate a "read only" horizontal design through the use of views (see the SQL sketch after this list). While the reverse is also true, I cannot think of a use case where we would want to do it.

  6. I am not sure of the implications for API design. Maybe you could expand on that.

  7. For a naive user wanting to do database queries, the horizontal design may be easier, but as mentioned above, this could be accommodated with views.
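To illustrate point 5, a read-only horizontal view over a vertical table could look roughly like this (reusing the hypothetical `observation_long` sketch from the opening comment; standard SQL, so it should work in PostgreSQL or MySQL):

```sql
-- Hypothetical read-only "horizontal" presentation of a vertical table
CREATE VIEW obs_wide_view AS
SELECT station_id,
       obs_time,
       MAX(CASE WHEN element_id = 'tmax' THEN obs_value END) AS tmax,
       MAX(CASE WHEN element_id = 'tmin' THEN obs_value END) AS tmin,
       MAX(CASE WHEN element_id = 'rain' THEN obs_value END) AS rain
FROM observation_long
GROUP BY station_id, obs_time;
```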

In ADAM (the Climate database for Australia) we store three times for each observation - UTC, Local Time and Local Time ignoring daylight saving. In a vertical design, each of the three times would be repeated for each element (unless the schema split the time of the observation off from the value).
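As a hypothetical sketch only, one way to avoid repeating the three times in a vertical design is to split the observation event off from its values:

```sql
-- The observation event carries the three times once ...
CREATE TABLE obs_event (
    event_id        BIGINT    PRIMARY KEY,
    station_id      TEXT      NOT NULL,
    time_utc        TIMESTAMP NOT NULL,
    time_local      TIMESTAMP NOT NULL,
    time_local_std  TIMESTAMP NOT NULL   -- local time ignoring daylight saving
);

-- ... and the vertical value rows reference the event instead of repeating them
CREATE TABLE obs_value (
    event_id    BIGINT  NOT NULL REFERENCES obs_event (event_id),
    element_id  TEXT    NOT NULL,
    obs_value   NUMERIC,
    PRIMARY KEY (event_id, element_id)
);
```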

If I were to write a CDMS from scratch, I would certainly consider using a vertical design - mostly because of the ease of adding new elements.

Finally, there is also an option of using a hybrid approach - e.g. storing "fundamental" elements in a horizontal table and storing less common elements either in a subtable or in a vertical form.

jaggh commented 4 years ago

I would like to add two considerations to the discussion:

  1. The MCH "one element, one table" approach gives users extreme flexibility to define their own variables. For example, in Guatemala MCH is used to store air quality variables, and their manager is able to define NOx concentrations, etc.

  2. Wind is the most evident example of a two-dimensional variable which would be better managed if direction and speed were stored in the same table. But it is not a big issue to have them in separate tables anyway...

Steve-Palmer commented 4 years ago

I think this description of the problem needs a little more elaboration. Whichever choice is made on the storage of entities will have advantages and disadvantages in performance. The two most extreme classes of enquiry are either many observation entities from many stations for a short range of date/times, or a small number (often one) of entities for a long range of date/times for one station. Typically the extreme cases will be a daily summary with all the observation types for many stations, or a 30-year dataset of one or two types for a small number of stations for further analysis (such as the wind rose discussed elsewhere, which would use hourly wind speed and direction).

So there are four reasonable alternatives - time-wide, entity-wide, normalised to the observation entity level or fully normalised.

The old CLICOM design was time-wide. Each record was for one observation entity, but contained 31 entries, one for each day of the month (with nulls for day 29, 30 and 31 according to the month and leap year). When Clidata was first set up, it followed this design, but changed later. This was arguably the worst of all possible designs, leading to the definition of distinct entities such as "Temperature at 00Z", "Temperature at 06Z" etc. (though it was easy to explain to people used to using paper forms only). The pre-1996 Met Office Climaster and Rainmaster databases were also time-wide.

The Climsoft data model is actually fully normalised - each row in the Observations table contains one observation for one element, with its associated datetime of observation, location and entity type (plus QC info). This has the disadvantage of the main table having many rows, but the advantage that there is no wasted space. Another advantage is that it is easy to add new entity types (such as river level gauging).

MCH groups the observation entity-types into tables, one per entity. The disadvantage is that any query which needs more than one observation entity must open multiple tables, which may impose an overhead.

MIDAS originally used an entity-wide approach - each observation row has a full set of observation entities. For many of the columns in this table, the entries will be missing. For this reason, MIDAS has separate tables for rainfall, because otherwise there would be huge amounts of wasted space for the many raingauge-only observation stations. The big disadvantage of an entity-wide approach is that the addition of a new entity (e.g. some of the new entities added when SYNOP was replaced by BUFR) is very difficult without adding new tables. I believe that (despite what it says in the User Manual) the physical implementation of MIDAS has changed to make it fully normalised, but the user View makes it appear as the original entity-wide implementation.

There are some physical options which can be explored, including segmentation of the database. The volatility of the data is an issue here - most write operations on a datum will occur within two years of the observation date (except in the case of historic data rescue). After two years, any further QC changes are unlikely. In addition, most queries will require data within one month or one year of the observation time. There may be linked issues about the backup strategy.

When MIDAS was first set up, there were physical advantages from segmenting the observation tables by time, with the first three years held in a current-data segment, and older data held in an archive segment, which was further split by region. With later implementations, this physical advantage disappeared, and the segmentation was removed. Segmentation was also explored in earlier versions of Climsoft, as a way to get around the physical limits of Microsoft Access on table size. Again, this need disappeared when the database was moved to MySQL, but the recommendation for an off-site physical backup of the non-volatile older data remains.
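None of the systems above necessarily does this today, but in a modern RDBMS this kind of time segmentation can be expressed with native range partitioning; a rough PostgreSQL (11 or later) sketch with made-up names and boundaries:

```sql
-- Hypothetical vertical table split into a volatile "current" segment
-- and a stable archive segment
CREATE TABLE obs_by_time (
    station_id  TEXT      NOT NULL,
    obs_time    TIMESTAMP NOT NULL,
    element_id  TEXT      NOT NULL,
    obs_value   NUMERIC,
    PRIMARY KEY (station_id, obs_time, element_id)
) PARTITION BY RANGE (obs_time);

-- Older, rarely updated data: candidate for off-site backup
CREATE TABLE obs_by_time_archive PARTITION OF obs_by_time
    FOR VALUES FROM (MINVALUE) TO ('2019-01-01');

-- Recent data still subject to QC changes
CREATE TABLE obs_by_time_current PARTITION OF obs_by_time
    FOR VALUES FROM ('2019-01-01') TO (MAXVALUE);
```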

@martinschweitzer in his comments under this topic refers to storage of observation time. I think this needs discussion under a separate issue.

martinschweitzer commented 4 years ago

While I don't want to muddy the waters, there are other options:

  1. A schemaless (or NoSQL) database.
  2. A regular relational database, but with a JSON blob for observations, i.e.:

| Date | stn_num | Observations |
| ---------- | ------- | ----------------------------------------- |
| 2020-01-01 | 10011 | {'rain': 20, 'tmax': 15, 'tmin': 12, ...} |

I know that both Oracle and PostgreSQL have very good support for JSON (e.g. https://www.postgresql.org/docs/9.4/datatype-json.html)

Option 2 may be worth considering.
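As a rough sketch of what option 2 could look like in PostgreSQL (table, column and key names are made up here):

```sql
-- One JSONB document of observations per station and date
CREATE TABLE obs_json (
    obs_date      DATE    NOT NULL,
    stn_num       INTEGER NOT NULL,
    observations  JSONB   NOT NULL,
    PRIMARY KEY (obs_date, stn_num)
);

INSERT INTO obs_json VALUES
    ('2020-01-01', 10011, '{"rain": 20, "tmax": 15, "tmin": 12}');

-- Individual elements remain reachable in ordinary queries
SELECT obs_date,
       stn_num,
       (observations ->> 'tmax')::NUMERIC AS tmax
FROM obs_json
WHERE observations ? 'tmax';   -- JSONB key-existence operator
```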

Steve-Palmer commented 4 years ago

It is worth testing options, but the overall requirement is that the business rules must be enforced, and developers must not be able to bypass these for any reason (including speed of execution).

CLICOM caused a lot of problems because the data entry modules directly addressed the database without the requirement for referential integrity (the underlying database was an RDBMS, but the Fortran code did not read the metadata). This meant that any mistyping of the station name in a data entry form did not result in rejection, but instead resulted in a new station being created without any other metadata. This was a conscious decision by the developers because they felt it was undesirable if key-entry forms were rejected. So very quickly, databases were full of multiple variants of the same station name, and multiple sets of overlapping key-entry data.
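For illustration only, the kind of referential integrity that would have caught this looks something like the following (all names are hypothetical):

```sql
CREATE TABLE station (
    station_id  TEXT PRIMARY KEY,
    name        TEXT NOT NULL
);

-- Observations must reference a station that exists in the register
CREATE TABLE station_obs (
    station_id  TEXT      NOT NULL REFERENCES station (station_id),
    obs_time    TIMESTAMP NOT NULL,
    element_id  TEXT      NOT NULL,
    obs_value   NUMERIC,
    PRIMARY KEY (station_id, obs_time, element_id)
);

INSERT INTO station VALUES ('MELBOURNE', 'Melbourne Regional Office');

-- Accepted: the station exists in the register
INSERT INTO station_obs VALUES ('MELBOURNE', '2020-01-01 09:00', 'tmax', 35.2);

-- Rejected with a foreign-key violation: a mistyped identifier no longer
-- silently creates a new, metadata-less station
INSERT INTO station_obs VALUES ('MELB0URNE', '2020-01-01 09:00', 'tmax', 35.2);
```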

martinschweitzer commented 4 years ago

Just for the record, I fully support the notion of constraints.

For our data rescue we had tables with loosened constraints that allowed us to enter exactly what was written on the form - this was then migrated to the "climate" database with the stronger constraints. It was felt that it was important to also preserve the "mistakes". The same can be argued for AWS observations where there may be spikes or noise.

Although I cannot speak with authority, PostgreSQL does appear to support constraints on JSON fields.
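For example, PostgreSQL allows CHECK constraints whose expressions reach into a JSONB column; a hypothetical sketch:

```sql
CREATE TABLE obs_doc (
    obs_date      DATE    NOT NULL,
    stn_num       INTEGER NOT NULL,
    observations  JSONB   NOT NULL,
    PRIMARY KEY (obs_date, stn_num),
    -- reject documents that omit a rainfall reading
    CONSTRAINT has_rain CHECK (observations ? 'rain'),
    -- reject negative rainfall totals (a non-numeric value would fail the cast)
    CONSTRAINT rain_not_negative
        CHECK ((observations ->> 'rain')::NUMERIC >= 0)
);
```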

I am still very much in two minds about JSON in a database. On the surface it appears to violate some of the basic principles of a relational database, but at the same time it seems to have some very useful features. Until recently, I strongly discouraged my developers from using JSON, but relaxed that attitude after seeing how much more elegant some solutions became when using JSON.