opencdms-project / project-standards-team

👥🗎 This repository is the main collaborative space for the OpenCDMS Data Model Standard Working Group (DMS-WG)

Unique id vs composite primary key #9

Open isedwards opened 4 years ago

isedwards commented 4 years ago

Unlike MCH and Climsoft, which both use composite primary keys, all tables in the CliDE database schema contain an id that can uniquely reference each row (e.g. see obs_daily)

This has implications for API design once the unique id being used in a given system is known, e.g.

GET /observation/361842

and/or

GET /observation?recordedFrom=60018&describedBy=5&obsDatetime=2020-01-01T01:00Z&...

This has been discussed in Climsoft here: #climsoft/477
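The two addressing styles above can be sketched side by side. This is a hypothetical illustration using Python's built-in sqlite3; the table and column names (obs_with_id, obs_composite, recorded_from, described_by, obs_datetime) are illustrative only and not taken from any actual CDMS schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# CliDE-style: a synthetic id uniquely references each row.
con.execute("""
    CREATE TABLE obs_with_id (
        id INTEGER PRIMARY KEY,
        recorded_from INTEGER,
        described_by INTEGER,
        obs_datetime TEXT,
        value REAL
    )""")

# MCH/Climsoft-style: the natural (composite) key is the primary key.
con.execute("""
    CREATE TABLE obs_composite (
        recorded_from INTEGER,
        described_by INTEGER,
        obs_datetime TEXT,
        value REAL,
        PRIMARY KEY (recorded_from, described_by, obs_datetime)
    )""")

con.execute("INSERT INTO obs_with_id VALUES (361842, 60018, 5, '2020-01-01T01:00Z', 23.4)")
con.execute("INSERT INTO obs_composite VALUES (60018, 5, '2020-01-01T01:00Z', 23.4)")

# GET /observation/361842 corresponds to a lookup by synthetic id...
by_id = con.execute(
    "SELECT value FROM obs_with_id WHERE id = ?", (361842,)).fetchone()

# ...while the query-parameter form corresponds to a composite-key lookup.
by_key = con.execute(
    "SELECT value FROM obs_composite "
    "WHERE recorded_from = ? AND described_by = ? AND obs_datetime = ?",
    (60018, 5, "2020-01-01T01:00Z")).fetchone()

print(by_id, by_key)  # (23.4,) (23.4,)
```

Both lookups return the same observation; the difference is only in what the client must know to address a row.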

Steve-Palmer commented 4 years ago

See my comment https://github.com/climsoft/Climsoft/issues/477#issuecomment-667951202

I suggest that, if there are physical advantages to the use of a single-column primary key, then this should only be implemented where there are strong and unavoidable controls for referential integrity, and that anyone other than a database administrator should only be presented with a view using a composite key. That is, developers MUST NOT be able to use the example above: GET /observation/361842
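Steve's proposal can be sketched concretely: keep a surrogate key internally, enforce the natural key with a UNIQUE constraint, and expose only a view that hides the surrogate id. A minimal sketch with sqlite3, assuming hypothetical names (observation, observation_v, station, element):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Internal table: single-column surrogate key for the DBA, but with a
# UNIQUE constraint on the natural key to preserve referential integrity.
con.execute("""
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        station INTEGER NOT NULL,
        element INTEGER NOT NULL,
        obs_datetime TEXT NOT NULL,
        value REAL,
        UNIQUE (station, element, obs_datetime)
    )""")

# View presented to everyone else: the surrogate id is hidden, so rows
# are addressable only by the composite natural key.
con.execute("""
    CREATE VIEW observation_v AS
    SELECT station, element, obs_datetime, value FROM observation
""")

con.execute("INSERT INTO observation (station, element, obs_datetime, value) "
            "VALUES (60018, 5, '2020-01-01T01:00Z', 23.4)")

# A second row with the same natural key is rejected, even though its
# surrogate id would have been different.
duplicate_rejected = False
try:
    con.execute("INSERT INTO observation (station, element, obs_datetime, value) "
                "VALUES (60018, 5, '2020-01-01T01:00Z', 99.9)")
except sqlite3.IntegrityError:
    duplicate_rejected = True

print("duplicate rejected:", duplicate_rejected)  # duplicate rejected: True
```

The design keeps any physical advantage of the single-column key while preventing application code from depending on it.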

martinschweitzer commented 4 years ago

The synthetic key/natural key debate is a tricky one, but I do have feelings about the API. I feel the API should expose 'natural' selectors - i.e. that map onto real world entities and that are discoverable from the API.

If I am reading Steve's comments above correctly, then I think I am in agreement with him - and agree with his example. However, does this also apply to a 'PUT' - i.e. updating a row in a table?

At this point, I am more comfortable with natural keys [e.g. (stn_num, datetime)] but also feel that this is an implementation decision and as such probably best dealt with later in the design. I am also happy to be convinced of the superiority of synthetic keys.
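One answer to Martin's PUT question is that a natural-key PUT becomes an idempotent upsert keyed on the composite key. A hedged sketch, again with sqlite3 and with hypothetical names (obs, stn_num, put_observation); the ON CONFLICT upsert syntax assumes SQLite 3.24 or later:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE obs (
        stn_num INTEGER,
        datetime TEXT,
        value REAL,
        PRIMARY KEY (stn_num, datetime)
    )""")

def put_observation(stn_num, dt, value):
    # What PUT /observation?stn_num=...&datetime=... could mean:
    # insert the row, or update it in place if the natural key exists.
    con.execute(
        "INSERT INTO obs (stn_num, datetime, value) VALUES (?, ?, ?) "
        "ON CONFLICT (stn_num, datetime) DO UPDATE SET value = excluded.value",
        (stn_num, dt, value))

put_observation(60018, "2020-01-01T01:00Z", 23.4)
put_observation(60018, "2020-01-01T01:00Z", 23.6)  # repeated PUT updates in place

rows = con.execute("SELECT * FROM obs").fetchall()
print(rows)  # [(60018, '2020-01-01T01:00Z', 23.6)]
```

Repeating the PUT leaves exactly one row, which matches the idempotency normally expected of PUT; no synthetic id ever appears in the URL.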

DenisStuber commented 4 years ago

Hi to all,

Considering the emails on data models and the different model designs, I would suggest writing down our specifications before choosing any specific model. For example, we could write specifications covering three domains:

* Domain 1, climate/meteorological rules: e.g. allowing different values for the same observation according to the validity date and control phases, allowing different sensors for the same variable, handling gridded data, WIGOS Metadata, etc.
* Domain 2, international and standards rules (ISO, OGC, W3C, etc.): e.g. METCE, Observations and Measurements, data provenance, data quality, GIS capability, etc.
* Domain 3, future specifications: what new standards in data management could answer domains 1 and 2? Drafts of (platform-independent) database schemas would be needed to support constructive dialogue and solutions; also check systems outside the traditional meteorological world.

Some explanations about our discussions, where every solution has its advantages and its disadvantages:

For CLIDATA: "The old CLICOM design was time-wide. Each record was for one observation entity, but contained 31 entries, one for each day of the month (with nulls for days 29, 30 and 31 according to the month and leap year). When Clidata was first set up, it followed this design, but changed later. This was arguably the worst of all possible designs, leading to the definition of distinct entities such as "Temperature at 00Z", "Temperature at 06Z" etc. (though it was easy to explain to people used to using paper forms only). The pre-1996 Met Office Climaster and Rainmaster databases were also time-wide." (Steve Palmer)

We will need to contact Radim Tolaz on that subject. CLIDATA is still very similar to the CLICOM tables. The CLICOM data model was not the worst, and we need to consider both the daily data and the hourly data tables. The NMHSs that chose a "Temperature at 06Z" entity did so because they wanted to use the monthly table of daily values; but they could have chosen the daily table of hourly values without any problem. Here it is more a problem of (1) training and/or (2) performance of the system.
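The time-wide vs long ("one row per observation") distinction being debated here can be illustrated with a small conversion sketch. Field names (station, element, days) are hypothetical, not from CLICOM or CLIDATA:

```python
import calendar

# Time-wide layout: one record per station/element/month, with 31 day
# slots padded with nulls, as in the old CLICOM design described above.
wide = {
    "station": 60018,
    "element": "TMAX",
    "year": 2020,
    "month": 2,
    "days": [5.1, 4.8, 6.0] + [None] * 28,  # 31 slots
}

# Number of real days in the month (29 for February 2020, a leap year).
ndays = calendar.monthrange(wide["year"], wide["month"])[1]

# Long layout: one (station, element, date, value) row per observation,
# dropping the padding nulls.
long_rows = [
    (wide["station"], wide["element"],
     f'{wide["year"]}-{wide["month"]:02d}-{day:02d}', value)
    for day, value in enumerate(wide["days"][:ndays], start=1)
    if value is not None
]

print(long_rows)
# [(60018, 'TMAX', '2020-02-01', 5.1), (60018, 'TMAX', '2020-02-02', 4.8),
#  (60018, 'TMAX', '2020-02-03', 6.0)]
```

The long form is what a composite key like (station, element, date) naturally addresses; the time-wide form forces the leap-year null padding and per-hour entity proliferation discussed above.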

For CLICOM: "CLICOM caused a lot of problems because the data entry modules directly addressed the database without the requirement for referential integrity (the underlying database was an RDBMS, but the Fortran code did not read the metadata). This meant that any mistyping of the station name in a data entry form did not result in rejection, but instead resulted in a new station being created without any other metadata. This was a conscious decision by the developers because they felt it was undesirable if key-entry forms were rejected. So very quickly, databases were full of multiple variants of the same station name, and multiple sets of overlapping key-entry data." (Steve Palmer)

Not at all. The CLICOM system emphasized the use of the Fortran programs precisely because those programs did read the metadata and prevented undeclared stations and parameters from entering the database. The other route was for database administrators to import files straight into the DataEase database, and there the administrator had two choices: (1) keep the database constraints (but for big files the server would be blocked for weeks, with nobody able to use CLICOM for daily work; impossible to imagine when you had power cuts in developing countries), or (2) remove the constraints, knowing exactly what you were doing. As a database administrator you still face those same choices with any modern RDBMS, and the second is very often needed, as long as you know what you are doing. Once again it is a question of competence/training and the reality of the field.

I will ask Radim Tolaz, Rachid Sebarri and other specialists to take part in these future-structuring discussions.

See you soon, Denis.

----- Météo-France ----- STUBER DENIS DCSC/POC denis.stuber@meteo.fr Fixe : +33 561078337
