pdbxmmcifwg / diffrn-data-set-extension

PDBx mmCIF dictionary extension for diffraction data sets

How flexible is the relationship between data sections and data blocks intended to be? #3

Open pkeller opened 4 years ago

pkeller commented 4 years ago

In the data set examples, there is one data section per data block, and the name of the data block is the same as the value assigned to _pdbx_diffrn_data_section.id (and implicitly I guess to _datablock.id). My question is how flexible or strict do we want this relationship to be? For example:

IMHO it is best to decide on explicit rules about this kind of thing, rather than expecting developers to make inferences from examples.

epeisach commented 4 years ago

In my opinion:

_pdbx_diffrn_data_section_contents.data_section_id should point to a datablock - this is a table of contents. Then you could parse only the data blocks you want, if the parser you use allows for this (you would still read the whole file, but not keep in memory the data blocks you do not care about).
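A minimal sketch of that selective loading, assuming the wanted datablock names have already been read from _pdbx_diffrn_data_section_contents.data_section_id in the first datablock. The function names `iter_blocks` and `load_selected` are made up for illustration, and the sketch only looks for `data_` headers at the start of a line, so it is not a real CIF parser:

```python
def iter_blocks(cif_text):
    """Yield (name, lines) for each datablock, one block at a time.

    Assumption: a datablock starts at any line beginning with 'data_'.
    Multi-line text fields that happen to contain such a line are not handled.
    """
    name, lines = None, []
    for raw in cif_text.splitlines():
        stripped = raw.strip()
        if stripped.lower().startswith("data_"):
            if name is not None:
                yield name, lines
            name, lines = stripped[5:], []
        elif name is not None:
            lines.append(raw)
    if name is not None:
        yield name, lines


def load_selected(cif_text, wanted):
    """Keep only the datablocks whose names are in `wanted`; drop the rest."""
    return {name: lines for name, lines in iter_blocks(cif_text) if name in wanted}
```

Because `iter_blocks` is a generator, unwanted blocks are discarded as soon as they have been scanned. The same loop could run over an open file object instead of a string.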

Within a datablock, _pdbx_diffrn_data_section.id should match the datablock name - as a cross check.
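The cross check could look roughly like the sketch below. It assumes the id is written as a simple single-line `_tag value` item, compares names exactly (case-sensitive), and skips datablocks that carry no data section id at all, such as a leading table-of-contents block. `mismatched_blocks` is a hypothetical helper name, not part of any existing library:

```python
TAG = "_pdbx_diffrn_data_section.id"


def section_ids_by_block(cif_text):
    """Map each datablock name to the data section ids found inside it.

    Assumption: only single-line '_tag value' items are recognised;
    loop_ constructs and multi-line values are ignored.
    """
    blocks = {}
    current = None
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.lower().startswith("data_"):
            current = line[5:]
            blocks[current] = []
        elif current is not None:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0].lower() == TAG:
                # Drop simple surrounding quotes, if any.
                blocks[current].append(parts[1].strip().strip("'\""))
    return blocks


def mismatched_blocks(cif_text):
    """Return (block name, ids) pairs where the id does not equal the block name."""
    return [
        (name, ids)
        for name, ids in section_ids_by_block(cif_text).items()
        if ids and ids != [name]
    ]
```

Whether a mismatch should then be reported as an error or only as a warning is exactly the kind of rule that still needs to be agreed.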

Now - what should the rules be about the datablock names?

The only data block whose position in the file is significant should be the first one - which lists what other data blocks exist.

pkeller commented 4 years ago

@CV-GPhL has proposed a structure that puts all data sections into a single datablock here: https://github.com/pdbxmmcifwg/diffrn-data-set-extension/issues/2#issuecomment-559104906. Although his example needs some modification, I think that we should consider carefully the relationship between data sections and datablocks, and make sure that we come to a consensus. It is a fundamental issue IMHO.

The current example has one data section per datablock, with an additional audit-type datablock at the top of the file. This makes links between data in different datablocks implicit, something that I consider problematic. I could make proposals to solve that, involving some (probably minor) changes to the DDL.

OTOH, agreeing to put everything in one datablock would allow the DDL category item_linked to be used to define the proper relationships between data sections. This would represent a big change to what has been done up to now, though. It would also involve some work on the data section extension dictionary: I would be happy to make proposals for changes if there is a consensus that this approach is worth exploring.

If this issue has already been discussed and decided before I started to become involved, please say so :wink:

jdwestbrook commented 4 years ago

From an archiving perspective we would prefer to adopt a modular packaging strategy similar to what is put forward in the current set examples. May I suggest that we leave the issue of managing relationships between data sections/blocks until the scope and particular content details is sorted. Once there is consensus on content, we can provide the simplest technical approach to managing the relationships between data sections that will support these content requirements.

pkeller commented 4 years ago

> From an archiving perspective we would prefer to adopt a modular packaging strategy similar to what is put forward in the current set examples.

OK - so I'll work on the basis of no more than one data section per datablock, unless/until there is an explicit change. This implies that if an application does encounter a datablock that contains multiple data sections, that should be considered an error. In theory the contents of such a datablock could be split up into multiple datablocks, but that might be hard and/or error-prone.
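As a rough illustration of that rule, the sketch below counts the values of _pdbx_diffrn_data_section.id in each datablock and reports any block with more than one. It is a simplified reading of CIF syntax, not a full parser: it assumes one loop row per line, whitespace-free unquoted values, and no multi-line text fields. The function names are hypothetical:

```python
TAG = "_pdbx_diffrn_data_section.id"


def count_sections(cif_text):
    """Count data section ids per datablock (single items and loop_ rows)."""
    counts = {}
    block = None
    loop_tags = None   # tags of the loop being read, or None outside a loop
    in_header = False  # True while still reading the loop's tag list
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        low = line.lower()
        if low.startswith("data_"):
            block = line[5:]
            counts[block] = 0
            loop_tags = None
        elif block is None:
            continue
        elif low == "loop_":
            loop_tags, in_header = [], True
        elif line.startswith("_"):
            parts = line.split()
            if loop_tags is not None and in_header and len(parts) == 1:
                loop_tags.append(low)      # another column of the current loop
            else:
                loop_tags = None           # an ordinary item ends any loop
                if parts[0].lower() == TAG and len(parts) > 1:
                    counts[block] += 1
        elif loop_tags is not None:
            in_header = False
            if TAG in loop_tags:
                counts[block] += 1         # one loop row = one data section
    return counts


def blocks_with_multiple_sections(cif_text):
    """Names of datablocks that would be errors under the one-section rule."""
    return [name for name, n in count_sections(cif_text).items() if n > 1]
```

An application could raise an error for every name returned by `blocks_with_multiple_sections`, rather than trying to split the offending datablock.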

Does anyone other than Ezra have any thoughts about the first question that I asked? Should the datablock name and the data section id always be the same? If so, should a mismatch be treated as an error, or perhaps the datablock be renamed to match the data section id?

> May I suggest that we leave the issue of managing relationships between data sections/blocks until the scope and particular content details is sorted. Once there is consensus on content, we can provide the simplest technical approach to managing the relationships between data sections that will support these content requirements.

Sure, I have a few thoughts, but nothing concrete enough to call a proposal yet.