openownership / lib-cove-bods

Check that your data complies with the Beneficial Ownership Data Standard (BODS): install our data review library to analyse files via your command line interface
https://datareview.openownership.org/

Updated check: version number #124

Open kathryn-ods opened 3 months ago

kathryn-ods commented 3 months ago

This check is currently in cove as inconsistent_schema_version_used. It needs to be rewritten to allow for inconsistent minor versions.

Check: all Statements MUST have the same major version number.

On fail:

Error message: Statements have different major version numbers.

Info message: Version number (bodsVersion): [VALUE], Version number (bodsVersion): [VALUE2]
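For illustration, here is a minimal sketch of how the rewritten check might look. It is not the existing cove implementation: the function names are hypothetical, it assumes bodsVersion sits under publicationDetails in each statement, and it assumes statements with a missing or unparseable bodsVersion are simply skipped (the issue does not say how those should be treated).

```python
import re


def get_bods_version(statement):
    """Return bodsVersion for a statement, or None.

    Assumption: the field lives under publicationDetails.
    """
    details = statement.get("publicationDetails")
    if isinstance(details, dict):
        return details.get("bodsVersion")
    return None


def major_version(bods_version):
    """Return the MAJOR part of a string such as "0.4", or None if unparseable."""
    if not isinstance(bods_version, str):
        return None
    match = re.match(r"([0-9]+)\.[0-9]+", bods_version)
    return int(match.group(1)) if match else None


def check_major_versions(statements):
    """Return None on pass, or a dict with the error and info messages on fail."""
    seen = []  # distinct parseable version strings, in order of first appearance
    for statement in statements:
        version = get_bods_version(statement)
        if major_version(version) is not None and version not in seen:
            seen.append(version)
    # Differing minor versions are fine; only differing majors fail the check.
    if len({major_version(v) for v in seen}) <= 1:
        return None
    return {
        "error": "Statements have different major version numbers.",
        "info": ", ".join(f"Version number (bodsVersion): {v}" for v in seen),
    }
```

With this sketch, a dataset mixing "0.3" and "0.4" passes, while one mixing "0.4" and "1.0" fails with both values listed in the info message.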

kathryn-ods commented 3 months ago

@radix0000 does it make sense to implement this test at this point in time? The data is only invalid if the major versions don't match, so any invalid test data would need to include statements with e.g. "1.0" and "0.4". As 1.0 doesn't exist yet, would that be flagged up for not being a valid BODS version as well as for having inconsistent values?

kathryn-ods commented 3 months ago

@kd-ods you might be able to advise on the above now you're back

kd-ods commented 2 weeks ago

This is a special kind of check, since the outcome relates to how the whole dataset is processed. I think we should hold off implementing this. Pre-v1, things are having to be handled a little differently.

For future reference this is where I think we are and where we are going:

At this point (following the BODS 0.4 release)

When it comes to the DRT 'choosing' which version of the schema to validate a dataset against, it looks at the first statement in the dataset and:

@radix0000 - is that right? (We should document exactly what the process is.)

After BODS v1

This check, that 'all Statements MUST have the same major version number', is done as part of the initial parsing of the data.

On fail: the dataset is not validated and the user gets an informative error message.

On pass (case (a)): the dataset is validated against BODS 0.1

On pass (case (b)): the dataset is validated against the latest MINOR.PATCH release for the given MAJOR version number.
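As a rough sketch of the 'latest MINOR.PATCH release for the given MAJOR version number' step: the function name is hypothetical, and the list of releases is passed in by the caller as well-formed MAJOR.MINOR.PATCH strings. Nothing here reflects how the DRT currently stores its known versions.

```python
def parse_release(release):
    """Turn "1.10.0" into (1, 10, 0) so releases compare numerically."""
    return tuple(int(part) for part in release.split("."))


def latest_release_for_major(major, releases):
    """Return the highest release string with the given MAJOR, or None."""
    candidates = [parse_release(r) for r in releases if parse_release(r)[0] == major]
    if not candidates:
        return None
    # Tuples compare element by element, so (1, 10, 0) beats (1, 9, 0).
    return ".".join(str(number) for number in max(candidates))
```

Comparing parsed tuples rather than raw strings avoids "1.9.0" sorting above "1.10.0".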

Reflections

Having worked through all that... maybe post BODS v1 we should actually do a complete overhaul of the DRT too. We could relegate the work so far to a 'beta' version, then clean everything up for a v1 of the DRT. We would then direct pre-BODS v1 users to the beta version of the tool and BODS v1+ users to the new release. That way we don't need to maintain any overly complicated BODS version handling.

radix0000 commented 2 weeks ago

@kd-ods Re the DRT choosing a schema version: it is slightly more complicated than that, because as well as bodsVersion not being present, the cases where bodsVersion isn't a string or isn't in the list of known versions need to be covered. The main tweak I have introduced is that it detects whether the data is record-based (i.e. whether the statement has "recordDetails", "recordId" or "recordType"). If so, it doesn't use BODS 0.1 as the default; instead it uses the latest version (currently 0.4). Having these two categories, record-based and non-record-based, with different defaults for each seems sensible to me (given how different they are), but let me know what you think. There is also a question of what the best defaults are (e.g. out of 0.1, 0.2 and 0.3, what is the "most used" version, and should we be using that as the default for non-record-based data?).

kd-ods commented 1 week ago

Ah, thanks @radix0000. So is this a correct summary of what happens atm?

  1. The entire dataset is validated against a single schema version.

  2. The schema version is selected based on the contents of the first Statement in the array.

  3. If that first statement is 'record-based', the whole dataset is validated against bodsVersion (if it is present and valid). If that field is missing or not valid, validation is against BODS 0.4.

  4. If that first statement is not record-based, the whole dataset is validated against bodsVersion (if it is present and valid). If that field is missing or not valid, validation is against BODS 0.1.

(If so - that looks sensible to me.)
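To make the four-step summary concrete, here is a hedged sketch of the selection logic as described in this thread. It is not taken from the lib-cove-bods source: the function and constant names are hypothetical, bodsVersion is assumed to sit under publicationDetails, and the fallback for an empty dataset is an assumption.

```python
KNOWN_VERSIONS = ("0.1", "0.2", "0.3", "0.4")
RECORD_KEYS = ("recordDetails", "recordId", "recordType")
DEFAULT_RECORD_BASED = "0.4"      # latest version at the time of the thread
DEFAULT_NON_RECORD_BASED = "0.1"


def is_record_based(statement):
    """A statement counts as record-based if it has any of the record keys."""
    return any(key in statement for key in RECORD_KEYS)


def choose_schema_version(statements):
    """Pick the single schema version used to validate the whole dataset."""
    first = statements[0] if statements else {}
    if not isinstance(first, dict):
        first = {}  # assumption: treat a malformed first statement as empty
    details = first.get("publicationDetails")
    version = details.get("bodsVersion") if isinstance(details, dict) else None
    # "Present and valid" here means: a string that is in the known versions.
    if isinstance(version, str) and version in KNOWN_VERSIONS:
        return version
    if is_record_based(first):
        return DEFAULT_RECORD_BASED
    return DEFAULT_NON_RECORD_BASED
```

Only the first statement is inspected, matching step 2; later statements with a different bodsVersion do not affect the choice.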