sigmf / sigmf-python

Easily interact with Signal Metadata Format (SigMF) recordings.
https://sigmf.org
GNU Lesser General Public License v3.0

[FEATURE] Pydantic backend for Data Validation #61

Open gregparkes opened 4 months ago

gregparkes commented 4 months ago

TL;DR - This PR is derived from issue #58 to automatically support data validation using Pydantic, a JSON- and JSON Schema-friendly validation library.

At this point, the PR only defines the schema and basic validations - I have not supplied any means of integrating it into the current library, so all existing SigMFFile behaviour remains unchanged.

Changes

A number of files were added within the component directory (name open to suggestions), the main one being the pydantic_metadata.py script, which contains a Pydantic definition of the JSON Schema as specified in the main SigMF repository.

The pydantic_metadata.py script defines the SigMF Metadata Standard.

Features

To the best of my ability, these classes mirror the defined JSON Schema standard and go above and beyond in many ways, including the following features (a sketch follows the list):

  1. core:datatype, version, and DOI strings use regex patterns to ensure compliance (see pydantic_types.py).
  2. core:version (GlobalInfo), core:uuid (Annotation), and core:datetime (Capture) use default factories to fill in automatically on creation if not defined beforehand (auto-filling timestamps, version numbers, etc.).
  3. core:collection, core:dataset, and core:license use pathlib.Path and Pydantic's HttpUrl objects, which supply extra functionality when instantiated.
  4. Index attributes (such as core:sample_start) are validated as non-negative or positive integers.
  5. Mutual exclusivity between core:dataset and core:metadata_only is validated.
  6. Captures and Annotations are automatically sorted by their respective core:sample_start.
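
For illustration, here is a minimal Pydantic v2 sketch of the pattern (the class names, field set, and the regex are simplified for this example and are not the exact PR code):

```python
import uuid
from typing import Optional

from pydantic import BaseModel, Field, model_validator


class GlobalInfo(BaseModel):
    # core:datatype must satisfy the SigMF datatype grammar (pattern simplified).
    datatype: str = Field(alias="core:datatype",
                          pattern=r"^[cr][fiu](8|16|32|64)(_le|_be)?$")
    # core:version auto-fills when not supplied.
    version: str = Field(alias="core:version", default="1.0.0")
    dataset: Optional[str] = Field(alias="core:dataset", default=None)
    metadata_only: Optional[bool] = Field(alias="core:metadata_only", default=None)

    @model_validator(mode="after")
    def _dataset_xor_metadata_only(self):
        # core:dataset and core:metadata_only are mutually exclusive.
        if self.dataset is not None and self.metadata_only:
            raise ValueError("core:dataset and core:metadata_only are mutually exclusive")
        return self


class Annotation(BaseModel):
    # Index attributes must be non-negative integers.
    sample_start: int = Field(alias="core:sample_start", ge=0)
    # core:uuid is generated automatically when absent.
    uuid: str = Field(alias="core:uuid", default_factory=lambda: str(uuid.uuid4()))
```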

How to use

Creating an object

I've added a helper method SigMFMetaFileSchema.from_file() which takes a .sigmf-meta file path and returns the Pydantic object for it.
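
For example (the import path follows the PR's file layout; the file name is illustrative):

```python
from sigmf.component.pydantic_metadata import SigMFMetaFileSchema

# Parse and validate in one step; pydantic raises a ValidationError
# if the metadata does not conform to the schema.
meta = SigMFMetaFileSchema.from_file("recording.sigmf-meta")
```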

Using the object

All of the attributes are reachable by name, e.g. core:version becomes obj.global_info.version.
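
Continuing the example above (the captures field name is my assumption, mirroring the standard's layout):

```python
print(meta.global_info.version)        # core:version, e.g. "1.0.0"
print(meta.captures[0].sample_start)   # core:sample_start of the first capture
```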

Exporting an object

Once a SigMFMetaFileSchema object is created, it can be exported to a dictionary with .model_dump(), or to a JSON string (prior to storage in a file or transfer over the network) with .model_dump_json(by_alias=True, exclude_none=True). Setting by_alias and exclude_none to True is important to ensure the core attributes all keep their core: prefix and unset fields are omitted.
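
A round-trip sketch (writing the string back to a file is my own illustration, not part of the PR):

```python
# Serialize with the "core:" aliases restored and unset fields omitted.
json_str = meta.model_dump_json(by_alias=True, exclude_none=True)

# Write back out as a .sigmf-meta file.
with open("recording.sigmf-meta", "w") as f:
    f.write(json_str)

# Or produce a plain dictionary for further manipulation.
meta_dict = meta.model_dump(by_alias=True, exclude_none=True)
```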

Accessing the schema

The JSON schema of the SigMFMetaFileSchema can be accessed using .model_json_schema(), allowing you to integrate with any legacy code using the schema.
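
For example, to dump the generated schema and diff it against the canonical one:

```python
import json

# by_alias keeps the "core:" property names in the emitted schema.
schema = SigMFMetaFileSchema.model_json_schema(by_alias=True)
print(json.dumps(schema, indent=2))
```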

Testing

I've supplied some unit tests which seem to cover the basic cases, although a few extra real examples would be pretty handy, and I haven't yet properly checked how its outputs compare to the current outputs from SigMFFile.

Current code coverage results (pytest --cov=sigmf && coverage report):

Name                                     Stmts  Miss  Branch  BrPart  Cover
sigmf/component/__init__.py                  1     0       0       0   100%
sigmf/component/extensions/__init__.py       1     0       0       0   100%
sigmf/component/extensions/core.py           8     0       0       0   100%
sigmf/component/geo_json.py                 31     0       8       0   100%
sigmf/component/pydantic_metadata.py       110     0      24       2    99%
sigmf/component/pydantic_types.py            7     0       0       0   100%

The pipeline I've been using is a Python 3.7 environment in Anaconda.

Next steps

At the moment there is no code for manipulating the Pydantic objects (aside from creation), in order to keep controller functionality separate from the 'data' component.

However, supplying code to convert these objects into nested dictionaries or write them to file should be trivial.

Integration

I'm basically seeking some guidance and ideas on how to integrate this into the existing sigmf-python classes.

I would suggest introducing this as an optional backend in the next version, with it becoming the default in a later release.

Something like adding a backend="pydantic" parameter to the sigmf.sigmffile.fromfile method, or similar.
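
Hypothetically, something along these lines (the signature and the _legacy_fromfile helper are purely illustrative):

```python
def fromfile(filename, backend="legacy"):
    """Load a SigMF recording using the chosen metadata backend."""
    if backend == "pydantic":
        # Validate eagerly via the Pydantic schema before handing off.
        meta = SigMFMetaFileSchema.from_file(filename)
        return SigMFFile(metadata=meta.model_dump(by_alias=True, exclude_none=True))
    return _legacy_fromfile(filename)  # current behaviour, unchanged
```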

Also happy for any changes to names / suggestions to file or internal objects.

SigMF Collections

I've begun an implementation of the SigMF Collection standard, but I'm less familiar with this object, so I need to play around with it some more.

777arc commented 4 months ago

Was pydantic_metadata.py entirely auto-generated from the JSON schema, or were there any manual tweaks that needed to be made?

gregparkes commented 4 months ago

> Was pydantic_metadata.py entirely auto-generated from the JSON schema, or were there any manual tweaks that needed to be made?

Unfortunately a decent number of manual tweaks needed to be made - in particular, the autogeneration tool turned every variable such as core:generator in the schema into a variable name like core_generator.

This:

  + Maintains uniqueness of each variable and allows extensions to have the same variable name as a core attribute.
  - Makes the variable names longer, which is annoying to write and read.

The tool also generated mostly base Python types (e.g. int, str, float) for each attribute and did not supply any special typing, e.g. regex-compliant strings, positive integers (such as core:sample_count), and so on.

The custom validation and serialization code associated with each object is also not generated, as a number of the rules are specified in the SigMF standard documentation (found here) but not actually implemented in the underlying JSON schema - for example, sorting the captures and annotations arrays by core:sample_start, or ensuring core:freq_upper_edge > core:freq_lower_edge. We solve this in Pydantic by enforcing these rules during validation.
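
In Pydantic v2 terms, a sketch of that kind of rule (field set simplified, not the exact PR code):

```python
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator, model_validator


class Annotation(BaseModel):
    sample_start: int = Field(alias="core:sample_start", ge=0)
    freq_lower_edge: Optional[float] = Field(alias="core:freq_lower_edge", default=None)
    freq_upper_edge: Optional[float] = Field(alias="core:freq_upper_edge", default=None)

    @model_validator(mode="after")
    def _check_freq_edges(self):
        # A rule stated in the SigMF spec text but absent from the JSON schema.
        if (self.freq_lower_edge is not None
                and self.freq_upper_edge is not None
                and self.freq_upper_edge <= self.freq_lower_edge):
            raise ValueError("core:freq_upper_edge must exceed core:freq_lower_edge")
        return self


class SigMFMetaFileSchema(BaseModel):
    annotations: List[Annotation] = Field(alias="core:annotations", default_factory=list)

    @field_validator("annotations")
    @classmethod
    def _sort_annotations(cls, v):
        # Keep the annotations array ordered by core:sample_start.
        return sorted(v, key=lambda a: a.sample_start)
```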