sigmf / sigmf-python

Easily interact with Signal Metadata Format (SigMF) recordings.
https://sigmf.org
GNU Lesser General Public License v3.0

[MAJOR FEATURE] Overhaul library to use Data Validation via Pydantic #58

Open gregparkes opened 5 months ago

gregparkes commented 5 months ago

Aim

I propose overhauling the library as it currently stands to introduce a dependency on pydantic, an excellent data validation library that works hand-in-glove with JSON and JSON schemas.

Let me know what you think of the below, and I'd consider developing a basic PR.

Note: I'm aware that Pydantic requires Python >= 3.7, and there are comments mentioning that backwards compatibility (specifically with Python 3.6) is needed. If that is the case, the Pydantic parts could be implemented as optional if desired.

Benefits

This would serve to provide the following benefits:

  1. Provide a clear dataclass format for each of the data components, making it easy to deprecate or change fields if modifications are introduced to the core SigMF schema.
  2. Replace the hand-written validation code for .sigmf-meta JSON files: Pydantic performs type, length and semantic checking on your behalf.
  3. Significantly extend the existing validation at minimal cost (without requiring lots of extra tests) by introducing semantic and regex checks.
  4. Minimal computational overhead (Pydantic V2's core validation logic is written in Rust).
  5. Work seamlessly with the IQEngine package you are also developing, since Pydantic works great with JSON for HTTP requests, FastAPI and other common web libraries.

Examples

For example, to define the schema for a SigMF Global object, one might write:

from typing import Optional

from pydantic import BaseModel, FilePath

# mirrors the Global Object as defined in the schema
class SigMFGlobal(BaseModel):
    sample_rate: float
    datatype: str
    # Optional[str] rather than str | None keeps Python < 3.10 compatibility;
    # optional fields need an explicit default in Pydantic V2
    author: Optional[str] = None
    # Path-like type that also validates that the referenced file exists
    dataset: FilePath
    ...  # remaining Global Object fields elided
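As a quick illustration of what this buys us (a hedged sketch against the abbreviated model above), invalid metadata is rejected at construction time:

from pydantic import ValidationError

try:
    # sample_rate cannot be coerced to float, and dataset must point to an
    # existing file, so this raises unless both conditions hold
    meta = SigMFGlobal(sample_rate="fast", datatype="cf32_le", dataset="recording.sigmf-data")
except ValidationError as err:
    print(err)  # structured report of every field that failed validation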

and for a single meta file one would define:

from typing import Annotated, List

from pydantic import BaseModel, Field

class SigMFFile(BaseModel):
    # "global" is a reserved keyword in Python, hence the aliased field name
    global_: Annotated[SigMFGlobal, Field(alias="global")]
    # a valid recording contains at least one capture;
    # SigMFCapture and SigMFAnnotation would be defined analogously to SigMFGlobal
    captures: Annotated[List[SigMFCapture], Field(min_length=1)]
    annotations: List[SigMFAnnotation]

Calling BaseModel.model_dump_json() automatically converts nested BaseModel objects into a JSON-compliant string, which is trivial to write to disk. Similarly, a .sigmf-meta file read from disk can be parsed by a SigMFFile object (which extends BaseModel) to validate the JSON against the schema without any custom checking code.
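A minimal round-trip sketch, assuming the SigMFFile model above and a hypothetical recording.sigmf-meta file on disk:

with open("recording.sigmf-meta") as f:
    meta = SigMFFile.model_validate_json(f.read())  # parse + validate in one step

# serialize back out, emitting the "core:"-style aliases rather than field names
with open("copy.sigmf-meta", "w") as f:
    f.write(meta.model_dump_json(by_alias=True, indent=2))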

Pydantic can handle custom serialization formats by using aliasing, for example:

from typing import Annotated

from pydantic import AliasChoices, BaseModel, Field

class SigMFGlobal(BaseModel):
    sample_rate: Annotated[float, Field(
        # stores core:sample_rate into the JSON file when serialized
        serialization_alias="core:sample_rate",
        # accepts either core:sample_rate or sample_rate when reading in a file
        validation_alias=AliasChoices("core:sample_rate", "sample_rate"),
    )]

This helps to handle any discrepancy between the JSON naming format, e.g. "core:" or "my_custom_extension:<item>", and the Python variable name.

Extensions

Pydantic would also be a great way to handle custom extensions for SigMF: any custom user extension could simply subclass a pre-defined extension base class, as in the sketch below.
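A hedged sketch of what that could look like; SigMFExtension and the antenna field names are hypothetical, not part of any existing API:

from typing import Annotated, Optional

from pydantic import BaseModel, Field

# hypothetical base class that the library could provide for extensions
class SigMFExtension(BaseModel):
    pass

# a user-defined extension whose fields serialize under their own namespace
class MyAntennaExtension(SigMFExtension):
    model_name: Annotated[str, Field(serialization_alias="antenna:model")]
    gain: Annotated[Optional[float], Field(serialization_alias="antenna:gain")] = None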

777arc commented 5 months ago

I think the 3.6 requirement was due to Ubuntu 18 shipping with 3.6 and a lot of folks still being on 18 at the time, but a few years have passed and 18 EOLed last year anyway, so we could consider bumping the minimum Python to 3.7 or 3.8.

Pydantic is one of those things that optimizes performance, and if you are familiar with pydantic the code is way simpler and more readable, but for those who are new to pydantic it will take a higher level of effort to make tweaks and additions to the code, imo.

From an additional-dependency perspective I personally don't see anything wrong with adding pydantic to the list (which currently only includes numpy and jsonschema), although there will always be folks out there who like to keep it minimal for simplicity and security reasons.

Ultimately it's going to come down to how @Teque5 and @gmabey feel about Pydantic, because they have been the main long-term maintainers of this code. Even if you single-handedly refactored the entire codebase to use Pydantic with no regressions introduced, it would be a deal-breaker if it made the project harder for them to maintain, especially since we don't need Pydantic to do what we want sigmf-python to do.

gregparkes commented 5 months ago

Thanks for getting back, Marc. Really love the PySDR book, by the way; it's one of the few great online resources out there for understanding SDRs, I/Q, etc.

> I think the 3.6 requirement was due to Ubuntu 18 shipping with 3.6 and a lot of folks still being on 18 at the time, but a few years have passed and 18 EOLed last year anyway, so we could consider bumping the minimum Python to 3.7 or 3.8.

Glad to hear. Realistically it would be better if we bumped to 3.8, as I think Python 3.7 would force Pydantic V1, which had serious migration issues when moving to Pydantic V2.

> Pydantic is one of those things that optimizes performance, and if you are familiar with pydantic the code is way simpler and more readable, but for those who are new to pydantic it will take a higher level of effort to make tweaks and additions to the code, imo.

Agreed; however, I would argue it is good practice to follow model-view-controller design patterns, and Python dataclasses (and Pydantic) are the state-of-the-art way to manage hierarchical metadata. Maintaining the usefulness and power of the SigMF standard will require using the state-of-the-art tools that Python provides.

The burden of implementation would fall only on those maintaining the library; usage would be as simple as:

import sigmf

handle = sigmf.sigmffile.fromfile(my_file)
# metadata is a pydantic.BaseModel following the SigMF meta schema
index = handle.metadata.captures[0].sample_start

Within an IDE this is great: as a user you're able to find the properties you want through hints and suggestions, rather than needing to know them a priori via the dictionary constants currently implemented, e.g. handle.get_global_field(SAMPLE_RATE_KEY), or worse, knowing the schema standard off by heart: handle.get_global_field("core:sample_rate").

> From an additional-dependency perspective I personally don't see anything wrong with adding pydantic to the list (which currently only includes numpy and jsonschema), although there will always be folks out there who like to keep it minimal for simplicity and security reasons.

We could implement it in an optional way for now, requiring minimal changes to the current API; see the sketch below. For instance, adding a backend="pydantic" parameter to the sigmf.sigmffile.fromfile() method would be trivial. This could then expose a new property, i.e. metadata, which holds the BaseModel object if pydantic is used, and otherwise returns None or the full metadata dictionary as a dict.
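A hedged sketch of the proposed opt-in usage; the backend parameter, the metadata attribute, and the filename are all hypothetical and do not exist in the current API:

import sigmf

# hypothetical opt-in: request pydantic-validated metadata at load time
handle = sigmf.sigmffile.fromfile("recording.sigmf-meta", backend="pydantic")
if handle.metadata is not None:
    # typed, IDE-discoverable access instead of string keys
    print(handle.metadata.captures[0].sample_start)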

In terms of security, a large number of companies use Pydantic in their pipelines, so a security breach in Pydantic would affect considerably more than just SigMF users: https://docs.pydantic.dev/latest/why/#using-pydantic

> Ultimately it's going to come down to how @Teque5 and @gmabey feel about Pydantic, because they have been the main long-term maintainers of this code. Even if you single-handedly refactored the entire codebase to use Pydantic with no regressions introduced, it would be a deal-breaker if it made the project harder for them to maintain, especially since we don't need Pydantic to do what we want sigmf-python to do.

Agreed. I would anticipate that Pydantic would significantly reduce the footprint of the validation code, and Pydantic also works well with mypy out of the box, whereas statically type checking a nested dictionary is painful at best and useless at worst.

But I'll wait until we can either agree to skip this or come up with a basic API.

Teque5 commented 5 months ago

The choice to implement validation through jsonschema was due to a lack of options at the time of adoption. You've brought up some reasons why pydantic is better, which may all be true. If a pydantic implementation is possible I have no issue with it, but that will be a big PR. A perhaps insurmountable issue not mentioned here previously is that the specification in the main SigMF repository is now derived from the schema, so we would need to (1) ensure the website and PDF can still be dynamically generated from a different schema, and (2) ensure current use cases in this repository still work. It's possible I'm misunderstanding your proposal.

ps. We officially dropped support for 3.6 a while ago. Current testing supports 3.7 through 3.12, mostly to support those on isolated and/or slow-to-update distros like RHEL, Debian, and Scientific Linux. I can see making a newer version the minimum, but it would need strong justification.

gregparkes commented 5 months ago

OK great.

Issues

Issue (1): Shouldn't be a problem, as we won't delete the schema.json file for now. The PDF / website could in future be produced by a GitHub runner (loading in Pydantic etc.) that deploys web pages to your chosen URL on each commit to the main branch. Pydantic of course generates a JSON schema on command via BaseModel.model_json_schema().

Issue (2): I'd have to check the specific use cases in question, but Pydantic will automatically handle the following for you:

  1. Type checks (str, int, float)
  2. Specialized type checks, e.g. non-negative ints, catching pesky negatives in "sample_count" or "sample_start"
  3. Pattern matching, e.g. "version" or "datatype"
  4. File existence prior to object creation (e.g. core:dataset)
  5. Semantic checks, e.g. freq_upper_edge > freq_lower_edge, not currently supported (see the sketch after this list)
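A hedged sketch of items 1, 2 and 5 using Pydantic V2, on a hypothetical annotation model (the field names mirror the core annotation keys):

from typing import Optional

from pydantic import BaseModel, Field, model_validator

class SigMFAnnotation(BaseModel):
    # non-negative integer checks come for free with constrained fields
    sample_start: int = Field(ge=0)
    sample_count: Optional[int] = Field(default=None, ge=0)
    freq_lower_edge: Optional[float] = None
    freq_upper_edge: Optional[float] = None

    @model_validator(mode="after")
    def check_freq_edges(self):
        # semantic check across fields: the upper edge must exceed the lower edge
        if self.freq_lower_edge is not None and self.freq_upper_edge is not None:
            if self.freq_upper_edge <= self.freq_lower_edge:
                raise ValueError("freq_upper_edge must be greater than freq_lower_edge")
        return self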

Update

To address the primary concern of de-sync between the JSON schema and the Pydantic version, there is a workaround: the datamodel-code-generator PyPI library (https://docs.pydantic.dev/2.5/integrations/datamodel_code_generator/) provides a CLI that takes a jsonschema file as input and outputs a Python file of Pydantic models. I've just run it against schema-meta.json and it produces something that looks reasonable.
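A sketch of how a sync check could work in CI; the generated module name is hypothetical, and exact schema equality is unlikely, so a real check would compare only the parts that matter:

# models generated beforehand with the datamodel-code-generator CLI, roughly:
#   datamodel-codegen --input schema-meta.json --input-file-type jsonschema --output sigmf_models.py
import json

from sigmf_models import SigMFFile  # hypothetical generated module

with open("schema-meta.json") as f:
    upstream = json.load(f)

# regenerate a schema from the models and compare, e.g., the required keys
generated = SigMFFile.model_json_schema(by_alias=True)
assert set(upstream.get("required", [])) <= set(generated.get("required", []))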

Also good news: it looks like Pydantic V2.5 works on Python 3.7+, so there's no need to bump the minimum version.

The base is there; I just need to write some tests for it over the next couple of days (when I get space between work).

777arc commented 5 months ago

@gmabey, was wondering what your opinions are on pydantic