ioos / compliance-checker

Python tool to check your datasets against compliance standards
http://ioos.github.io/compliance-checker/
Apache License 2.0

More declarative approach to compliance checks #1081


benjwadams commented 4 months ago

Compliance Checker currently has a very imperative code style. For checks like those implementing CF, the conformance documents (e.g. https://cfconventions.org/Data/cf-documents/requirements-recommendations/conformance-1.11.html) already enumerate a discrete set of requirements to check, which lends itself to a more declarative code style.
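
To make "declarative" concrete, here is a rough sketch of mirroring the conformance document as data that a generic runner walks. The section IDs are paraphrased from the CF conformance document; everything else (the class, names, and predicates) is hypothetical rather than existing checker code:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Requirement:
    section: str         # section in the CF conformance document
    text: str            # the requirement, paraphrased from the document
    predicate: Callable  # returns True if the dataset satisfies it


REQUIREMENTS = [
    Requirement(
        "2.1",
        "The file must be a netCDF file",
        lambda ds: ds.file_format.startswith("NETCDF"),
    ),
    Requirement(
        "2.6.1",
        "The Conventions attribute should be present",
        lambda ds: "Conventions" in ds.ncattrs(),
    ),
]


def run_requirements(ds):
    # One generic loop replaces per-check plumbing; the section IDs make
    # coverage against the conformance document directly auditable.
    return {r.section: r.predicate(ds) for r in REQUIREMENTS}
```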

We have comments in the code indicating where a section implements a CF conformance check, such as in https://github.com/ioos/compliance-checker/blob/develop/compliance_checker/cf/cf_1_6.py#L1878-L1988, and likewise in the unit tests.

However, this isn't enforced anywhere, and we don't check against the conformance spec directly. Any suggestions for how we can improve the composability of the codebase, as well as its testability against the points in the conformance docs? I've been experimenting recently with pytest-bdd and think something similar, where each step is checked individually, would be a good fit (see the sketch below). However, certain steps depend on others, which BDD testing can accommodate through features with multiple scenarios.
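
As an illustration of the pytest-bdd direction, a minimal sketch of one conformance point, with Gherkin step ordering expressing the dependency between steps. The feature text, step names, and netCDF fixture are all invented for this example:

```python
# --- contents of cf_units.feature (separate file) ------------------------
# Feature: CF conformance, units
#   Scenario: Latitude coordinate has valid units
#     Given a dataset with a latitude coordinate variable
#     Then the latitude units are a recognized latitude unit
# --------------------------------------------------------------------------
import netCDF4
from pytest_bdd import given, scenarios, then

scenarios("cf_units.feature")


@given("a dataset with a latitude coordinate variable", target_fixture="lat_var")
def lat_var(tmp_path):
    # Build a throwaway dataset; in practice this would be a test fixture file
    ds = netCDF4.Dataset(str(tmp_path / "cf.nc"), "w")
    ds.createDimension("lat", 1)
    var = ds.createVariable("lat", "f8", ("lat",))
    var.units = "degrees_north"
    return var


@then("the latitude units are a recognized latitude unit")
def check_units(lat_var):
    assert getattr(lat_var, "units", None) in {
        "degrees_north", "degree_north", "degree_N", "degrees_N",
    }
```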

benjwadams commented 4 months ago

I think we should move towards a DAG approach, declaring each section of the conformance document as a separate step.

Primary libraries under consideration are Dask and Airflow.

Airflow seems more geared towards enterprise ETL/data science workflows.

I've used Dask in the past for some QARTOD runs, but I want to abstract away explicit declaration of the graph (sketched below).
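
For reference, a rough sketch of what explicit `dask.delayed` wiring looks like; this is the boilerplate an abstraction layer would hide. The check functions are illustrative, not the current checker API:

```python
import dask


@dask.delayed
def check_conventions(ds):
    # Upstream check: does the dataset declare a Conventions attribute?
    return "Conventions" in ds.ncattrs()


@dask.delayed
def check_standard_names(ds, conventions_ok):
    # Downstream check: passing the upstream result in as an argument is
    # what tells Dask about the edge in the graph.
    if not conventions_ok:
        return None  # skipped: dataset doesn't claim CF at all
    return all(
        hasattr(var, "standard_name") for var in ds.variables.values()
    )


def run_checks(ds):
    conventions_ok = check_conventions(ds)
    std_names_ok = check_standard_names(ds, conventions_ok)
    # Independent branches of the graph can execute in parallel
    return dask.compute(conventions_ok, std_names_ok)
```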

jcermauwedu commented 4 months ago

Arjan does a deep dive on Python decorators that might be a more upfront way to do what I think you described with pytest fixtures. See: https://www.youtube.com/watch?v=QH5fw9kxDQA

He goes pretty quickly, showing how to nest classes and then functions so that they run in the order you want. As the tests proceed, I don't know if there is a bookkeeping-type way to keep blocks from re-testing the same parts of a dataset with the same rules (one possible shape of that is sketched below). Now we're talking about something like computing the test coverage of a dataset against the specification, analogous to the code coverage of a software package?
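
One possible shape for that bookkeeping, combining a decorator registry with dependency ordering and a results cache so no section re-runs against the same dataset; all names here are hypothetical:

```python
REGISTRY = {}


def check(section, requires=()):
    """Register a check function under a conformance-section ID."""
    def decorator(func):
        REGISTRY[section] = (func, tuple(requires))
        return func
    return decorator


def run_all(ds):
    results = {}  # bookkeeping: each section runs at most once per dataset

    def run(section):
        if section not in results:
            func, requires = REGISTRY[section]
            # Run dependencies first (no cycle detection in this sketch)
            deps = {r: run(r) for r in requires}
            results[section] = func(ds, deps)
        return results[section]

    for section in REGISTRY:
        run(section)
    return results


@check("2.6.1")
def conventions_present(ds, deps):
    return "Conventions" in ds.ncattrs()


@check("3.1", requires=("2.6.1",))
def units_present(ds, deps):
    if not deps["2.6.1"]:
        return None  # skipped: no Conventions attribute to begin with
    return all(hasattr(var, "units") for var in ds.variables.values())
```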