feat: Validate dependencies.yaml using jsonschema

vyasr commented 1 year ago

This PR enables validating the contents of a dependencies.yaml file directly, before any processing is done. The schema is written in JSON Schema and checked with the Python jsonschema package. The new Python code is fairly minimal; it would be shorter still, but I used the object-oriented validator API to report every error in a file rather than the single error that jsonschema.validate raises. Most of the new lines are the schema definition. Validation is hooked into the normal CLI path so that a dependencies.yaml file is always validated before dependency files are generated. Developers therefore see a useful error explaining why their dependencies.yaml file is invalid, rather than an opaque runtime error when dfg fails to use the file.
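For readers unfamiliar with the two jsonschema entry points, here is a minimal sketch of the difference. The schema is a hypothetical, heavily reduced stand-in for the one in this PR, the choice of Draft7Validator is an assumption, and collect_errors and report are illustrative names rather than functions from the patch.

```python
import jsonschema

# Hypothetical, heavily reduced schema for illustration only.
SCHEMA = {
    "type": "object",
    "properties": {
        "files": {
            "type": "object",
            "patternProperties": {
                ".*": {
                    "type": "object",
                    "properties": {
                        "output": {"type": ["string", "array"]},
                        "includes": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["output", "includes"],
                }
            },
        },
        "channels": {"type": ["array", "string"]},
    },
}


def collect_errors(data, schema=SCHEMA):
    """Return every validation error rather than stopping at one."""
    # Assumption: Draft 7 is used here; the PR may target a different draft.
    validator = jsonschema.Draft7Validator(schema)
    return list(validator.iter_errors(data))


def report(errors):
    """Format errors as numbered entries (hypothetical helper)."""
    return "\n".join(
        f"Error #{i}:\n        {err.message}"
        for i, err in enumerate(errors, start=1)
    )


# Same two mistakes as the invalid example posted in this thread:
# no 'output' key, and 'channels' is neither a string nor a list.
data = {"files": {"all": {"includes": ["cudatoolkit"]}}, "channels": 1234}

errors = collect_errors(data)
print(report(errors))

# For comparison, jsonschema.validate raises a single ValidationError.
try:
    jsonschema.validate(data, SCHEMA)
    raised = None
except jsonschema.ValidationError as exc:
    raised = exc
```

With iter_errors both problems are reported in one run, whereas the validate call surfaces only one of them per invocation.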

vyasr commented 1 year ago

As an example, applying this patch:

--- a/tests/examples/no-specific-match/dependencies.yaml
+++ b/tests/examples/no-specific-match/dependencies.yaml
@@ -1,14 +1,11 @@
 files:
   all:
-    output: conda
     requirements_dir: output/actual
     matrix:
       cuda: ["11.8"]
     includes:
       - cudatoolkit
-channels:
-  - rapidsai
-  - conda-forge
+channels: 1234
 dependencies:
   cudatoolkit:
     specific:

and rerunning tests results in

------------------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------------------
Error #1:
        'output' is a required property

        Failed validating 'required' in schema['properties']['files']['patternProperties']['.*']:
            {'properties': {'conda_dir': {'type': 'string'},
                            'includes': {'items': {'type': 'string'},
                                         'type': 'array'},
                            'matrix': {'type': 'object'},
                            'output': {'if': {'type': 'array'},
                                       'then': {'items': {'type': 'string'}},
                                       'type': ['string', 'array']},
                            'requirements_dir': {'type': 'string'}},
             'required': ['output', 'includes'],
             'type': 'object'}

        On instance['files']['all']:
            {'includes': ['cudatoolkit'],
             'matrix': {'cuda': ['11.8']},
             'requirements_dir': 'output/actual'}
Error #2:
        1234 is not of type 'array', 'string'

        Failed validating 'type' in schema['properties']['channels']:
            {'if': {'type': 'array'},
             'then': {'items': {'type': 'string'}},
             'type': ['array', 'string']}

        On instance['channels']:
            1234
========================================================================================= short test summary info =========================================================================================
FAILED tests/test_examples.py::test_error_examples[no-specific-match] - RuntimeError: The provided dependencies data is invalid.
====================================================================================== 1 failed, 13 passed in 0.53s =======================================================================================
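The if/then fragment that shows up in the error output above is how the schema expresses "a string, or an array of strings". As a rough sketch, it can be exercised on its own like this. The fragment is copied from the output; the sample values and the use of Draft7Validator (the first draft that supports if/then) are assumptions.

```python
import jsonschema

# Fragment copied from the error output: a string, or an array of strings.
STRING_OR_STRINGS = {
    "type": ["array", "string"],
    "if": {"type": "array"},
    "then": {"items": {"type": "string"}},
}

# Assumption: Draft 7 is used here because it introduced if/then.
validator = jsonschema.Draft7Validator(STRING_OR_STRINGS)

# Hypothetical sample values covering each branch of the fragment.
results = {
    "bare string": validator.is_valid("rapidsai"),
    "list of strings": validator.is_valid(["rapidsai", "conda-forge"]),
    "integer": validator.is_valid(1234),
    "list with a non-string": validator.is_valid(["rapidsai", 5]),
}

for label, ok in results.items():
    print(f"{label}: {'valid' if ok else 'invalid'}")
```

The top-level type check rejects the integer, and the then branch only applies its items constraint once the if branch has established that the value is an array.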

vyasr commented 1 year ago

Just making some notes of my planned next steps here to help reviewers understand where I was going with this and identify any blind spots I may have:

I will move the schema out of the Python file into a JSON file so that it can be maintained separately.
I would like to implement some form of versioning for the schema. The easiest approach would be to tie its version to the package version.
I would like to consider removing support for values that may be either a string or an array of strings. I'm not convinced of the value-add, given the complexity it adds to parsing: we have to remember to call _ensure_list everywhere, and forgetting it can silently allow unexpected behavior, because Python's in operator does a substring search when the right operand is a string rather than a collection.
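To make the substring pitfall in the last point concrete, here is a small sketch. ensure_list is a hypothetical stand-in for the _ensure_list helper (its real implementation is not shown in this thread), and the values are made up for illustration.

```python
def ensure_list(value):
    """Wrap a bare string in a list; copy other iterables into a list.

    Hypothetical stand-in for the _ensure_list helper mentioned above.
    """
    if isinstance(value, str):
        return [value]
    return list(value)


# Hypothetical value written as a bare string instead of a list.
output = "requirements"

# Without normalization, `in` performs a substring search on the string.
substring_hit = "require" in output

# After normalization, `in` tests list membership, as intended.
membership_hit = "require" in ensure_list(output)
exact_hit = "requirements" in ensure_list(output)

print(substring_hit, membership_hit, exact_hit)
```

Only the normalized checks behave like membership tests; the unnormalized one accepts any fragment of the string, which is the silent failure mode described above.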

vyasr commented 1 year ago

Thanks to a makeover from @csadorf I think this PR is ready for review. @ajschmidt8 let me know what you think of it.

One minor note: I also snuck in an isort bugfix.

github-actions[bot] commented 1 year ago

:tada: This PR is included in version 1.1.0 :tada:

Your semantic-release bot :package::rocket:

rapidsai / dependency-file-generator

feat: Validate dependencies.yaml using jsonschema #29