Add validation guidelines

lukpueh commented 4 years ago

There seems to be agreement to discontinue the securesystemslib schema facility (see https://github.com/secure-systems-lab/securesystemslib/issues/183). We still need to be able to validate all inputs at the user boundary (type annotations should make this a lot easier), and provide tools to check if metadata is spec compliant (maybe we can use something like JSON schema?). At any rate, it would be helpful to provide contributors with guidelines for validation.

lukpueh commented 4 years ago

Using an in-toto style ValidationMixin might be a good way to validate metadata in memory (see the mixin and its usage for details).
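
For readers unfamiliar with the pattern, here is a minimal sketch of a mixin in that style. It is not a copy of the in-toto code: the `Root` class, its attributes and the checks are hypothetical and only illustrate the idea that `validate()` discovers and calls every `_validate_*` method.

```python
import inspect


class ValidationMixin:
    """Runs every method whose name starts with `_validate_`."""

    def validate(self):
        for name, method in inspect.getmembers(self, predicate=inspect.ismethod):
            if name.startswith("_validate_"):
                method()


class Root(ValidationMixin):
    # Hypothetical, heavily simplified metadata class (illustration only).
    def __init__(self, version=None, spec_version=None):
        self.version = version
        self.spec_version = spec_version

    def _validate_version(self):
        if not isinstance(self.version, int) or self.version < 1:
            raise ValueError(f"version must be a positive integer, got {self.version!r}")

    def _validate_spec_version(self):
        if not isinstance(self.spec_version, str):
            raise ValueError(f"spec_version must be a string, got {self.spec_version!r}")


root = Root()                # an empty object can be created first ...
root.version = 3             # ... filled in step by step ...
root.spec_version = "1.0.0"
root.validate()              # ... and validated only once it is complete
```

Nothing is checked until `validate()` is called, so an object can be in an invalid state in between. That is what makes the "create empty, assign, validate later" workflow possible, and also what a caller can forget to do.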

lukpueh commented 3 years ago

https://github.com/secure-systems-lab/code-style-guidelines/issues/18 has an interesting discussion about input validation, control flow and program consistency.

joshuagl commented 3 years ago

The approach I would take to this research project is:

Understand the current validation mechanism in use:

Review securesystemslib.schema, its purpose and its flaws.
Review in-toto's ValidationMixin, which validates metadata in memory utilising securesystemslib.schema (see example usage).

Review existing external/3rd-party solutions:

pydantic -- uses type annotations to provide data validation (runtime type hint validation) and settings management.
marshmallow -- uses schemas to provide simplified object serialisation, deserialisation, and input data validation.
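
To make "runtime type hint validation" concrete without adding a dependency, here is a stdlib-only sketch of the general idea. It is not pydantic's or marshmallow's API, the `Timestamp` class is a hypothetical stand-in, and it only handles plain classes as annotations (no generics such as `List[str]`).

```python
from dataclasses import dataclass, fields


@dataclass
class Timestamp:
    # Hypothetical, simplified stand-in for a metadata class.
    version: int
    spec_version: str

    def __post_init__(self):
        # Compare each value with its annotation right after construction.
        # Assumes annotations are plain classes, not strings or generics.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


Timestamp(version=1, spec_version="1.0.0")  # passes the type check
```

A dedicated library covers far more than this (nested models, coercion, error aggregation), which is the trade-off to weigh against the cost of vendoring it.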

Understand options for custom validation logic

Descriptors seem well suited for attribute validation. However, they may not allow for the currently supported pattern of initialising empty objects, assigning values, and later validating them (from https://github.com/theupdateframework/tuf/issues/1140#issuecomment-738108971).
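
As a rough illustration of both the appeal and the limitation, here is a minimal descriptor sketch. The `PositiveInt` descriptor and the `Snapshot` class are hypothetical names made up for this example.

```python
class PositiveInt:
    """Data descriptor that validates the value on every assignment."""

    def __set_name__(self, owner, name):
        # Store the real value under a private name on the instance.
        self._name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self._name)

    def __set__(self, instance, value):
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            raise ValueError(f"expected a positive integer, got {value!r}")
        setattr(instance, self._name, value)


class Snapshot:
    # Hypothetical, simplified metadata class.
    version = PositiveInt()

    def __init__(self, version):
        self.version = version  # goes through PositiveInt.__set__


snapshot = Snapshot(1)
snapshot.version = 2  # validated on assignment
```

Because the check runs in `__set__`, an invalid value can never be stored, but for the same reason `Snapshot(None)` raises immediately. An "empty now, validate later" object would need the descriptor to accept a placeholder value explicitly.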

For each of the three possible new approaches suggested above, I would expect some prototype code to be written to get a feel for how the approach fits with our new code. I'd be inclined to base the prototypes on #1279, if it has not already been merged by the time we get to experimenting with new approaches.

Goals: We want to be able to:

validate all inputs at the user boundary.
provide tools to check if metadata is specification compliant.

Outcomes:

Submit an ADR on input validation, summarising different approaches and making a recommendation for the project.

Considerations:

Are the stated goals appropriate/sufficient?
Which Python versions are supported by the various mechanisms explored?
pip is about to become a major user and has to vendor any dependencies we add. This may be a good argument for a custom implementation, or perhaps not if the transitive closure of dependencies to vendor is small.
Be aware of performance impact (see https://github.com/secure-systems-lab/securesystemslib/issues/183#issuecomment-596990893).
Other third-party solutions exist, e.g. desert and typical.

Next steps:

Input validation for simple metadata API (https://github.com/theupdateframework/tuf/issues/1140).

See also the related issue on input validation for the metadata API: https://github.com/theupdateframework/tuf/issues/1140

Other possibly useful references:

secure-systems-lab/code-style-guidelines#18 has a discussion about input validation, control flow and program consistency.
blog post about instance attribute validation techniques including the in-toto approach and Descriptors
The section of the Hypermodern Python guide on typing discusses data validation with Desert and Marshmallow

MVrachev commented 3 years ago

The initial version of the ADR addressing this issue is out: #1301. It contains only two options for now: ValidationMixin and pydantic.

MVrachev commented 3 years ago

Update on what has happened so far:

I documented multiple validation options in #1301.
It was decided that I would create validation functions for each of the attributes before deciding which validation option from ADR 7 suits us.
Created #1366 to showcase how validation functions are supposed to work with Python descriptors. After Jussi's comment here, we decided it's best to research how we use the metadata attributes and think about what might go wrong with them.
I started creating issues for each of the metadata attributes in the format Metadata Attribute research - * where * is the name of the attribute. For example #1419, #1420.
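
For context, combining per-attribute validation functions with a descriptor could look roughly like the sketch below. This is not the code from #1366: the `Validated` descriptor, the `Signed` class shown here and the rules inside the functions are hypothetical and simplified.

```python
def validate_version(value):
    # Illustrative rule only: a positive integer.
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError(f"version must be a positive integer, got {value!r}")


def validate_spec_version(value):
    # Illustrative rule only: dot-separated numeric parts such as "1.0.0".
    if not isinstance(value, str) or not all(p.isdigit() for p in value.split(".")):
        raise ValueError(f"spec_version must look like '1.0.0', got {value!r}")


class Validated:
    """Generic descriptor that delegates the check to a plain function."""

    def __init__(self, validator):
        self._validator = validator

    def __set_name__(self, owner, name):
        self._name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self._name)

    def __set__(self, instance, value):
        self._validator(value)  # raises ValueError on bad input
        setattr(instance, self._name, value)


class Signed:
    # Hypothetical, simplified class for illustration.
    version = Validated(validate_version)
    spec_version = Validated(validate_spec_version)

    def __init__(self, version, spec_version):
        self.version = version
        self.spec_version = spec_version


signed = Signed(1, "1.0.0")
```

Keeping the rules in standalone functions means they can be reused and tested independently of whichever mechanism (descriptor, mixin, or third-party library) ends up calling them.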

I will unassign myself from this issue for now because I am not actively working on the validation guidelines ADR. Before that work can continue, it's important to understand how we want to operate on and store all of the metadata attributes, provide validation functions for them, and decide whether to use one of the validation options from ADR 7 or something entirely new.

MVrachev commented 2 years ago

@lukpueh and I have discussed this and concluded that a formal ADR about validation guidelines seems like too much work, and we are not sure we need it, as we have already implemented validation for all Metadata classes (see https://github.com/theupdateframework/python-tuf/issues/1140#issuecomment-971588922).

Even without an ADR, it makes sense to provide some guidance on how the maintainers feel about validation, which validation options were discussed, and what requirements should be taken into account when adding validation to python-tuf. The best place to answer those questions seems to be a blog post published on https://theupdateframework.github.io/python-tuf/, and @lukpueh and I agree that this is the logical step that will close this issue.

theupdateframework / python-tuf

Add validation guidelines #1130