pr-omethe-us / PyKED

Python interface to the ChemKED database format
https://pr-omethe-us.github.io/PyKED/
BSD 3-Clause "New" or "Revised" License

Where should validation be done, in the Validator or in the construction of ChemKED/DataPoint instances? #103

Open bryanwweber opened 6 years ago

bryanwweber commented 6 years ago

I'm working on adding the time-histories feature now, and I've realized that we do validation all over the place. For instance, we check that the units are appropriate in the Validator class:

    def _validate_isvalid_unit(self, isvalid_unit, field, value):
        """Checks for appropriate units using Pint unit registry.
        Args:
            isvalid_unit (`bool`): flag from schema indicating units to be checked.
            field (`str`): property associated with units in question.
            value (`dict`): dictionary of values from file associated with this property.
        The rule's arguments are validated against this schema:
            {'isvalid_unit': {'type': 'bool'}, 'field': {'type': 'str'},
             'value': {'type': 'dict'}}
        """
        quantity = 1.0 * units(value['units'])
        try:
            quantity.to(property_units[field])
        except pint.DimensionalityError:
            self._error(field, 'incompatible units; should be consistent '
                        'with ' + property_units[field]
                        )

but we check that all the uncertainty fields are present in the DataPoint class:

    raise ValueError('Either "uncertainty" or "upper-uncertainty" and '
                     '"lower-uncertainty" need to be specified.')

Should we try to do all of the validation in the Validator class and assume that input is well-formed by the time it reaches, e.g., the DataPoint class? Or should we keep this (somewhat duplicated) validation?
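To make the first option concrete, here is a sketch of the uncertainty requirement expressed as a custom Cerberus rule on the Validator instead of a constructor check (the rule name `isvalid_uncertainty` and its body are assumptions for illustration, not existing PyKED code):

    from cerberus import Validator

    class MyValidator(Validator):
        def _validate_isvalid_uncertainty(self, isvalid_uncertainty, field, value):
            """Checks that either a symmetric or a full asymmetric
            uncertainty is specified for this quantity.
            """
            has_symmetric = 'uncertainty' in value
            has_asymmetric = ('upper-uncertainty' in value and
                              'lower-uncertainty' in value)
            if isvalid_uncertainty and not (has_symmetric or has_asymmetric):
                self._error(field, 'Either "uncertainty" or "upper-uncertainty" '
                            'and "lower-uncertainty" need to be specified.')

A schema entry could then declare the rule (e.g., `'temperature': {'isvalid_uncertainty': True, ...}`), keeping the error message next to the rule that produces it.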

Relatedly, we also do a bunch of checks for, e.g., allowed values and so forth, which are really tests of Cerberus itself. Should we keep these, or rely on Cerberus to be correct?
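For reference, here is a minimal standalone example of the kind of rule Cerberus already enforces on its own (the schema fragment is illustrative and simplified, not the actual ChemKED schema):

    from cerberus import Validator

    # Cerberus enforces 'allowed' itself; a test that a disallowed value
    # fails validation is really exercising Cerberus, not our schema.
    schema = {'experiment-type': {'type': 'string',
                                  'allowed': ['ignition delay']}}
    assert not Validator(schema).validate({'experiment-type': 'bogus'})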

I'm wondering about this from a few perspectives. One is that as the test suite gets bigger, it's taking longer to run (I'm up over a minute and a half right now). Another is that scattering these checks all over the place makes it harder to reason about what the code is doing. Finally, these kinds of checks "require" small tests to hit the other half of the branch (to keep 100% coverage) that don't really add much information... all they show is that we can write a test to exercise the other half of the branch, not whether actual user behavior is ever going to trigger it.

I tagged this discussion because I think we should let this simmer for a while and think/talk about what to do, if anything 😄

bryanwweber commented 6 years ago

One thought about the tests that sort of duplicate Cerberus functionality, such as testing that a missing required field fails validation: these are a bit of a check on us, confirming that the fields we expect to be required are in fact required (i.e., that we didn't screw up the schema), so there may be some value in keeping them.
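A minimal version of such a schema sanity check might look like this sketch (a toy schema stands in for the real ChemKED schema, and the required field name is just an example):

    from cerberus import Validator

    def test_missing_required_field_fails():
        # Toy stand-in for the real schema: one required field.
        schema = {'file-authors': {'type': 'list', 'required': True}}
        # An empty document must fail if 'required' is actually set.
        assert not Validator(schema).validate({})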

bryanwweber commented 6 years ago

After discussion, we decided to split most of the validation out into separate functions, especially for domain-specific validations (correct units, mole fractions summing to 1.0, etc.), leaving only the basic file-format validation in Cerberus. This will also make it easier to produce sensible error messages.
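As a sketch of what one of those separate functions might look like (the function name, signature, and tolerance are assumptions about the eventual implementation, not the actual code):

    def check_mole_fraction_sum(composition, tolerance=1e-8):
        """Hypothetical standalone domain check: mole fractions sum to 1."""
        total = sum(species['amount'] for species in composition)
        if abs(total - 1.0) > tolerance:
            raise ValueError('mole fractions do not sum to 1.0; '
                             'total = {:.8f}'.format(total))

Keeping checks like this in their own functions also means each one can raise a message specific to the rule that failed, which is the "sensible error messages" benefit mentioned above.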