How to report errors for container fields?

lafrech commented 6 years ago

Generally, a field name is associated a list of errors as strings. The exception to this is container fields. Those may return errors for the container and the contained field(s). In this case, rather than a list, a dict is used where the keys are the field names or the list index (as string or as integer):

errors = {
    '_schema': [...],
    'field': [...],
    'nested': {
        # What about errors in the Nested field itself?
        '_schema': [...],
        'field': [...],
    },
    'nested_many': {
        # What about errors in the Nested field itself?
        '0': {
            '_schema': [...],
            'field': [...],
        },
    },
    'list': {
        # What about errors in the List field itself?
        '0': [...],
    }
    'list_nested': {
        # What about errors in the List field itself?
        '0': {
            # What about errors in the Nested field itself?
            '_schema': [...],
            'field': [...],
        },
    },
    'dict': {
        # What about errors in the Dict field itself?
        '0': {
            'key': [...],
            'value': [...],
        }
    }
    'dict_nested': {
        # What about errors in the Dict field itself?
        '0': {
            'key': [...],
            'value': {
                # What about errors in the Nested field itself?
                '_schema': [...],
                'field': [...],
            },
        }
    }
}

This has been working relatively fine for a while. It leaves a hole as you can't return errors relative to the Nested field itself, for instance, while returning errors for the fields in the nested schema.

It is generally not blocking, as the most frequent Nested field level error is the missing error, and when the field is missing, there is generally no nested error, so we return

    nested: ['Missing data for required field.']

It is a shame that the structure (dict or list) depends on the error, but we could live with this.

Except we sometimes need to return errors for both the field and the nested schema, so we came up with this (https://github.com/marshmallow-code/marshmallow/pull/299):

    'nested': {
        '_field': [...],
        '_schema': [...],
        'field': [...],
    },

where _field stores errors about the field itself.

This is inconsistent, as I said in https://github.com/marshmallow-code/marshmallow/pull/299#issuecomment-401607834 and showed in https://github.com/marshmallow-code/marshmallow/pull/863 because the errors on the nested field itself may be reported in different places depending on whether there are errors in the nested schema. In other words, a dict with a _field attribute is used only if there are also errors in the nested schema, otherwise, the errors are returned as a strings, like for any field.

The same considerations apply to all container fields (List, Dict).

I can't think of an ideal solution to allow reporting errors from both the container field and its nested schema/fields.

We'd like to keep the use of dict for containers because it makes browsing the error structure convenient:

    assert 'Some error.' in errors['nested']['field']

This rules out putting nested errors under a dedicated _nested key.

On the other hand, putting errors on fields themselves in a dedicated _field key for all fields seems a bit overkill as well.

A trade-off would be to use the _field key for fields errors for all container fields and only for them, but do it consistently.

I suppose that was the intent of https://github.com/marshmallow-code/marshmallow/pull/299.

errors = {
    '_schema': [...],
    'field': ['Missing data for required field.',...],
    'nested': {
        '_field': ['Missing data for required field.', 'Other error from schema level validator.'],
    },
}

This introduces a distinction between containers and other fields.

It looks bad in the usual case where fields are missing, as shown above, because containers have their "missing data" message embedded in a dict.

Also, the implementation might be challenging: how do you know a field is a container when registering the error?

I'd like to keep it simple, but I'm out of ideas, so I'm appealing to collective wisdom.

I have the feeling that a good decision on this would avoid many corner case issues, so this should be set before piling-up patches.

lafrech commented 6 years ago

[Let's strip this from OP, as it is long enough already.]

BTW, note that there can be errors for a missing field if a schema-level validator (with skip_on_field_errors=False) raises errors and blames them on the nested field, for instance:

     raise ValidationError('Some error', 'nested')

so this is possible:

    {'nested': ['Missing data for required field.', 'Other error from schema level validator.']}

Worse, if nested fields could be reached using dot notation (I don't think it is possible right now), you could even have errors for fields deeply nested in a missing nested field:

    raise ValidationError('Some error', 'nested.nested_again.field')
    errors = {
        'nested': {
            '_field': ['Missing data for required field.'],
            'nested_again': {
                'field': ['Some error']
            }
        }

This sounds wrong so it's probably better not to support this.

lafrech commented 6 years ago

@sloria, @deckar01, any ideas?

lafrech commented 5 years ago

I've been thinking about this and I might have come up with something.

schema and field are more or less marshmallow-specific. They don't necessarily make sense from client perspective.

Also, I don't think there needs to be a difference between errors in a nested field (e.g. 'Missing data for required field' and errors in a nested field's schema ('Invalid data type'). So nested _schema errors can be merged with errors about the Nested field.

My proposal is to keep an error dict structure to follow the hierarchy of the Schema, and for each field with errors, store errors list under a special _errors key.

This is what I excluded at first when I wrote

putting errors on fields themselves in a dedicated _field key for all fields seems a bit overkill as well.

but I see no alternative, and thinking again, it makes things more consistent: every value is a subdict, unless the key is _errors, in which case it is a list of errors, so the dict is easier to parse as the type of each value is well defined.

Here's what it would look like:

errors = {
    '_errors': [...],  # Schema errors
    'field': {'_errors': [...]},  # Errors about 'field' field
    'nested': {
        '_errors': [...],  # Errors about 'nested' field or nested schema errors
        'field': {'_errors': [...]},  # Errors about 'nested.field' field
    },
    'nested_many': {
        '_errors': [...],
        '0': {
            '_errors': [...],
            'field': {'errors': [...]},
        },
    },
    'list': {
        '_errors': [...],
        '0': {'_errors': [...]},
    }
    'list_nested': {
        '_errors': [...],
        '0': {
            '_errors': [...],
            'field': {'_errors': [...]},
        },
    },
    'dict': {
        '_errors': [...],
        '0': {
            'key': {'_errors': [...]},
            'value': {'_errors': [...]},
        }
    }
    'dict_nested': {
        '_errors': [...],
        '0': {
            'key': {'_errors': [...]},
            'value': {
                '_errors': [...],
                'field': {'_errors': [...]},
            },
        }
    }
}

I didn't assess the implementation difficulty.

Thoughts anyone?

lafrech commented 5 years ago

Separating _schema and _field is an independent issue. I think we should merge them but it may not matter so much. Let's assume we do.

Always use `_errors` for fields returning error `dict`s

To be consistent, errors in a nested field should always be a dict, whether or not there are errors in fields in the nested schema. E.g.:

    'nested': {
        '_errors': [...],  # Errors about 'nested' field or nested schema errors
        'field': ...,  # Errors about 'nested.field' field
    },

It is the mixed approach in #299, and it has two issues:

Inconsistency: Some fields have a _errors key, most don't. We could live with that.

    'field_1': ['Error', 'Another error'],
    'field_2': ['Error', 'Another error'],
    'nested': {
        '_errors': ['Error', 'Another error'],
        'field': ...,  # Errors about 'nested.field' field
    },

Implementation: When store_error is called, the type of the field is unknown, the only available informations are the type of the error messages to add (dict or list) and the type of already existing error messages if any. There is no way to ensure nested fields always have their errors under an _errors key.

(Note that this apply to other fields such as List, Dict and custom fields returning dict errors and that this category of fields is only defined by its return value. Any custom field could return a dict, so relying on the field type and matching it with a hardcoded list would not be sensible, anyway. Unless explicitly forbidden, there may even be custom fields returning different types of values depending on the errors.)

I don't see how to have this working correctly. We could hack store_error to always check already existing error messages for the field and ensure no data is lost (ensure a dict won't erase a list or conversely) and catch type errors (trying to append a list to a dict), but I don't think we can fix the "nested fields generate sometimes error lists, sometimes error dicts" issue.

I think the "_errors everywhere" approach is the only approach that is safe against this.

Use `_errors` for all fields

This is why I suggested to use _errors in every field:

    'field_1': {'_errors': ['Error', 'Another error']},
    'field_2': {'_errors': ['Error', 'Another error']},
    'nested': {
        '_errors': ['Error', 'Another error'],
        'field': ...,  # Errors about 'nested.field' field
    },

The fact that is it more verbose may not be an issue. Those messages are meant to be read by software. Human readable errors would be even better, but I think it is secondary. This is debatable.

While implementing this, I realized it is a bit invasive when it comes to "structural" fields returning errors dicts, such as List or Dict, because the fields themselves need to be modified to use that _errors key.

If we go this way despite the ugliness and verbosity of it, we'll have to either modify the way those structural fields return their errors or accept a little inconsistency in the errors inside those fields.

I pushed a branch to show examples (https://github.com/marshmallow-code/marshmallow/tree/dev_879_report_error)

First commit without changes in List/Dict. Little inconsistency as _errors does not appear in their structure.

Second commit with changes in List/Dict to ensure consistent use of _errors.

I added a test to produce an output more or less equivalent to the examples above, in which I copy-pasted the pretty-print of the generated errors dict.

Modifying those fields is not necessarily an issue. I wish all that logic could stay in the Marshaller/Unmarshaller, but maybe that's just the way it is.

lafrech commented 5 years ago

Some solve this by flattening the error structure, but I don't like it so much.

https://medium.com/engineering-brainly/structuring-validation-errors-in-rest-apis-40c15fbb7bc3

lafrech commented 5 years ago

Apparently, flask-restplus and Django use the same kind of dict structure as marshmallow.

Anyone has any experience with those and know how they deal with this?

lafrech commented 5 years ago

More thoughts and a different approach.

I should back this up with experimentation and a deeper analysis, but I have the feeling that the whole issue can be solved by

merging _schema/_field as proposed above
dropping skip_on_field_error=False possibility, in other words always skip schema validation if there are errors in the fields

I think the latter would mean that a container field could only have errors in the nested fields (thus a dict error structure without _schema or _field) or in itself (thus a list error structure).

When parsing the error structure, the client would expect either a dict (subfield errors) or a list (field errors), which I find reasonable.

Obviously, this would be a breaking change as it would not be possible to run schema validators in case of field errors, but is that really an issue? Would it be that bad? From the discussions in https://github.com/marshmallow-code/marshmallow/issues/323, it looks like this behaviour is surprising and not desired and the skip_on_field_error flag was introduced to remove it in a non-breaking way. Perhaps removing it completely wouldn't be problematic. At least, in the discussion, nobody stood up to defend it.

I hope I didn't miss another corner case. If all of this is correct, then we have a solution that is much simpler and much less breaking than what I exposed in the comments above. And an error structure that is close to django, flask-restplus and what marshmallow users are used to.

sloria commented 5 years ago

Thanks @lafrech for your analysis. I'll have a closer look at this thread soon. In the meantime, it might be worth evaluating how other libraries handle this. See the ones that I looked at for the unknown feature: https://gist.github.com/sloria/2fac357710f13e855a5b9cf8f05a4da0

lafrech commented 5 years ago

I tried flask-restplus quickly but I didn't succeed in reproducing the example in the docs.

I didn't try any of the libs in your list. Here's what I gathered from a quick look in the docs.

Cerberus

http://docs.python-cerberus.org/en/stable/errors.html

The error structure is a dict-like structure corresponding to the nested structure of the schema. Errors are stored in the errors attribute of each node. This solves the "dict of list ?" issue: each node can hold a subdict and a list. But IIUC, this means the error structure is not a simple type. I don't see how that would serialize in json to be returned by an API, for instance.

Voluptuous

https://github.com/alecthomas/voluptuous#multi-field-validation

Apparently, the recommended method won't apply schema validators in case of field errors, like I suggested above.

lafrech commented 5 years ago

/me wrote in https://github.com/marshmallow-code/marshmallow/issues/879#issuecomment-430229952:

I think the latter would mean that a container field could only have errors in the nested fields (thus a dict error structure without _schema or _field) or in itself (thus a list error structure).

Almost but not quite.

Indeed field validators are skipped if fields fail deserialization and schema validators would be skipped as well in case of error if we dropped skip_on_field_error=False. But there could still be several field validators or several schema validators reporting multiple errors on the same field.

What makes things complicated is the fact that validation errors may report more or less any structure: dict or list. So when several errors are reported on a field, you get to merge different error structures and here the troubles (and the inconsistencies: https://github.com/marshmallow-code/marshmallow/pull/299#issuecomment-401607834) begin.

If only one error can be reported, we don't care about the structure, there is no need to merge list with dicts, and there is no need for a special _field key. That's why I hoped we could simplify things to be sure all field errors for a given field are reported at once. It doesn't seem to be the case.

We can live with the inconsistency, but things are worse: if an error is reported as dict, with current code, it overwrites the former error messages. I didn't take the time to check that yet but perhaps the merge function proposed in https://github.com/marshmallow-code/marshmallow/pull/442 would avoid this.

I can't comment about the fix proposed in #442 but I agree with the OP in #441. We could be more strict about what an error message should be, and impose that if a dict, it is a hierarchical structure (either subfields or equivalently mergeable subelements like in List or Dict).

lafrech commented 5 years ago

But there could still be several field validators or several schema validators reporting multiple errors on the same field.

Well... almost, but not exactly.

Currently, @validates can only be called once per field (see https://github.com/marshmallow-code/marshmallow/issues/653). But this is a limitation in the decoration feature that we might want to overcome one day. And "several schema validators reporting multiple errors on the same field" is possible.

I've been spending hours on this and I don't see any way to fix the "inconsistency" (field returning errors either as list or as dict, same error being sometimes in the list, sometimes in the dict (under the _field or _schema key), depending on other errors).

However, I think #1026 is a great step towards a clearer interface and it ensures we're not losing error messages in the process.

I believe this issue can be postponed to after 3.0. Most probably 4.0 as changing this would be a breaking change.

marshmallow-code / marshmallow