Using webargs to parse more complex query arguments (string or comparison operators,...)

lafrech commented 8 years ago

I would like my API to offer basic query features, like not only get item by ID but also using operators on item attributes:

GET /resource/?name=chuck&surname__contains=orris&age__lt=69

Proof of concept

Here is what I've done so far. I did not modify webargs code. I'm only subclassing in my own application.

I've been searching around and didn't find a universally accepted norm specifying such a query language. In my implementation, I'm using the double underscore syntax and a subset of operators from MongoEngine.

Basically, I have two families of operators, some operating on numbers, others on string.

NUMBER_OPERATORS = ('ne', 'gt', 'gte', 'lt', 'lte')

STRING_OPERATORS = (
    'contains', 'icontains', 'startswith', 'istartswith',
    'endswith', 'iendswith', 'iexact'
)

QUERY_OPERATORS = {
    'number': NUMBER_OPERATORS,
    'string': STRING_OPERATORS,
}

Those lists could probably be extended. Operators meanings should be easy to grap. Examples:

age__lt=69 means I expect the API to return records with attribute age lower than 69
surname__contains=orris means attribute age contains string "orris"

To let webargs parse such parameters, I need to modify the Schema so that for chosen fields, the Marshmallow field is duplicated (deepcopied) into needed variants. For instance, the field age is duplicated into age__lt, age__gt,...

This is an opt-in feature. For each field I want to expose this way, I need to specify in Meta which operators category it should use (currently only two categories: number and string). Auto-detection is complicated as I don't know how I would handle custom fields, and there may be other categories some day.

In the Schema, I add:

    class Meta:
        fields_filters = {
            'name': ('string',),
            'surname': ('string',),
            'age': ('number',),
        }

For each field, I'm passing a list of categories as one could imagine several categories applying to a field, but currently, I have no example of field that would use both number and string operators.

And the "magic" takes place here:

class SchemaOpts(ma.SchemaOpts):
    def __init__(self, meta):
        super(SchemaOpts, self).__init__(meta)
        # Add a new meta field to pass the list of filters
        self.fields_filters = getattr(meta, 'fields_filters', None)

class SchemaMeta(ma.schema.SchemaMeta):
    """Metaclass for `ModelSchema`."""

    @classmethod
    def get_declared_fields(mcs, klass, *args, **kwargs):

        # Create empty dict using provided dict_class
        declared_fields = kwargs.get('dict_class', dict)()

        # Add base fields
        base_fields = super(SchemaMeta, mcs).get_declared_fields(
            klass, *args, **kwargs
        )
        declared_fields.update(base_fields)

        # Get allowed filters from Meta and create filters
        opts = klass.opts
        fields_filters = getattr(opts, 'fields_filters', None)

        if fields_filters:
            filter_fields = {}
            for field_name, field_filters in fields_filters.items():
                field = base_fields.get(field_name, None)
                if field:
                    for filter_category in field_filters:
                        for operator in QUERY_OPERATORS.get(
                                filter_category, ()):
                            filter_fields[
                                '{}__{}'.format(field_name, operator)
                            ] = deepcopy(field)
            declared_fields.update(filter_fields)

        return declared_fields

class QueryArgsSchema(ma.compat.with_metaclass(SchemaMeta, ma.Schema)):
    OPTIONS_CLASS = SchemaOpts

And finally, I use the Schema to parse the query arguments:

    @use_args(ObjectSchema)
    def get(self, args):
        ...

Questions

This raises a few points.

Is there some sort of convention I missed when searching for a query language?
Is this out-of-scope for webargs or could it be a useful enhancement?
Is there no need for that? I've been investigating both flask-retful and marshmallow/webargs/... ecosystems, along with @frol's invaluable flask-restplus-server-example and saw nothing close to this, so I'm thinking maybe people just don't do that. Or maybe they only expose a few filters, and they do it in specific routes.
Should this be in @touilleMan's marshmallow-mongoengine? I do use this library, but although the query language is inspired^Wshamelessly copied from MongoEngine (which allows me to pass the query arguments straight into the QuerySet filters...), the whole thing has no dependency on MongoEngine and this should be a generic feature.
My QueryArgsSchema also has sort and pagination fields, but I didn't expose them here as this needs nothing fancy on Marshmallow's side. Maybe a real "query parameters" feature would integrate these as well.

Feedback greatly appreciated. It seems to work right now, but on the long run, I might discover it was poorly designed from the start.

Thanks.

frol commented 8 years ago

@lafrech Thank you for this interesting idea!

Here are my thoughts on this matter:

While suffixes are fine from URL point of view, they would look bloated in Swagger config since you will have every parameter expanded into 6 or 8 parameters (depends on the parameter type: number or string). This is why I probably won't implement querying this way.
What are the options? I haven't thought on this long enough, but since I wouldn't expand parameters, there are at least two ways to go:
1. transform suffixes in parameters into masks and operators in values, e.g.:
```
?name=chuck&surname=*orris*&age=<<69
```
  =<< is ambiguous less-not-equal operator, yet it is a solution.
2. transform suffixes into functions in values, e.g.:
```
?name=chuck&surname=contains(orris)&age=less_than(69)
```
Personally, I haven't yet faced a need to support arbitrary filtering and currently I just use dedicated parameters to achieve similar behavior:
```
?name=chuck&surname=orris&min_age=69
```
My backend implementation just treats name and surname as exact matching first and as you would with __contains next, and min_age + max_age can cover all your suffixed cases except __ne, but __ne is not relevant everywhere either.

Does this make sense?

lafrech commented 8 years ago

Thanks @frol for the feedback. I'll be investigating this next week.

Just a quick note to add that on second thought, I should probably use a custom parser.

sloria commented 7 years ago

Good discussion here. I'm going to close this for now, though, since API filtering is outside the scope of webargs.

lafrech commented 7 years ago

Alright. For the record, for our application, we ended up using custom "search" URL with a query field in which the client is supposed to enter a query (in python form, PQL translates python to MongoDB). This is not the most satisfying solution. It is totally linked to the database behind the API, or at least the programming language. But it does the job for now and we lack time to aim for ideal.

antgel commented 5 years ago

@sloria I'm coming across a similar design conundrum in my app. If API filtering is outside the scope of webargs (and I don't understand why given that the filter arguments are... well, web arguments), what scope is it in? App is flask(-classful) / webargs / marshmallow / sqlalchemy.

lafrech commented 5 years ago

Perhaps in webargs through a custom parser but not webargs core.

sloria commented 5 years ago

Yes, to clarify my above statement: webargs may be used to parse API filter arguments, but it is agnostic re: filtering syntax (age__lt, age[lt], etc). As Jérôme pointed out, a custom parser should be used to handle your app's specific syntax.

antgel commented 5 years ago

Thanks for chipping in - it's useful. I'll continue here unless you think another place is best for the discussion. I've been reading the marshmallow docs on custom fields and extending schemas, and I still have questions.

Let's assume I have a pizza delivery endpoint /deliveries/ with query string parameters status / start_time, and I want to add "special" parameters such as status__in and start_time__lt so I can do things like GET /deliveries/?status__in=new,in_oven.

I'm passing a ModelSchema to use_kwargs. In the past (before these complex query parameters became necessary), I had view code like:

deliveries_schema = DeliverySchema(many=True)

    # I use .fields because ModelSchema doesn't deserialize to a dict 
    @use_kwargs(deliveries_schema.fields)
    def index(self, **kwargs):
        deliveries = Delivery.query.filter_by(**kwargs)
        return deliveries_schema.jsonify(deliveries)

This worked, but it won't work any more. Not surprising, as it's a brittle design that naively assumes that I want to pass all query string parameters to filter_by as is. I need a way to pass the "normal" arguments to filter_by, and then parse the "special" arguments and send them to sqlalchemy.

I don't think a custom parser is the whole solution, it seems I also need to make my schema and view aware of these "special" parameters? Query string parameters e.g. status that were known in the ModelSchema are handled, but special parameters e.g. status__in are ignored. For example, GET /deliveries/?status__in=new,in_oven doesn't work.

The custom parser docs say "To add your own parser, extend Parser and implement the parse_* method(s) you need to override", but I don't see how that helps if parse_querystring ignores status__in completely.

Also, I don't wish to explicitly add each special parameter, as it would quickly get unmaintainable. I'm looking for magic that recognizes the __whatever suffixes and handles them accordingly.

I used https://github.com/marshmallow-code/marshmallow/issues/29 as inspiration, but option 1 (additional fields) isn't an option because I don't know what special query parameters I'll receive at compile. I tried using option 2 (custom data handler) which seems to have been replaced by post_load.

But in the above case, post_load doesn't run, yet post_dump does. Even weirder, if I replace the above index function with one that uses use_args such as:

    @use_args(DeliverySchema())
    def index(self, args):
        deliveries = Delivery.query.filter_by(args) # This won't work anyway, but worry about that later
        return deliveries_schema.jsonify(deliveries)

both post_load and post_dump run.

I don't understand why this happens, but in the top example, if I replace @use_kwargs(deliveries_schema.fields) with @use_kwargs(DeliverySchema()), then post_load does run (unfortunately followed by a crash and burn because ModelSchema doesn't deserialize into a dict so I can't pass it to use_kwargs).

I guess general questions are in order at this point. What should the architecture look like to make *__in-style query parameters "work" and pass other parameters to sqlalchemy for filtering? Does any of the above make sense?

I'm in a bit of a rabbit hole and would appreciate any support to get out. If I've gone completely round the houses, that's fine, I probably learnt something on the way, but I'm pretty stuck. :)

lafrech commented 5 years ago

Hi.

unless you think another place is best for the discussion

No pb. Here is fine.

Unless I'm missing something, what you're seeing is normal. post_load is called after the Schema loads the inputs data in webargs, and post_dump after the Shema dumps the return data in the view (I suppose jsonify is a method you added to do dump + jsonify ?).

When doing @use_kwargs(deliveries_schema.fields), you don't use the Schema, only its fields, so it's no surprise post_load is not called.

This worked, but it won't work any more. Not surprising, as it's a brittle design that naively assumes that I want to pass all query string parameters to filter_by as is.

Yes.

When doing REST, in the general case, I use the model Schema (DeliverySchema) to load POST/PUT request payload and dump GET/POST/PUT response payload (location = json/body). And I use a specific handmade schema (DeliveryQueryArgsSchema) to parse GET arguments (location = querystring).

In this handmade Schema, I add manually the fields for the parameters I want to use. It smells like duplication, as if I want to allow filtering by any field, I need to add all the fields manually, but in practice I came to realize that I never wanted that. For instance, sorting by string fields such as name, address or so was not really realistic. IDs are meant to be used as filters. Names could be used by complex pattern searching features for partial string matching, but as plain filters, they kinda suck as the user can't be expected to enter the real name with no typo, the proper case and accents, etc.), otherwise there'd be no need for an ID (this is totally peremptory and arguable, MongoDB typically uses plain strings for ID in their tutorials, but you get the point). Floats are a better example. No point filtering by float since float equality is almost impossible technically and makes no sense in real life. You'd rather use a gt/lt interval

In a project, we also crafted a sort field that deserializes inputs such as name,-age into a query sorting by name and reversed age.

There could be ways to avoid doing everything manually. You could create a function that would generate those DeliveryQueryArgsSchema the clever way, given DeliverySchema .

It could use some fields as filters, maybe only some types of fields, or maybe only those to which you would have passed use_as_filter=True in the Schema declaration.
It could add _lt and _gt and such to all number and datetime fields. Note there may be a lot of them, with lte, gte, etc.
And so on.

Also, note @frol's advice above:

While suffixes are fine from URL point of view, they would look bloated in Swagger config since you will have every parameter expanded into 6 or 8 parameters (depends on the parameter type: number or string). This is why I probably won't implement querying this way.

If you need complex querying (I mean non-trivial, I totally agree that what you're talking about it not complex API and sounds rather like everyone's needs), maybe you're better-off using a String field in webargs, document it as "my custom query string" in the docs (assuming you're documenting your API with a tool like marshmallow's apispec) and do the query parsing in a custom function.

Beware of the quest for the silver bullet. There may be no point having a frameworks that creates an API allowing the user to do everything on the data. Sometimes, there are things you don't want him to do, so you need to be able to add exceptions, like removing part of the automatically generated fields. And some other things are harmless but useless. Maybe you only need 100% of it.

YMMV. In our case, we realized that we didn't need that many options in the query, so we removed the PQL code (see above) from our application and just added the fields manually for the filters we wanted to provide the user. Overall, there were not that many fields to add.

Hope this helps...

antgel commented 5 years ago

Thanks for that. I agree that I don't need to be able to perform every field / query permutation. If the "suffix magic" I'm looking for is non-trivial to implement, then it is the right thing to "satisfice". If the *Schema and *ArgsSchema split you speak of is the de-facto solution for this in webargs, it's good enough for me. Key insights:

It is not a must to use the same schema for every view function.
Using post_load was not a useful approach because it uses (updates) the same schema.
I hoped for too much magic.
This approach doesn't use a custom parser as suggested previously.

Anyway, I added a separate Schema for GETs and a custom query class on the sqlalchemy side that parses the key for a __, and if it finds one, splits the key into the real key and the query operator. Seems to work so far for my basic __in use case, with a bit of hacking to build the filters for the query.

I like it because I don't think it's overengineered but it feels like it will be extendable. So thanks a lot to all in this thread for the inspiration!

I felt the idea of the custom query field would involve extra amounts of parsing work, maybe I'm right, maybe I'm wrong, but at the moment things work so I'm happy. :)

Aside: Is either of use_args and use_kwargs generally preferred? I understand what each does, but not always in which use cases I should prefer one over the other. I can see that with use_kwargs I don't have to bother iterating through arguments to know what is present, and I can write things like if kwargs["status__in"]:. So is there ever a reason to prefer use_args?

Aside 2: deliveries_schema.jsonify uses the flask_marshmallow jsonify method.

lafrech commented 5 years ago

Our typical PUT (I stripped off unrelated stuff and exception handling).

The arguments decorator uses use_args internally. This allows us to separate item_id, injected by flask from the route, and new_item, injected by use_args.

@blp.route('/<objectid:item_id>')
class BuildingsById(MethodView):

    [...]

    @blp.arguments(BuildingSchema)
    @blp.response(BuildingSchema)
    def put(self, new_item, item_id):
        """Update an existing building"""
        item = Building.get_by_id(item_id)
        item.update(new_item, BuildingSchema)
        item.save()
        return item

antgel commented 5 years ago

@lafrech Sure, but why use_args over use_kwargs? :)

lafrech commented 5 years ago

Well, I like to have new_item in a single variable. If I used use_kwargs, the view func would look like this, I suppose:

@blp.route('/<objectid:item_id>')
class BuildingsById(MethodView):

    [...]

    @blp.arguments(BuildingSchema)
    @blp.response(BuildingSchema)
    def put(self, item_id, **new_item):
        """Update an existing building"""
        item = Building.get_by_id(item_id)
        item.update(**new_item, BuildingSchema)
        item.save()
        return item

And what if the object has an attribute named item_id? Alright, this shouldn't happen.

The object attributes would be mixed with the ID in the function signature, which doesn't convey the intent.

Well that's my understanding. I have no strong opinion about that. And maybe I'm just wrong about use_kwargs.

marshmallow-code / webargs

Using webargs to parse more complex query arguments (string or comparison operators,...) #112

Proof of concept

Questions