ponyorm / pony

Pony Object Relational Mapper
Apache License 2.0
3.64k stars 245 forks source link

How to filter query inside group statement #20

Closed JakeAustwick closed 10 years ago

JakeAustwick commented 11 years ago

Title probably doesn't describe this, wasn't sure how to explain it.

Basically, I want to filter my DB columns via url parameters. With a normal select statement, I have this working fine. However I now need to run the filters on a select statement within another select statement and then call .count() on it, the following example will probably explain more.

def model_with_filters(model, aid, filters):
    query = select(x for x in model)
    for column_name, value in { k: v for k, v in filters.iteritems() if k in model._columns_ and v != ""}.iteritems():
        query = query.filter("lambda x: x.%s == '%s'" % (column_name, value))
    return query.filter("lambda x: x.aff_id== aid")

# This is what I want to do
select((a, model_with_filters(SendPin, a.aff_id, request.args).count) for a in Affiliate)
TypeError: Function 'model_with_filters' cannot be used inside query

However I am not allowed to use a function within the select query. What's the recommended way of being able to apply filters to the count statement if I can't call a function in there?

Also, using the lambdas like above seems really funky, but it's the only way I could get it to work.

kozlovsky commented 11 years ago

I've got what you mean; let me think on how that can be done. I'll get back to you shortly on this.

As a side note, I'd like to point out that the way you use lambda has a serious vulnerability known as SQL injection. In the following statement you inject the values of two variables column_name and value into the lambda text.

    for column_name, value in ...
        query = query.filter("lambda x: x.%s == '%s'" % (column_name, value))

The column_name might work fine because you check if such column exists in the model._columns_ set first. But when you substitute value, it can be a problem, because it can contain a malicious code. You should never substitute the data received from the user into the query text. Instead, you can just put the name of the variable into the query text. Pony will take the value for this variable from locals. This way the value will be passed to the query as a parameter which completely eliminates the risk of SQL injection:

    for column_name, value in ...
        query = query.filter("lambda x: x.%s == value" % column_name)

In this case Pony gets the value of the value variable from locals and passes it to the query as a parameter. This completely eliminates the risk of SQL injection.

I’ll get back to you soon.

JakeAustwick commented 11 years ago

Hey,

Thanks for the quick reply. Yeah I was aware of the injection, but only I'll be the only one using this page and it's password protected, so went with it anyway for now because I wasn't aware I could just substitute local variables in. Will change that now.

kozlovsky commented 11 years ago

Back to your question:

Since filter() is only applicable to the main query, you cannot apply it to a subquery. But there is another way to solve this problem -- it is possible to write entire query as a text. For example, instead of a query:

    x = 100
    select(p.name for p in Product if p.price < x)

it is possible to write:

    x = 100
    result = select("p.name for p in Product if p.price < x")[:]

Then you can dynamically generate some conditions and insert them into the query text:

    conditions = ...
    result = select("p.name for p in Product if p.price < x and %s" % conditions)[:]

This way works for subqueries too:

    conditions = ...
    query_text = """
        (c, count(p for p in category.products if %s)) for c in Category
    """ % conditions
    result = select(query_text)[:]

So, this is how I'd solve your problem. First, I'd construct the params dictionary with the name-value pairs for dynamic query parameters:

    model = SendPin  # if this code placed inside function, this may be function parameter
    params = {}
    for column_name, value in request.args.iteritems:
        if column_name in model._columns_ and value:
            params[column_name] = value

Then I'd construct string with the dynamically generated conditions:

    conditions = ' '.join('and x.%s == params["%s"]' % (param_name, param_name)
                          for param_name in params)

I avoided SQL injection here by using expression params["some_param_name"] instead of substituting the param value.

Then it is possible to construct the full query text and execute query:

    query_text = """
        (a, count(x for x in model
                  if x.aff_id == a.id %s))
        for a in Affiliate
    """ % conditions

    # values of 'model' and 'params' variables will be used here
    result = select(query_text)[:]

Hope this solves your problem.

Also I’d like to let you know that we now have the mailing list where you can ask any questions regarding PonyORM or suggest new features. This way is more convenient then creating issues on GitHub, because Github tracker is used mainly for bugs. Please join our mailing list at http://ponyorm-list.ponyorm.com, I will be glad to answer any of your questions there :)

JakeAustwick commented 11 years ago

Didn't realise about the mailing list, my bad.

Looks like that method will work, it seems a little bit funky to use strings everywhere, but I guess if that's the only way then I'll go with it.

Is there a specific reason that functions aren't allowed in queries? Couldn't the function just be replaced with it's return value and then you run the usual magic pony code on it. My initial method just seems a much nicer way of doing this (if functions were allowed in queries).

Thanks, Jake

kozlovsky commented 11 years ago

When Pony translates a query, it must determine have to evaluate each expression contained in this query. There are two ways expression can be handled:

  1. It may be evaluated in Python before the query was sent to the database, and then the final result of expression substituted into the query as a parameter. It is possible to use user-defined functions inside such expressions.
  2. It may be translated into SQL and sent to the database as a part of SQL query. It is not possible to use user-defined functions here, because Pony doesn't know how to translate such functions to SQL.

For example, let's consider the query:

    select(p for p in Product if p.price < int(request.args["price"])
                                 and p.name.startswith("A"))

In this query, there are expression int(request.args["price"]) which is evaluated before query execution and then the resulted value used as a query parameter. You can see built-in int function here, but any used-defined function can be placed here as well. This part of query evaluated in Python.

Also, you can see expression p.name.startswith("A"). This expression is evaluated not as a single value, but for each row of corresponding database table, so it must be translated to SQL as p.name LIKE 'A%'. Because of this, it is not allowed to use arbitrary user-defined functions here, because such arbitrary user-defined function cannot be translated to SQL.

In your query, model_with_filters() function evaluated multiple times, one time for each row of query result. Therefore, it cannot be executed in Python and must be translated to SQL. But translation of arbitrary Python code to SQL is too complex task, therefore Pony can manage translation of generators and lambdas only.

On the other side, Pony allows user-defined functions which evaluate into single value or into list of values which then used as a parameter or as a list of parameters. For example, in our pony.orm.examples.estore example we can define function which return list of Products with price in specified interval:

>>> from pony.orm.examples.estore import *
>>> def products_by_price(min_price, max_price):
...     return select(p for p in Product if p.price >= min_price
...                                         and p.price <= max_price)[:]
... 
>>> products_by_price(150, 300)

SELECT "p"."id", "p"."name", "p"."description", "p"."picture", "p"."price", "p"."quantity"
FROM "Product" "p"
WHERE "p"."price" >= ?
  AND "p"."price" <= ?

[Product[1], Product[6]]

Then we can use this function in a query "select all categories which include specified products"

>>> select(c for c in Category for p in c.products
             if p in products_by_price(150, 300))[:]

SELECT DISTINCT "c"."id", "c"."name"
FROM "Category" "c", "Category_Product" "t-1"
WHERE "t-1"."product" IN (?, ?)
  AND "c"."id" = "t-1"."category"

[Category[1], Category[3], Category[4]]

Our function evaluates into a list of two products, and then you can see SQL condition WHERE "t-1"."product" IN (?, ?). Pony can translate this query, because it is not required to translate function code to SQL here.

JakeAustwick commented 11 years ago

I understand that compiling arbitrary user functions to SQL would be impossible, however in my specific case my function actually returns a built in pony function.

See:

return query.filter("lambda x: x.aff_id== aid")

So would it not be possible in this case to translate it to SQL?

kozlovsky commented 11 years ago

There are two ways translation to SQL can be done:

  1. Translation of function code
  2. Translation of function result value

In your case, you suggest to translate function result only, but to obtain this result we must evaluate function code, and to evaluate function code we must calculate function arguments at first.

The problem is with second argument aid in function definition:

    def model_with_filters(model, aid, filters): ...

This argument take its value from specific instance of Affiliate object from the database:

select((a, model_with_filters(SendPin, a.aff_id, request.args).count) for a in Affiliate)

This means that to execute this function we must at first retrieve aff_id value from the database. Because of this, we cannot evaluate function and analyze its result value before SQL query execution.

If all of the function arguments can be calculated before retrieving data from the database, Pony will execute function and then will analyze its result value.

JakeAustwick commented 11 years ago

Ahh, I understand the problem now. Thanks for being so in depth with your explanations.

kozlovsky commented 11 years ago

Glad to hear, I'll wish more Pony users understand expression evaluation details in Pony, but this topic sometimes can be really tricky to understand and describe :)

As a semi-related node, in future I plan to add user-defined methods and properties to entities which can be used inside query. This methods must be one-liners like lamdas. There is the rough example of future syntax:

    class Product(db.Entity):
        name = Required(unicode)
        price = Required(Decimal)
        quantity = Required(int)

        @property
        def in_stock(self):
            return self.quantity > 0

        def cheaper_then(self, price):
            return self.price < price

    ...

    select(p for p in Product if p.in_stock and p.cheaper_then(500))

This is just a syntactic sugar and doesn't a big deal in any way, but may simplify some complex queries a bit.

JakeAustwick commented 11 years ago

I think it's a great idea though, makes the queries more obvious at first glance.

I think a convenience function similar to what I am doing would be useful, although I guess you don't want to clutter the library. A function that filters by dict {'column', 'value'} is a fairly common use case I would imagine.

filter_dict = {'user_id': 6, 'category_id' : 29}
select(bp for bp in BlogPost).filter_by_dict(filter_dict)
kozlovsky commented 11 years ago

I don't mind adding some helper method/function for this use case, but I'm slightly in doubt this use case (as specified) is common enough. It seems to me, in many cases other operation beside equality comparison should be used. For example, I can imagine query like this:

    select(p for p in Product
        if p.name.startswith(request.args["name"])
        and p.price >= (request.args["min_price"])
        and p.price <= (request.args["max_price"])
        and request.args["category_name"] in p.categories.name
        and p.date_first_available.year == request.args["year"]
    )[:int(request.args["products_per_page"])]

I think that situation when equality operation used for all dynamic parameters may be not very common in practice to justify method adding. Probably we should gather more statistics about typical use-cases at first.

But maybe I can add **kwargs argument to the current filter method to allow such queries:

    params = { 'bp.user.id' : 6, 'bp.category.id' : 29 }
    q = select(bp for bp in BlogPost).filter(**params)

(By the way, I'm wonder about you entities. In Pony it is not common to have property names like "user_id" or "category_id", because it is more convenient to have properties "user" and "category" whose types are entities and not just some ids. If you diagram is not a secret, you can send it to me at alexander.kozlovsky@gmail.com as a python file with entity definitions, and I can give some recommendation about this diagram design in Pony)

JakeAustwick commented 11 years ago

Yeah, a filter method which accepted **kwargs would be exactly what I was going for. I tried that originally but then realised it wasn't implemented, hence going with the loop approach that I used.

And the user_id and category_id aren't examples of my actual models, I just used them as examples. I do have entities made for all my database tables.

JakeAustwick commented 10 years ago

Any update on this? I think a filter() method that accepted **kwargs would be a useful addition to pony.

kozlovsky commented 10 years ago

Sorry about that, was busy with implementing the updated transaction model. Agree, that would be an useful addition, this is on our list, will implement it soon. Thanks!

JakeAustwick commented 10 years ago

Awesome addition, thanks