tada / pljava

PL/Java is a free add-on module that brings Java™ Stored Procedures, Triggers, Functions, Aggregates, Operators, Types, etc., to the PostgreSQL™ backend.
http://tada.github.io/pljava/
Other
238 stars 77 forks source link

How to implement a special top-N group function? #494

Open shenjianrong opened 6 days ago

shenjianrong commented 6 days ago

The table orders has some fields as following: orderkey, product, amount, order_month, ...... I want to implement a UDF to use like: select product, udf_topN(amount, -5) from orders group by product this query means selecting the top 5 records by amount descending sort in each product group. Can I implement this function using PL/Java and how to do? Please tell me where I can find such documents or information.

jcflack commented 5 days ago

I think that the way you want the udf_topN function to work may be impractical in any PL, as you propose a function that is simply called in the select-list of a query, but ends up somehow influencing the ORDER BY and LIMIT of the whole query that contains it. Also, because this function is applied to a non-grouped column in a query with a GROUP BY, it will need to be an aggregate function (I assume you intend the semantics of sum).

It is easy enough to write an aggregate function in PL/Java, the difficulty still remains in having a function that is simply called within a query somehow affect the ORDER BY and LIMIT of the containing query. Even if such a function could somehow be written, such magical behavior would tend to astonish anyone trying to understand a query that uses it.

If you wanted instead to write an orders_topN function that did the whole query, that would be simple. so simple PL/Java would hardly be necessary, and plain SQL would suffice:

CREATE OR REPLACE FUNCTION orders_topN(n integer)
 RETURNS TABLE(product text, amount integer)
 STABLE
BEGIN ATOMIC
 SELECT product, total_amount
    FROM (
      SELECT product, sum(amount) AS total_amount
      FROM orders
      GROUP BY product
    ) AS q
  ORDER BY total_amount*sign(n)
  LIMIT abs(n);
END
$ select (orders_topN(1)).*;
 product | amount 
---------+--------
 tokamak |      2

$ select (orders_topN(-1)).*;
 product | amount 
---------+--------
 widget  |     31

(The reason for the nested SELECT is so that the outer ORDER BY can use total_amount in an expression (like total_amount*sign(n)). When ORDER BY just gives a name, that's allowed to be a column name in the result the query will create, but in ORDER BY any kind of expression, the names in the expression must be things already known.

If what you want is a function that takes n and the name of a column, that will aggregate just the named column, grouping by all the others, and applying an order and limit, that would be trickier. The caller would have to supply the name of the column (when you just use the column in the function call as your example does, only a value is passed, and the function doesn't know where it came from). The function would then have to generate appropriate SQL and execute it.

If you wanted a function like that but polymorphic, so you could give it any table to operate on and have it return rows of that table's type (but grouped and with the named one aggregated), there might be a way to accomplish that, using PostgreSQL's support for polymorphic functions. It probably wouldn't be pretty, and might need a dummy argument of the table's row type. I'm not sure it would be worth the effort, except for the pure challenge of it.