laughingman7743 / PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
MIT License
464 stars 105 forks source link

Supporting FILTER and other similar operations #480

Open yuvi17 opened 1 year ago

yuvi17 commented 1 year ago

Athena has support for filter function that can be used to filter values in a array and return the filtered array. https://docs.aws.amazon.com/athena/latest/ug/filtering-arrays.html

It will be quite helpful if we can have an equivalent function in pyathena's SQLAlchemy dialect.


Use Case

I have a table that contains an array of timestamps and an action performed by an actor. Something like

[ {"timestamp": "2023-10-10", "action_count": 5}, {"timestamp": "2023-10-11", "action_count": 6}...]

I would want to write a query which can give me something like, count of times this action was performed more than 3 times b/w certain dates. Or users who performed this action a total of more than 3 times b/w certain dates.

If had to write Athena SQL for this, it would look something like this

Select * from actors
where
count(filter(action, x -> x.timestamp <= '2023-10-10' and x.timestamp >= '2023-10-01' and x.action_count >= 2))

I would like to do something like this for a table with multiple such columns and using and or etc. between those.

@laughingman7743 Is there any way I can do this today?

I hate the idea of writing the query as Python F-strings and filling in the values.

yuvi17 commented 1 year ago

A query built using sqlalchemy + pyathena something like this

query = select().where(
    func.cardinality(func.filter(
        "action",
        lambda x: (
            func.cast(x['timestamp'], Date) <= '2020-09-07',
            func.cast(x['timestamp'], Date) >= '2020-06-01',
            func.cast(x['action_count'], Integer) > 3
        )
    ))
)

results in a UnsupportedCompilationError with exceptions

        operator_ = override_operator or binary.operator
        disp = self._get_operator_dispatch(operator_, "binary", None)
        if disp:
            return disp(binary, operator_, **kw)
        else:
            try:
>               opstring = OPERATORS[operator_]
E               KeyError: <built-in function getitem>

../../../opt/anaconda3/envs/py39/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py:2201: KeyError

The above exception was the direct cause of the following exception:
.....
.....
.....

        elif replace_context is not None:
            # no good solution here, we would like to have the exception
            # have only the context of replace_context.__context__ so that the
            # intermediary exception does not change, but we can't figure
            # that out.
            exception.__cause__ = replace_context

        try:
>           raise exception
E           sqlalchemy.exc.UnsupportedCompilationError: Compiler <pyathena.sqlalchemy_athena.AthenaStatementCompiler object at 0x7f98cba92190> can't render element of type <built-in function getitem> (Background on this error at: https://sqlalche.me/e/14/l7de)

../../../opt/anaconda3/envs/py39/lib/python3.9/site-packages/sqlalchemy/util/compat.py:208: UnsupportedCompilationError

This is after creating an engine with athena and passing the dialect to compile function of sqlalchemy.

laughingman7743 commented 1 year ago

If there is anything missing from the implementation in PyAthena, pull requests are welcome.