risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

bug: a reasonable SQL query can't be run due to NLJ #9907

Open TennyZhuang opened 1 year ago

TennyZhuang commented 1 year ago

Describe the bug

-- create table
CREATE TABLE address_balances (
    address TEXT,
    balance BIGINT
);

-- insert sample data
INSERT INTO address_balances (address, balance)
VALUES
    ('a1', 100),
    ('a2', 10000),
    ('a3', 100000),
    ('a4', 1000000);

-- run query
SELECT
    balance_range,
    COUNT(*) AS num_addresses,
    SUM(balance) AS total_balance,
    (SUM(balance) / (SELECT SUM(balance) FROM address_balances)) * 100 AS percentage_of_total_supply
FROM (
    SELECT
        address,
        balance,
        CASE
            WHEN balance >= 1000000 THEN 'whales'
            WHEN balance >= 10000 AND balance < 1000000 THEN 'dolphins'
            ELSE 'shrimps'
        END AS balance_range
    FROM
        address_balances
) AS balance_ranges
GROUP BY
    balance_range;

To Reproduce

Run EXPLAIN on the query above; it produces the following plan:

 BatchProject { exprs: [$expr1, count, sum(address_balances.balance), ((sum(address_balances.balance) / sum(sum(address_balances.balance))) * 100:Decimal) as $expr2] }
 └─BatchNestedLoopJoin { type: LeftOuter, predicate: true }
   ├─BatchExchange { order: [], dist: Single }
   | └─BatchHashAgg { group_key: [$expr1], aggs: [count, sum(address_balances.balance)] }
   |   └─BatchExchange { order: [], dist: HashShard($expr1) }
   |     └─BatchProject { exprs: [Case((address_balances.balance >= 1000000:Int32), 'whales':Varchar, ((address_balances.balance >= 10000:Int32) AND (address_balances.balance < 1000000:Int32)), 'dolphins':Varchar, 'shrimps':Varchar) as $expr1, address_balances.balance] }
   |       └─BatchScan { table: address_balances, columns: [balance] }
   └─BatchSimpleAgg { aggs: [sum(sum(address_balances.balance))] }
     └─BatchExchange { order: [], dist: Single }
       └─BatchSimpleAgg { aggs: [sum(address_balances.balance)] }
         └─BatchScan { table: address_balances, columns: [balance] }
(11 rows)

The plan contains a nested-loop join (NLJ) even though its two inputs are tiny (3 rows and 1 row). Is there a solution so that we can run this query in streaming mode?
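
For context, the streaming planner can be probed the same way. A minimal sketch, assuming a hypothetical view name pct_by_range; as described in this issue, the materialized view cannot actually be created because the streaming plan would require a nested-loop join:

-- Probe the streaming plan for the same query (the view name pct_by_range
-- is made up for illustration). Per this issue, creating the materialized
-- view fails because the plan needs a nested-loop join.
EXPLAIN CREATE MATERIALIZED VIEW pct_by_range AS
SELECT
    balance_range,
    COUNT(*) AS num_addresses,
    SUM(balance) AS total_balance,
    (SUM(balance) / (SELECT SUM(balance) FROM address_balances)) * 100 AS percentage_of_total_supply
FROM (
    SELECT
        balance,
        CASE
            WHEN balance >= 1000000 THEN 'whales'
            WHEN balance >= 10000 AND balance < 1000000 THEN 'dolphins'
            ELSE 'shrimps'
        END AS balance_range
    FROM address_balances
) AS balance_ranges
GROUP BY balance_range;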

Expected behavior

No response

Additional context

No response

chenzl25 commented 1 year ago

Here is a workaround that makes it possible to run this query as a streaming query in our system, although the rewritten query is hard to write by hand. It also seems something is wrong in our binder when binding the subquery (this needs investigation).

CREATE TABLE address_balances (
    address TEXT,
    balance BIGINT,
    col int default 1
);

INSERT INTO address_balances (address, balance)
VALUES
    ('a1', 100),
    ('a2', 10000),
    ('a3', 100000),
    ('a4', 1000000);

create materialized view v as  SELECT
    balance_range,
    COUNT(*) AS num_addresses,
    SUM(balance) AS total_balance,
    (SUM(balance) / (SELECT SUM(balance) FROM (select * from address_balances where col = balance_ranges.col) T  ) ) * 100 AS percentage_of_total_supply
FROM (
    SELECT
        address,
        col,
        balance,
        CASE
            WHEN balance >= 1000000 THEN 'whales'
            WHEN balance >= 10000 AND balance < 1000000 THEN 'dolphins'
            ELSE 'shrimps'
        END AS balance_range
    FROM
        address_balances
) AS balance_ranges
GROUP BY
    balance_range, col;
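
For completeness, a quick check of the workaround view; the groups and figures in the comments below are derived by hand from the sample data, not copied from RisingWave output. The constant col column presumably lets the planner unnest the now-correlated subquery into an equi-join on col instead of a nested-loop join.

-- Query the materialized view defined above. With the sample data the
-- total balance is 1,110,100, so the three groups should be roughly:
--   whales   (a4):      1 address,   total 1,000,000  (~90.08%)
--   dolphins (a2, a3):  2 addresses, total   110,000  (~9.91%)
--   shrimps  (a1):      1 address,   total       100  (~0.01%)
SELECT * FROM v ORDER BY total_balance DESC;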
github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future: it's searchable and can be reopened when it's time. 😄