kwai / blaze

Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Apache License 2.0
1.3k stars 121 forks source link

[BUG] SQL will get Random Result #652

Closed lianneli closed 3 days ago

lianneli commented 3 days ago

Describe the bug A SQL will get different result after execute.

image

image

To Reproduce test sqls are below.

with new_end_stock as (
      select
          a.v_id,
          a.s_code,
          a.mr,
          sp_code               as sp_code,
          sum(a.untaxed_cost)     as untaxed_cost,
          sum(a.taxed_cost) as tax_cost
      from (
               select
                   v_id               as  v_id
                    ,s_code           as  s_code
                    ,mr               as  mr
                    ,sp_code          as  sp_code
                    ,untaxed_cost  as  untaxed_cost
                    ,taxed_cost   as  taxed_cost
                    ,row_number() over (
                       partition by v_id,s_code,mr
                       order by systime desc NULLS LAST
                   ) as rank
               from
                   blaze_t.tb1
               where dt<='20241120'
           ) a
      where a.rank=1
      GROUP BY a.v_id,a.s_code,a.mr,a.sp_code
  )
  , s AS (
    SELECT
         v_id
        ,s_id as s_id
        ,s_code AS s_code
    FROM blaze_t.tb2
 ),
 in_transit_out as (
     select 
     a.sheet_id,a.mr,max(a.sp_id) as sp_id
     ,max(a.v_id) as v_id
     ,max(a.ref_s_id) as s_id  
     ,max(a.unit) as unit
     ,sum(a.d_qty) as d_qty
     ,sum(a.d_cost_value) as  d_taxed_cost
     ,sum(a.d_net_cost_value) as  d_untaxed_cost
     from blaze_t.tb3 a
     where 
         a.dt <='20241120'
     group by a.sheet_id,a.mr
 ),
 in_transit_in as (
     select 
     a.ref_sheet_id,a.mr
     ,sum(a.r_qty) as r_qty
     ,sum(a.r_cost_value) as  r_taxed_cost
     ,sum(a.r_net_cost) as  r_untaxed_cost
     from blaze_t.tb4 a
     where 
         a.dt <='20241120'
     group by a.ref_sheet_id,a.mr
 )
 , in_transit as (
     select 
          a.v_id
         ,max(s.s_code) as s_code
         ,MAX(a.sp_id) as sp_code
         ,a.mr 
     from in_transit_out a  
     inner join s on 
     (
             a.v_id = s.v_id 
         and a.s_id = s.s_id
     )
     left join in_transit_in b on
     (
         a.sheet_id = b.ref_sheet_id 
         and a.mr=b.mr 
     )
     group by a.v_id,a.s_id,a.mr
 ), union_data as
     (SELECT
           COALESCE(a.v_id,b.v_id)                     as v_id
          ,COALESCE(a.s_code,b.s_code)                   as s_code
          ,COALESCE(a.mr,b.mr)                             as mr
          ,COALESCE(a.sp_code,b.sp_code)             as sp_code
          ,COALESCE(a.untaxed_cost,0)                            as untaxed_cost
          ,COALESCE(a.tax_cost,0)                                as tax_cost
     FROM
         new_end_stock a
     FULL JOIN in_transit b on a.v_id=b.v_id and a.s_code=b.s_code and a.mr=b.mr 
 )
 select * from union_data where  s_code = '223' and mr = '164331' and sp_code = '11600375' and v_id = 7

Expected behavior There will be only one result when execute in origin spark.

Screenshots If applicable, add screenshots to help explain your problem.

Additional context Add any other context about the problem here.