kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

bug: unexpected non-null behavior seen from `when` function #292

Open jordanrfrazier opened 1 year ago

jordanrfrazier commented 1 year ago

Description `when(condition)` is expected to filter out rows where the condition is false or null, and it does. However, if the output of a `when` is merged with other rows at the same time, something populates that output with a non-null value. I think it is the last non-null value, which would imply that the merge is caching the value incorrectly. Interpolation issue, perhaps?

To Reproduce Steps to reproduce the behavior:

  1. Run the ignored test in when_tests.rs: test_when_output_resets_to_null.

Actual Behavior The test, followed by the snapshot it produces:

async fn test_when_output_resets_to_null() {
    insta::assert_snapshot!(QueryFixture::new("{ \
        count_page: count(PageViews), \
        purchase_is_valid: is_valid(Purchases), \
        count_when_valid: count(PageViews) | when(is_valid(Purchases)) }").run_to_csv(&purchase_fixture().await).await.unwrap(), @r###"
    _time,_subsort,_key_hash,_key,sum_field
    "###);
}

_time,_subsort,_key_hash,_key,count_page,purchase_is_valid,count_when_valid
2022-10-25T00:00:00.000000000,15615443869102979449,1644192944307425184,Davor,1,,
2022-10-26T00:00:00.000000000,15615443869102979450,12688524802574118068,Ben,1,,
2022-10-27T00:00:00.000000000,1305746571793151907,12688524802574118068,Ben,1,true,1
2022-10-27T00:00:00.000000000,1305746571793151908,1644192944307425184,Davor,1,true,1
2022-10-28T00:00:00.000000000,15615443869102979451,12688524802574118068,Ben,2,,1
2022-11-01T00:00:00.000000000,15615443869102979452,12688524802574118068,Ben,3,,1
2022-11-01T00:00:00.000000000,15615443869102979453,1644192944307425184,Davor,2,,1
2022-11-02T00:00:00.000000000,1305746571793151909,12688524802574118068,Ben,3,true,3
2022-11-02T00:00:00.000000000,1305746571793151910,1644192944307425184,Davor,2,true,2
2022-11-24T00:00:00.000000000,15615443869102979454,1644192944307425184,Davor,3,,2
2022-11-25T00:00:00.000000000,15615443869102979455,1644192944307425184,Davor,4,,2
2022-11-26T00:00:00.000000000,15615443869102979456,1644192944307425184,Davor,5,,2
2022-11-27T00:00:00.000000000,1305746571793151911,1644192944307425184,Davor,5,true,5
2022-12-10T00:00:00.000000000,15615443869102979457,12688524802574118068,Ben,4,,3
2022-12-12T00:00:00.000000000,1305746571793151912,12688524802574118068,Ben,4,true,4
2023-01-01T00:00:00.000000000,1305746571793151913,12688524802574118068,Ben,4,true,4
2023-01-01T00:00:00.000000000,15615443869102979459,1644192944307425184,Davor,6,,5
2023-02-07T00:00:00.000000000,15615443869102979460,12688524802574118068,Ben,5,,4
2023-12-31T00:00:00.000000000,15615443869102979458,12688524802574118068,Ben,6,,4

Expected Behavior `count_when_valid` should be null whenever `purchase_is_valid` is null or false.

Additional context `when` produces discrete values, so the last non-null value should not be cached anywhere. Running the test with only the final feature illustrates the difference:

async fn test_when_output_resets_to_null() {
    insta::assert_snapshot!(QueryFixture::new("{ \
        count_when_valid: count(PageViews) | when(is_valid(Purchases)) }").run_to_csv(&purchase_fixture().await).await.unwrap(), @r###"
    _time,_subsort,_key_hash,_key,sum_field
    "###);
}

_time,_subsort,_key_hash,_key,count_when_valid
2022-10-27T00:00:00.000000000,1305746571793151907,12688524802574118068,Ben,1
2022-10-27T00:00:00.000000000,1305746571793151908,1644192944307425184,Davor,1
2022-11-02T00:00:00.000000000,1305746571793151909,12688524802574118068,Ben,3
2022-11-02T00:00:00.000000000,1305746571793151910,1644192944307425184,Davor,2
2022-11-27T00:00:00.000000000,1305746571793151911,1644192944307425184,Davor,5
2022-12-12T00:00:00.000000000,1305746571793151912,12688524802574118068,Ben,4
2023-01-01T00:00:00.000000000,1305746571793151913,12688524802574118068,Ben,4
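
To make the difference concrete, here is a rough standalone model of the two behaviors. It is only a sketch and does not touch Kaskada's actual merge code: `Row`, `when_discrete`, `when_carry_forward`, and `ben_rows` are hypothetical names made up for illustration. The sample data is Ben's first five rows from the full output above. The carry-forward variant happens to reproduce the values seen in the actual output, which is consistent with (but does not prove) the "last non-null value is cached" guess.

```rust
/// One merged output row for a single entity (hypothetical model type).
#[derive(Debug, Clone, Copy)]
pub struct Row {
    /// Running `count(PageViews)` at this row.
    pub count_page: u64,
    /// `is_valid(Purchases)` at this row; `None` models a null.
    pub purchase_is_valid: Option<bool>,
}

/// Expected (discrete) behavior: the count is emitted only on rows
/// where the condition is true, and is null everywhere else.
pub fn when_discrete(rows: &[Row]) -> Vec<Option<u64>> {
    rows.iter()
        .map(|r| match r.purchase_is_valid {
            Some(true) => Some(r.count_page),
            _ => None,
        })
        .collect()
}

/// Suspected buggy behavior: the last value that passed the filter is
/// remembered and repeated on later rows where the condition is null.
pub fn when_carry_forward(rows: &[Row]) -> Vec<Option<u64>> {
    let mut last = None;
    rows.iter()
        .map(|r| {
            if r.purchase_is_valid == Some(true) {
                last = Some(r.count_page);
            }
            last
        })
        .collect()
}

/// Ben's first five rows (2022-10-26 through 2022-11-02) from the
/// three-column output in this issue.
pub fn ben_rows() -> Vec<Row> {
    vec![
        Row { count_page: 1, purchase_is_valid: None },
        Row { count_page: 1, purchase_is_valid: Some(true) },
        Row { count_page: 2, purchase_is_valid: None },
        Row { count_page: 3, purchase_is_valid: None },
        Row { count_page: 3, purchase_is_valid: Some(true) },
    ]
}
```

On these rows `when_discrete` yields null, 1, null, null, 3 (what I expected), while `when_carry_forward` yields null, 1, 1, 1, 3 (what the snapshot shows for Ben).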

kerinin commented 1 year ago

Possibly related - I've seen cases where these produce different results:

foo | when(p1) | when(p2)
foo | when(p1 and p2)

This may be more predictable if we only produce discrete values where the RHS is defined at the time of the LHS (i.e., don't fabricate null rows).
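
For reference, a small row-wise sketch of why the two forms would be expected to agree. This is a toy model, not Kaskada code: `and3` and `when` are hypothetical helpers, and the three-valued `and` shown here (false dominates null) is an assumption about the semantics. The conclusion does not depend on that choice, since a null-propagating `and` also yields true only when both inputs are true.

```rust
/// Three-valued AND where `None` models null (assumed semantics:
/// false dominates null, otherwise null propagates).
pub fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Row-wise model of `value | when(cond)`: the value passes through
/// only when the condition is true; false and null both filter it out.
pub fn when<T>(value: Option<T>, cond: Option<bool>) -> Option<T> {
    if cond == Some(true) {
        value
    } else {
        None
    }
}

/// Checks all nine combinations of (p1, p2) over {null, false, true}
/// and reports whether chaining matches the combined predicate.
pub fn chained_equals_combined() -> bool {
    let conds = [None, Some(false), Some(true)];
    conds.iter().all(|&p1| {
        conds.iter().all(|&p2| {
            when(when(Some(7), p1), p2) == when(Some(7), and3(p1, p2))
        })
    })
}
```

In this model `chained_equals_combined()` holds for every combination, so evaluated on a single row the two forms cannot differ. Any difference would have to come from which rows exist at each time, which fits the suggestion about not fabricating null rows.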