uwdata / arquero

Query processing and transformation of array-backed data tables.
https://idl.uw.edu/arquero/
BSD 3-Clause "New" or "Revised" License
array_agg and undefined/none values #333

Open carsten-jahn opened 1 year ago

carsten-jahn commented 1 year ago

Thanks a lot for this great library!

I came across an issue with array_agg: I would like to preserve undefined / null values in my dataset and keep them in the aggregated array. However, Arquero skips these values, so the result of array_agg is a shorter array than the input column.

I tried to implement a custom aggregate function, but the result was the same, i.e. these values seem to be filtered out before the aggregation is invoked.

It would be great if there was a way of just aggregating all values and "non-values" in an array.

dldx commented 10 months ago

I just discovered the same issue. It seems like a bug to me?

@carsten-jahn Did you find a solution for this?

How to reproduce:

aq.table({ v: [1, null, 1, 2, 3, 1] })
  .rollup({ a: op.array_agg('v') }) // [1, 1, 2, 3, 1]

carsten-jahn commented 10 months ago

Hi @dldx, I looked into this again. I don't have an elegant solution for this in Arquero. All I can do is set the null values to a specific sentinel constant before calling array_agg, and then replace that constant with null again before using the array elsewhere.
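A minimal sketch of that sentinel round trip in plain JavaScript, so it runs without Arquero installed. The names `SENTINEL`, `encodeMissing`, `decodeMissing`, and `aggSkippingMissing` are hypothetical. `aggSkippingMissing` only imitates the behavior reported in this thread (missing values never reach the aggregate) and is not Arquero's implementation. This version restores both null and undefined as null.

```javascript
// Hypothetical sentinel; pick a value that cannot occur in the real data.
const SENTINEL = '__MISSING__';

// Replace null/undefined with the sentinel so an aggregator that skips
// missing values still sees one entry per row.
const encodeMissing = (values) =>
  values.map((v) => (v == null ? SENTINEL : v));

// Map the sentinel back to null after aggregation.
const decodeMissing = (values) =>
  values.map((v) => (v === SENTINEL ? null : v));

// Stand-in for the behavior described in this thread: the aggregate only
// receives non-missing values. This is NOT Arquero's code.
const aggSkippingMissing = (values) => values.filter((v) => v != null);

const column = [1, null, 1, 2, 3, 1];

// Aggregating directly loses the null and its position.
const direct = aggSkippingMissing(column);

// Encoding first keeps the array length and the position of the gap.
const roundTrip = decodeMissing(aggSkippingMissing(encodeMissing(column)));

console.log(direct);    // [ 1, 1, 2, 3, 1 ]
console.log(roundTrip); // [ 1, null, 1, 2, 3, 1 ]
```

In an Arquero pipeline, the encoding step would presumably be a derive on the column before the rollup, and the decoding step a plain Array map over the aggregated array afterwards. The sentinel must be a value that never appears in the real data, otherwise genuine values would be turned into null.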

I also looked into implementing a custom aggregate function as described in https://uwdata.github.io/arquero/api/extensibility#addAggregateFunction and https://observablehq.com/@uwdata/adding-aggregate-functions-to-arquero. However, this doesn't help either, because its add function is called only for "valid" elements. The state visible in the aggregator does tell you how many invalid elements there are, but not the positions at which they appear.

dldx commented 10 months ago

@carsten-jahn Thanks for replying! I will try to investigate further :)