sparkutils / quality

A Quality Spark DQ Library
https://sparkutils.github.io/quality/
Apache License 2.0

group by / create batch of max x rows function #46

Open chris-twiner opened 1 year ago

chris-twiner commented 1 year ago

Per "pyspark group and batch of x rows", and another question I can't find (probably deleted) that wanted API calls bucketed, there seems to be a gap: a function that pairs the partition id with a capped counter. It would be stateful, with the batch number incrementing every x rows and the count resetting on each new partition.
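To make the idea concrete, here is a minimal sketch in plain Python rather than anything from Quality itself. The name `tag_batches` and its parameters are hypothetical. It takes a partition id and that partition's rows and tags each row with a `(partition_id, batch_no)` key, where `batch_no` advances every `max_rows` rows and starts again from 0 for each partition. The `(index, iterator)` shape is assumed to match what PySpark's `RDD.mapPartitionsWithIndex` passes to its function, so something like this could probably be plugged in there.

```python
from typing import Any, Iterable, Iterator, Tuple


def tag_batches(
    partition_id: int, rows: Iterable[Any], max_rows: int
) -> Iterator[Tuple[Tuple[int, int], Any]]:
    """Yield ((partition_id, batch_no), row) for every row of one partition.

    Hypothetical helper: batch_no goes up by one every max_rows rows.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for position, row in enumerate(rows):
        # position restarts at 0 on every call, i.e. for every partition,
        # so the batch number resets when a new partition begins.
        yield (partition_id, position // max_rows), row


# Two pretend partitions, batches of at most 2 rows.
partitions = [["a", "b", "c", "d", "e"], ["f", "g"]]
tagged = [
    pair
    for pid, rows in enumerate(partitions)
    for pair in tag_batches(pid, rows, 2)
]
```

The resulting keys are unique per chunk across partitions, so they can serve as a group-by key in a later step.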

This would allow chunking, but ideally the chunks would be mappable, so perhaps it could be combined with a collect_set and a custom UDF.
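As a rough illustration of what "mappable chunks" could mean, the sketch below groups an iterator of rows into lists of at most `max_rows` and applies a function to each list. It is plain Python and deliberately swaps the collect_set plus UDF approach for local chunking of a single partition's iterator; `map_chunks` and its parameters are hypothetical names, not part of Quality or Spark. A generator like this could presumably be used inside PySpark's `RDD.mapPartitions`.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def map_chunks(
    rows: Iterable[Any], max_rows: int, fn: Callable[[List[Any]], Any]
) -> Iterator[Any]:
    """Apply fn to consecutive chunks of at most max_rows rows (hypothetical helper)."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    it = iter(rows)
    while True:
        # Pull the next chunk lazily; the final chunk may be shorter.
        chunk = list(islice(it, max_rows))
        if not chunk:
            return
        yield fn(chunk)


# One call of fn per chunk, e.g. one bucketed API call per batch of rows.
chunk_sizes = list(map_chunks(["a", "b", "c", "d", "e"], 2, len))
chunk_sums = list(map_chunks(range(5), 2, sum))
```

Unlike collect_set, this keeps duplicate rows and their order within a chunk, which may or may not be what the linked questions need.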

chris-twiner commented 1 year ago

Another one: https://stackoverflow.com/questions/76908648/pass-multiple-rows-to-function-using-spark-udf