Between the question about batching x rows per PySpark group and another I can't find (probably deleted) that wanted API calls bucketed, it seems there is a gap: a partition id combined with a capped counter, i.e. a stateful counter that increments every x rows and resets on each new partition.
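A minimal sketch of that counter, assuming a plain DataFrame and a hypothetical batch size of 10: mapPartitionsWithIndex supplies the partition id, and the per-partition row index integer-divided by the batch size gives a chunk counter that resets automatically on every new partition.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)  # toy data; stands in for the real input

BATCH_SIZE = 10  # the "x rows" per chunk; an assumed value

def tag_chunks(partition_id, rows):
    # enumerate restarts at 0 for each partition, so the chunk
    # counter resets on every new partition with no shared state
    for i, row in enumerate(rows):
        yield Row(partition_id=partition_id,
                  chunk_id=i // BATCH_SIZE,
                  **row.asDict())

chunked = df.rdd.mapPartitionsWithIndex(tag_chunks).toDF()
```

Because the counter lives inside mapPartitionsWithIndex, this avoids the shuffle that a window over the partition id would trigger.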
This would allow chunking, but ideally the chunks would also be mappable, so perhaps it could be combined with a collect_set and a custom UDF.
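A sketch of the mappable part, assuming the chunked frame from above: grouping on (partition_id, chunk_id) and collecting each chunk into a list yields one row per chunk, which a custom UDF can then map over, e.g. one batched API call per chunk. collect_list is used here instead of collect_set to keep duplicates, and call_api is a hypothetical stand-in for the real call.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def call_api(ids):
    # hypothetical: issue one batched API call per chunk of ids
    return f"sent {len(ids)} ids"  # stand-in for the real response

result = (
    chunked
    .groupBy("partition_id", "chunk_id")
    .agg(F.collect_list("id").alias("ids"))   # one row per chunk
    .withColumn("response", call_api("ids"))
)
```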