For grouping large datasets on continuous data types, binning functions like cut and qcut are essential. However, these are not supported in streaming mode, so currently no aggregation over bins is possible for out-of-memory datasets.
A small toy example results in the following query plan, where only the DataFrame scan ends up in the STREAMING section:
AGGREGATE
[col("b").mean().alias("mean")] BY [col("bins")] FROM
WITH_COLUMNS:
[col("a").cut().alias("bins")]
STREAMING:
DF ["a", "b"]; PROJECT 2/2 COLUMNS; SELECTION: None
As far as I can tell, cut should work with streaming data without big changes, because its breaks are known up front, but qcut needs to see all the data in order to compute correct breaks. If this is welcome, I could contribute it myself, but I do not understand how the streaming API is implemented, so I would need some help.
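To illustrate why cut looks streamable, here is a minimal sketch in plain Python. It is not how Polars implements anything: `cut_label` and `streaming_binned_mean` are hypothetical helpers, and the right-closed interval convention is an assumption. The point is only that, with fixed breaks, each chunk can be binned on its own and the per-bin mean can be kept as a running sum and count.

```python
from bisect import bisect_left
from collections import defaultdict


def cut_label(value, breaks):
    """Return the bin index of `value` for sorted `breaks`.

    Assumes right-closed intervals: a value equal to a break
    falls into the lower bin.
    """
    return bisect_left(breaks, value)


def streaming_binned_mean(chunks, breaks):
    """Mean of `b` per bin of `a`, consuming one chunk at a time.

    Only a running sum and count per bin are kept in memory,
    so the full dataset never has to be materialized.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chunk in chunks:
        for a, b in chunk:
            bin_idx = cut_label(a, breaks)
            sums[bin_idx] += b
            counts[bin_idx] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Two chunks of (a, b) rows stand in for batches of a larger dataset.
chunks = [
    [(0.5, 10.0), (1.5, 20.0)],
    [(2.5, 30.0), (0.7, 14.0)],
]
result = streaming_binned_mean(chunks, breaks=[1.0, 2.0])
print(result)
```

qcut does not fit this pattern directly, because its breaks are quantiles of the whole column and are unknown until every chunk has been seen, so it would need either a first pass over the data or an approximate quantile estimate.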
We will eventually try to make all operations streaming. However, having an issue for every operation isn't useful and will clutter the issue board, so I will close this one.