Open argenisleon opened 4 years ago
This does not seem to be a streaming problem; you should be able to achieve what you want with pandas alone.
Assuming that you have a stream of pandas dataframe rows (so those would be pandas Series) and not just a single dataframe (because then @martindurant is right: you do not need streamz), you could do something like this:
```python
import pandas as pd
from operator import add
from streamz import Stream

stream = Stream()
(stream
 .partition(100)                                          # group every 100 emitted rows
 .to_batch(example=pd.DataFrame(columns=["my_column"]))   # convert each group to a streaming batch
 .to_dataframe()["my_column"]
 .value_counts()                                          # per-batch counts
 .accumulate_partitions(add))                             # running aggregate across batches
```
For example, if you feed it with your series:
```python
import pandas as pd
from time import sleep

for i in range(200):
    stream.emit(pd.Series({"my_column": i % 16}))
    sleep(0.001)
```
This is to make it "real-time"ish :)
Thinking about it, you could ~aggregate~ accumulate with a collections.Counter; you don't really need dataframes at all here.
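A minimal pure-Python sketch of that Counter idea (no streamz or pandas needed; the `chunked_value_counts` helper is hypothetical, just to illustrate chunked accumulation):

```python
from collections import Counter
from itertools import islice

def chunked_value_counts(values, chunk_size):
    """Yield a snapshot of the running value counts after each chunk."""
    it = iter(values)
    total = Counter()
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        total += Counter(chunk)   # aggregate this chunk into the running total
        yield dict(total)         # snapshot of the counts so far

# Same toy data as above: 200 values cycling through 0..15, in chunks of 100
results = list(chunked_value_counts((i % 16 for i in range(200)), 100))
```

Each yielded snapshot is the aggregate of every chunk seen so far, which is exactly the "output after every n elements" behaviour asked for.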
Hi,
I would like to calculate the `.value_counts()` from a pandas dataframe in chunks of n elements and output the aggregated result. For example, say I have 1000 elements: calculate the `value_counts()` for the first 100 and output the result, then aggregate that result with the next 100 elements and output it again.

I tried
but I get
BTW, I am not sure this is the best approach. Any hint?
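Since the reply above suggests this is doable with pandas alone, here is a hedged sketch of the chunked aggregation without any streaming library (`df` and `my_column` stand in for your actual data):

```python
import pandas as pd

# Sample data standing in for your 1000 elements
df = pd.DataFrame({"my_column": [i % 16 for i in range(1000)]})

chunk_size = 100
running = pd.Series(dtype="int64")
for start in range(0, len(df), chunk_size):
    chunk = df["my_column"].iloc[start:start + chunk_size]
    # add this chunk's counts to the running total, aligning on value labels
    running = running.add(chunk.value_counts(), fill_value=0).astype(int)
    print(running)  # aggregated result after each chunk
```

`Series.add(..., fill_value=0)` aligns the two count Series on their index (the distinct values), so values unseen in earlier chunks are handled correctly.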