python-streamz / streamz

Real-time stream processing for python
https://streamz.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Compose with collections #2

Open mrocklin opened 7 years ago

mrocklin commented 7 years ago

We probably want a Stream variant that moves around not individual elements, but batches or sequences of elements. We probably also want a Stream variant that moves around Pandas dataframes. Each of these would probably want a different API. For example, map on a batched stream might look like the following:

import builtins

class map(BatchStream):
    def __init__(self, func):
        self.func = func
        ...
    def update(self, batch):
        # apply func element-wise, but emit the whole transformed batch downstream
        new_batch = list(builtins.map(self.func, batch))
        self.emit(new_batch)

However, each of these new collection-wise interfaces would probably want to compose with both the lower-level local and Dask Stream objects.

To that end, maybe it makes sense to encapsulate a low-level stream within a user-level stream, as sketched below.
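Here is a minimal sketch of that encapsulation, assuming only the existing low-level Stream API (map/sink/emit); the BatchStream class and its methods are hypothetical names, not anything in the library:

from streamz import Stream

class BatchStream:
    '''Hypothetical user-level wrapper around a low-level Stream
    whose events are batches (lists) of records.'''
    def __init__(self, stream=None):
        # all actual event movement is delegated to the encapsulated stream
        self.stream = stream if stream is not None else Stream()

    def map(self, func):
        # element-wise map; the low-level stream still moves whole batches
        return BatchStream(self.stream.map(lambda batch: [func(x) for x in batch]))

    def sink(self, func):
        return self.stream.sink(func)

    def emit(self, batch):
        self.stream.emit(batch)

b = BatchStream()
b.map(lambda x: x + 1).sink(print)
b.emit([1, 2, 3])  # prints [2, 3, 4]

The user-level object only translates element-wise operations into batch-wise ones; everything about moving events around stays in the encapsulated low-level stream, so the same wrapper could sit on top of either a local or a Dask-backed stream.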

jrmlhermitte commented 7 years ago

Before doing so, I would suggest normalizing the input/output. For example:

class StreamDoc(dict):
    def __init__(self, *args, **kwargs):
        '''Normalized inputs/outputs.

        attributes : the metadata
        outputs    : a dictionary of outputs
        argnames   : keys into the outputs dictionary naming the positional arguments
        kwargnames : keys into the outputs dictionary naming the keyword arguments
        '''
        self['attributes'] = dict()
        self['outputs'] = dict(**kwargs)
        self['kwargnames'] = list(kwargs.keys())
        self['argnames'] = list()
        for i, arg in enumerate(args):
            key = "_arg{}".format(i)
            self['outputs'][key] = arg
            self['argnames'].append(key)
        # needed so stream methods can distinguish a StreamDoc
        self['_StreamDoc'] = 'StreamDoc v1.0'

    # ... add methods to select/add/remove inputs/outputs, modify attributes, etc.
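For illustration, a quick usage sketch of the StreamDoc above; the 'sample' metadata entry is just a made-up example:

doc = StreamDoc(1, 3, scale=2)
doc['attributes']['sample'] = 'test'  # hypothetical metadata entry

# recover the positional and keyword arguments from the document
args = [doc['outputs'][k] for k in doc['argnames']]         # [1, 3]
kwargs = {k: doc['outputs'][k] for k in doc['kwargnames']}  # {'scale': 2}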

The reason I mention this is that I think it would be coupled with the collections idea. A list of things coming from a stream could be interpreted in a few ways:

  1. an ordered sequence of arguments to feed into the function
  2. a separate computation for each element

For example, method 1 would yield something like:

def add(a, b):
    return a + b

s1 = Stream()
s2 = s1.map(add)
s2.sink(print)

# the 'stop' markers presumably delimit successive argument groups
s1.emit([1, 2, 'stop', 1, 3, 'stop'])

whereas method 2 would be something like:

def add(a, b):
    return a + b

s1 = Stream()
s2 = s1.map(add)
s2.sink(print)

# roughly...
s1.emit([StreamDoc(1, 2), StreamDoc(1, 3)])

When adding collections, was one of these two methods in mind? Or is there another, better way? Just some thoughts. I'm inclined to go for method 2; we're currently using dask and have an object similar to that of method 2.

For method 2, feeding the arguments into the function could be as simple as decorating the update routines. Something like:

def parse_streamdoc(name):
    def streamdoc_dec(f):
        def f_new(x, **kwargs_additional):
            args = list()
            for argkey in x['argnames']:
                args.append(x['outputs'][argkey])
            # same for kwargs
            kwargs = {key: x['outputs'][key] for key in x['kwargnames']}
            kwargs.update(kwargs_additional)
            return f(*args, **kwargs)
        return f_new
    return streamdoc_dec
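As a quick sanity check, here is how that decorator might be wired into a stream; this assumes the StreamDoc and parse_streamdoc sketches above, with add being the same toy function as in method 2:

s1 = Stream()
s2 = s1.map(parse_streamdoc('add')(add))
s2.sink(print)

s1.emit(StreamDoc(1, 2))  # prints 3
s1.emit(StreamDoc(1, 3))  # prints 4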

etc. We use something similar, so it didn't take much time to post it here. (Don't worry, I'm not assuming this library is something we can depend on yet, but I'm hoping it will be! :-) )

Anyway, I'd be happy to hear thoughts, and I'd definitely like to hear your insight on potential shortcomings of this approach, since even just exchanging ideas helps us on our end. (We're working on better shaping the API for our experimental x-ray beamline.)