Open mrocklin opened 7 years ago
Before doing so, I would suggest normalizing the inputs/outputs. For example:
```python
class StreamDoc(dict):
    def __init__(self, *args, **kwargs):
        '''
        Normalizing inputs/outputs

        attributes : the metadata
        outputs : a dictionary of outputs
        argnames : a list of keys into the dictionary of outputs naming the arguments
        kwargnames : a list of keys into the dictionary of outputs naming the kwargs
        '''
        self['attributes'] = dict()
        self['outputs'] = dict(**kwargs)
        self['kwargnames'] = list(kwargs.keys())
        self['argnames'] = list()
        for i, arg in enumerate(args):
            key = "_arg{}".format(i)
            self['outputs'][key] = arg
            self['argnames'].append(key)
        # needed so stream methods can distinguish a StreamDoc
        self['_StreamDoc'] = 'StreamDoc v1.0'

    # ... add methods to select/add/remove inputs/outputs, modify attributes, etc.
```
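For concreteness, here is how the class above behaves (the class is repeated so the snippet runs standalone); positional arguments land in `outputs` under generated `_arg{i}` keys:

```python
class StreamDoc(dict):
    # class as sketched above, repeated so this snippet is self-contained
    def __init__(self, *args, **kwargs):
        self['attributes'] = dict()
        self['outputs'] = dict(**kwargs)
        self['kwargnames'] = list(kwargs.keys())
        self['argnames'] = list()
        for i, arg in enumerate(args):
            key = "_arg{}".format(i)
            self['outputs'][key] = arg
            self['argnames'].append(key)
        self['_StreamDoc'] = 'StreamDoc v1.0'

sd = StreamDoc(1, 2, scale=10)
print(sd['argnames'])    # -> ['_arg0', '_arg1']
print(sd['kwargnames'])  # -> ['scale']
print(sd['outputs'])     # positional and keyword values in one dictionary
```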
The reason I mention this is that I think it would couple with the collections idea. A list of things coming from a stream could be interpreted in a few ways: as a flat sequence of elements with sentinel values delimiting groups of arguments, or as a sequence of structured documents like the StreamDoc above.
For example, method 1 would yield something like:
```python
def add(a, b):
    return a + b

s1 = Stream()
s2 = s1.map(add)
s2.sink(print)  # attach the sink before emitting, or nothing is printed
# a flat stream, with 'stop' sentinels delimiting groups of arguments
s1.emit([1, 2, 'stop', 1, 3, 'stop'])
```
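Method 1 implies some machinery to chop the flat stream back into argument groups at the sentinels. A minimal sketch of that grouping (the helper name `group_on_sentinel` is hypothetical, not part of streamz):

```python
def group_on_sentinel(items, sentinel='stop'):
    """Yield tuples of elements delimited by a sentinel value.

    Hypothetical helper: it shows how method 1's flat stream could be
    chopped back into argument groups before mapping a function over them.
    """
    group = []
    for item in items:
        if item == sentinel:
            yield tuple(group)
            group = []
        else:
            group.append(item)

def add(a, b):
    return a + b

results = [add(*grp) for grp in group_on_sentinel([1, 2, 'stop', 1, 3, 'stop'])]
print(results)  # -> [3, 4]
```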
whereas method 2 would be something like:
```python
def add(a, b):
    return a + b

s1 = Stream()
s2 = s1.map(add)
s2.sink(print)  # attach the sink before emitting
# roughly...
s1.emit([StreamDoc(1, 2), StreamDoc(1, 3)])
```
If adding collections, was one of these two methods what you had in mind, or is there a better way? Just some thoughts. I'm inclined to go for method 2; we're currently using dask and have an object similar to it.
For method 2, putting the arguments into the function could be as simple as decorating the update
routines. Something like:
```python
from functools import wraps

def parse_streamdoc(name):
    def streamdoc_dec(f):
        @wraps(f)
        def f_new(x, **kwargs_additional):
            # unpack the positional arguments from the StreamDoc
            args = [x['outputs'][argkey] for argkey in x['argnames']]
            # same for the kwargs
            kwargs = {key: x['outputs'][key] for key in x['kwargnames']}
            kwargs.update(kwargs_additional)
            return f(*args, **kwargs)
        return f_new
    return streamdoc_dec
```
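With the two gaps in the sketch filled in (building `kwargs` from `kwargnames`, and returning `streamdoc_dec` rather than the misspelled `stream_dec`), the decorator can be exercised standalone; `StreamDoc` is repeated here so the snippet runs on its own:

```python
from functools import wraps

class StreamDoc(dict):
    # minimal re-definition of the class above, so this snippet is self-contained
    def __init__(self, *args, **kwargs):
        self['attributes'] = dict()
        self['outputs'] = dict(**kwargs)
        self['kwargnames'] = list(kwargs.keys())
        self['argnames'] = list()
        for i, arg in enumerate(args):
            key = "_arg{}".format(i)
            self['outputs'][key] = arg
            self['argnames'].append(key)
        self['_StreamDoc'] = 'StreamDoc v1.0'

def parse_streamdoc(name):
    def streamdoc_dec(f):
        @wraps(f)
        def f_new(x, **kwargs_additional):
            args = [x['outputs'][k] for k in x['argnames']]
            kwargs = {k: x['outputs'][k] for k in x['kwargnames']}
            kwargs.update(kwargs_additional)
            return f(*args, **kwargs)
        return f_new
    return streamdoc_dec

@parse_streamdoc("add")
def add(a, b):
    return a + b

print(add(StreamDoc(1, 2)))   # -> 3
print(add(StreamDoc(1, b=5)))  # -> 6
```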
etc. We use something similar, so it didn't take much time to post here. (Don't worry, I'm not assuming this library is something to depend on yet, but I'm hoping for it! :-) )
Anyway, I'd be happy to hear thoughts, and would definitely like your insight on potential shortcomings of this approach, since even just the ideas can help us on our end. (We're working on better shaping the API for our experimental x-ray beamline.)
We probably want a Stream variant that moves around not individual elements but batches or sequences of elements. We probably also want a Stream variant that moves around Pandas dataframes. Each of these would probably want a different API; for example, `map` on a batched stream would likely operate element-wise within each batch. However, each of these new collection-wise interfaces would probably want to compose with both the lower-level local and dask Stream objects.
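As a rough illustration (not the streamz API; this assumes a batch-of-lists representation), element-wise `map` over a batched stream could preserve batch boundaries like this:

```python
def batch_map(func, batches):
    """Apply func to every element within each batch, keeping batch
    boundaries intact. Hypothetical sketch, not a streamz function."""
    for batch in batches:
        yield [func(x) for x in batch]

out = list(batch_map(lambda x: x + 1, [[1, 2, 3], [4, 5]]))
print(out)  # -> [[2, 3, 4], [5, 6]]
```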
To that end maybe it makes sense to encapsulate a low-level stream within a user-level stream.
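A sketch of that encapsulation, using a toy `Stream` stand-in rather than the real streamz class: the user-level `BatchedStream` holds a low-level stream whose elements are whole batches, and expresses element-wise `map` as a batch-wise `map` on the inner stream.

```python
class Stream:
    """Tiny stand-in for the low-level stream (not the real streamz class)."""
    def __init__(self):
        self.downstreams = []

    def map(self, func):
        child = Stream()
        self.downstreams.append(lambda x: child.emit(func(x)))
        return child

    def sink(self, func):
        self.downstreams.append(func)

    def emit(self, x):
        for down in self.downstreams:
            down(x)

class BatchedStream:
    """User-level wrapper: elements of the inner stream are whole batches."""
    def __init__(self, stream=None):
        self.stream = stream if stream is not None else Stream()

    def map(self, func):
        # element-wise map, expressed as a batch-wise map on the inner stream
        return BatchedStream(self.stream.map(lambda batch: [func(x) for x in batch]))

    def sink(self, func):
        self.stream.sink(func)

    def emit(self, batch):
        self.stream.emit(batch)

results = []
bs = BatchedStream()
bs.map(lambda x: x * 2).sink(results.append)
bs.emit([1, 2, 3])
print(results)  # -> [[2, 4, 6]]
```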