pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

buffering tuple data #130

Open ejhumphrey opened 6 years ago

ejhumphrey commented 6 years ago

Currently, buffering only works for data that's structured as a dictionary, but I often need tuple output. This is equally a result of historical reasons and of being lazy when naming Keras outputs (you name the layer but the output gets the tag? :shrug:). That said, I don't think it'd be too hard to implement, and I don't think it'd come at enough of a runtime cost to rule it out. I generally do something like:

import numpy as np

def __stack_data(data):
    # Stack a list of samples (dicts or tuples) along a new leading axis.
    output = dict()
    sample = data[0]
    is_dict = hasattr(sample, 'keys')
    keys = sample.keys() if is_dict else range(len(sample))
    for key in keys:
        output[key] = np.array([x[key] for x in data])
    return output if is_dict else tuple(output.values())

which would yield the following:

>>> dict_data = [dict(x=np.ones(10), y=np.array(1)) for _ in range(5)]
>>> print(__stack_data(dict_data))
{'x': array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]), 'y': array([1, 1, 1, 1, 1])}

>>> iter_data = [(np.ones(10), np.array(1)) for _ in range(5)]
>>> print(__stack_data(iter_data))
(array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]), array([1, 1, 1, 1, 1]))

A quick check with timeit on some larger data (len(x)=128**2) says there's no observable difference in runtime.
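For reference, a benchmark along these lines could look like the sketch below. The shapes and repetition counts are illustrative, not taken from the issue, and the stacking function reproduces the logic above so the snippet is self-contained:

```python
# Hypothetical micro-benchmark comparing dict vs. tuple stacking.
import timeit
import numpy as np

def stack_data(data):
    # Same logic as __stack_data above, reproduced for self-containment.
    sample = data[0]
    is_dict = hasattr(sample, 'keys')
    keys = sample.keys() if is_dict else range(len(sample))
    output = {key: np.array([x[key] for x in data]) for key in keys}
    return output if is_dict else tuple(output.values())

dict_data = [dict(x=np.ones(128 ** 2)) for _ in range(16)]
tuple_data = [(np.ones(128 ** 2),) for _ in range(16)]

t_dict = timeit.timeit(lambda: stack_data(dict_data), number=100)
t_tuple = timeit.timeit(lambda: stack_data(tuple_data), number=100)
print(t_dict, t_tuple)  # both paths do the same per-key work, so timings should be close
```

The dict and tuple branches differ only in how keys are enumerated, which is why one would expect no observable runtime difference.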

thoughts?

bmcfee commented 6 years ago

I think it would be simpler, at the cost of some redundancy, to just have a separate function for buffering tuples. Presumably, a user should know when they're using dicts vs tuples, and be able to choose the appropriate function.

Side note: it's not safe to use sample.keys() as above: dictionary keys are not guaranteed to be sorted (or have consistent order) in all python versions, so you could run into inconsistencies.
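One minimal way to sidestep the ordering concern is to iterate keys in sorted order, so the stacked output is deterministic regardless of Python version or insertion order. This is a sketch of that idea, not pescador's API:

```python
# Deterministic stacking for dict samples: sort the keys before iterating.
import numpy as np

def stack_dicts(data):
    sample = data[0]
    return {key: np.array([x[key] for x in data])
            for key in sorted(sample.keys())}

# Keys come back in sorted order even though 'y' was inserted first.
batch = stack_dicts([dict(y=np.array(1), x=np.ones(3)) for _ in range(4)])
print(list(batch.keys()))  # ['x', 'y']
```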

cjacoby commented 6 years ago

An incomplete thought: what about namedtuples? They're nice because you can treat them sort of either way. Admittedly, namedtuples don't behave exactly like dicts, but they have a fixed field order, and you can get a dict back from them (._asdict()). Does it make sense to have our existing buffering functionality aware of namedtuples as an intermediate solution here? I suppose you could also just have a map function that turns namedtuples into dicts before the buffer function.
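Stacking over namedtuples could look something like the following sketch (the `Sample` type and `stack_namedtuples` name are illustrative, not part of pescador):

```python
# Namedtuples give a fixed field order (._fields) plus dict-like access
# (._asdict()), so one stacking function can serve both views.
from collections import namedtuple
import numpy as np

Sample = namedtuple('Sample', ['x', 'y'])

def stack_namedtuples(data):
    # Stack each field across samples, preserving the declared field order.
    cls = type(data[0])
    return cls(*(np.array([getattr(s, field) for s in data])
                 for field in cls._fields))

data = [Sample(x=np.ones(10), y=np.array(1)) for _ in range(5)]
batch = stack_namedtuples(data)
print(batch.x.shape, batch.y.shape)  # (5, 10) (5,)
```

The result is itself a `Sample`, so downstream code can unpack it positionally or call `._asdict()` to recover the dict view.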

Related Side note: in python 3.7 there will also be data classes: https://www.python.org/dev/peps/pep-0557/

I feel conflicted here: on one hand, I agree with Brian that it might be better to just have separate functions. On the other, the main __stack_data function could be made smarter, though that probably gets unnecessarily complicated, so maybe bmcfee's solution is best for now.

This could be combined with the change for #127 as part of general improvements to buffering.


ejhumphrey commented 6 years ago

hrm, good call on the unsorted keys, def forgot about that. NamedTuples might give us what we want. I also suppose that incorporating #129 into core and potentially wrapping buffer_batch and tuples with this solves the issue for me. That could look something like...

import numpy as np
import pescador

@pescador.streamable
def some_data(value):
    while True:
        yield dict(x=value * np.ones(10), y=np.array(value))

stream = pescador.mux.StochasticMux(
    [some_data(v) for v in range(50)],
    n_active=5, rate=4)
batches = pescador.buffer_batch(stream, 10)
tuples = pescador.tuples(batches, 'x', 'y')
for x, y in tuples.iterate(max_iter=3):
    pass  # do something...
which works for me.

bmcfee commented 5 years ago

Having mulled this over, I'm conflicted. I like being as type- and data-agnostic as possible to keep pescador's functionality simple, but this makes it difficult to reason about what map behavior should be and what makes sense to include in the library.

Categorically, I think tuples should only be produced as last-stage processing of streams, and we shouldn't bend over backward to support manipulation of tuple data.

This raises the question of what kind of data we should support mid-stream. Any kind of object? Only containers? Only dict-like containers? Right now we (implicitly) rely on a dict-of-ndarray pattern, but it would be great if we could relax that or make it more explicit / adaptable somehow. I think dataclasses show some promise for this, but it's probably not a great idea to lean on them too heavily at this point since they're still new.

This is all a long-winded way of saying that I don't think we should get into the business of explicitly adding support for buffering tuples. If you want that effect, use dicts for now and buffer at the end.
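A pure-Python sketch of that recommended pattern, independent of pescador's actual API (the helper names here are illustrative): keep samples as dicts through the stream, buffer them into batches, and unpack to tuples only at the final stage.

```python
# Keep dicts mid-stream; convert to tuples only as last-stage processing.
import numpy as np

def buffer_dicts(stream, n):
    """Collect n dict samples and stack each key into an ndarray."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == n:
            yield {k: np.array([s[k] for s in buf]) for k in buf[0]}
            buf = []

def to_tuples(stream, *keys):
    # Final stage: pull named fields out in an explicit, fixed order.
    for batch in stream:
        yield tuple(batch[k] for k in keys)

samples = (dict(x=np.ones(10), y=np.array(i)) for i in range(8))
for x, y in to_tuples(buffer_dicts(samples, 4), 'x', 'y'):
    print(x.shape, y.shape)  # (4, 10) (4,)
```

Because the key order is spelled out at the conversion step, the ordering concern raised earlier in the thread never touches the buffered representation.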