ejhumphrey opened this issue 6 years ago
I think it would be simpler, at the cost of some redundancy, to just have a separate function for buffering tuples. Presumably, a user should know when they're using dicts vs tuples, and be able to choose the appropriate function.
Side note: it's not safe to use sample.keys() as above: dictionary keys are not guaranteed to be sorted (or to have a consistent order) in all Python versions, so you could run into inconsistencies.
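For concreteness, a minimal sketch of the failure mode and the fix (the sample variable and key names here are just illustrative):

import numpy as np

sample = {'y': np.array(1), 'x': np.ones(10)}

# Fragile: relies on dict iteration order, which older Pythons don't guarantee
# values = tuple(sample[k] for k in sample.keys())

# Deterministic: fix the key order explicitly
keys = ('x', 'y')
values = tuple(sample[k] for k in keys)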
An incomplete thought: what about namedtuples? namedtuples are nice because you can treat them sort of either way. Admittedly, they don't exactly behave like dicts, but they have order, and you can get a dict back from them (._asdict()). Does it make any sense to have our existing buffering functionality aware of namedtuples as an intermediate solution here? I suppose you could also just have a map function that turns namedtuples into dicts before the buffer function (sketched below).
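A rough sketch of that map-then-buffer idea, assuming a hypothetical Sample namedtuple; the adapter name as_dicts is illustrative, not proposed API:

from collections import namedtuple

Sample = namedtuple('Sample', ['x', 'y'])  # hypothetical payload type

def as_dicts(stream):
    # Adapter: turn namedtuple samples into dicts so the existing
    # dict-based buffering can consume them unchanged.
    for sample in stream:
        yield sample._asdict()

# e.g.: buffered = pescador.maps.buffer_stream(as_dicts(stream), 10)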
Related side note: in Python 3.7 there will also be data classes: https://www.python.org/dev/peps/pep-0557/
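For reference, a minimal sketch of what a dataclass payload could look like (purely illustrative; nothing in pescador assumes this):

from dataclasses import dataclass, asdict  # Python 3.7+
import numpy as np

@dataclass
class Sample:
    x: np.ndarray
    y: np.ndarray

# Like namedtuples, dataclasses round-trip to dicts:
record = asdict(Sample(x=np.ones(10), y=np.array(1)))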
I feel conflicted here; on one hand I agree with Brian that it might be better to just have separate functions. But the main __stack_data function could also be smarter, though I think that just gets unnecessarily complicated, so maybe bmcfee's solution is best for now.
This could be combined with the change for #127: general improvements for buffering.
hrm, good call on the unsorted keys, def forgot about that. NamedTuples might give us what we want. I also suppose that incorporating #129 into core and potentially wrapping buffer_batch and tuples with this solves the issue for me. That could look something like...
import numpy as np
import pescador

@pescador.streamable
def some_data(value):
    while True:
        yield dict(x=value * np.ones(10), y=np.array(value))

stream = pescador.mux.StochasticMux(
    [some_data(v) for v in range(50)],
    n_active=5, rate=4)

batches = pescador.buffer_batch(stream, 10)
tuples = pescador.tuples(batches, 'x', 'y')

for x, y in tuples.iterate(max_iter=3):
    pass  # do something with each (x, y) batch...
which works for me.
Having mulled this over, I'm conflicted. I like being as type- and data-agnostic as possible to keep pescador's functionality simple, but this makes it difficult to reason about what map behavior should be and what makes sense to include in the library.
Categorically, I think tuples should only be produced as last-stage processing of streams, and we shouldn't bend over backward to support manipulation of tuple data.
This raises the question of what kind of data we should support mid-stream. Any kind of object? Only containers? Only dict-like containers? Right now we (implicitly) rely on a dict-of-ndarray pattern, but it would be great if we could relax that or make it more explicit / adaptable somehow. I think dataclasses show some promise for this, but it's probably not a great idea to lean on them too heavily at this point since they're still new.
This is all a long-winded way of saying that I don't think we should get into the business of explicitly adding support for buffering tuples. If you want that effect, use dicts for now and buffer at the end.
Currently, buffering only flies for data that's structured as a dictionary. As a result, I generally do something like:
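The snippet that followed didn't survive here; as a hedged reconstruction, the workaround being described is presumably along these lines (buffer dict samples, then unpack to a tuple at the end; all names are illustrative):

import numpy as np
from pescador.maps import buffer_stream

def dict_stream():
    # Samples are dicts of ndarrays -- the pattern buffering expects.
    while True:
        yield dict(x=np.ones(10), y=np.array(1))

# Buffer as dicts, then manually unpack each buffered batch to a tuple.
pairs = ((batch['x'], batch['y']) for batch in buffer_stream(dict_stream(), 10))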
This is equally a result of historical reasons and of being lazy when naming Keras outputs (you name the layer, but the output gets the tag? :shrug:). That said, I don't think it'd be too hard to implement, and I don't think it'd come at enough of a runtime cost to rule it out:
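The implementation snippet is also missing from the thread; as a stand-in, here is a hedged guess at what tuple-aware buffering could look like (buffer_tuples is a hypothetical name, not the author's actual code):

import numpy as np

def buffer_tuples(stream, buffer_size):
    # Accumulate tuple samples, then stack position-wise so each
    # yielded item is one tuple of batched ndarrays.
    buffer = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            yield tuple(np.stack(field) for field in zip(*buffer))
            buffer = []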
which would yield the following:
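The original output isn't preserved either; continuing the hypothetical sketch above, the yielded batches would look roughly like this:

def tuple_stream():
    # Toy stream of (x, y) tuple samples, illustrative only.
    while True:
        yield (np.ones(10), np.array(1))

for x_batch, y_batch in buffer_tuples(tuple_stream(), 10):
    print(x_batch.shape, y_batch.shape)  # -> (10, 10) (10,)
    break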
A quick check with timeit on some larger data (len(x)=128**2) says there's no observable difference in runtime. Thoughts?