ejhumphrey closed this issue 7 years ago
Yup, I agree, and it's true.
I ran into this while making some examples. (That one, for instance). My solution up to now has been to say in the docs "make sure you yield your samples with the sample dimension first!", putting the responsibility on the user.
A problem here is that streamers (even non-buffered ones!) are allowed to produce multiple "samples" in a "batch", e.g.:
```python
def sample_gen(n_samples=10):
    while True:
        yield {"X": np.random.random((n_samples, 1))}

streamer = pescador.Streamer(sample_gen, 100)

for batch in streamer.generate():
    # batch['X'].shape == (100, 1)
    pass
```
Additionally, nowhere have we specified that streamers must produce samples with an initial sample-index dimension. This is valid, under current design:
```python
def sample_gen():
    while True:
        yield {"X": np.random.random()}

streamer = pescador.Streamer(sample_gen)

for sample in streamer.generate():
    # sample['X'].shape == ()
    # do something with the sample
    pass
```
However, you could not buffer this streamer correctly. As far as I recall, we have no defined way to handle this.
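For illustration, buffering dimensionless samples like these requires explicitly stacking them along a new leading sample axis. A minimal sketch in plain NumPy (the `buffer_samples` helper is hypothetical, not part of pescador's API):

```python
import numpy as np


def sample_gen():
    # Yields dimensionless samples, as in the example above.
    while True:
        yield {"X": np.random.random()}


def buffer_samples(gen, buffer_size):
    # Hypothetical helper: collect `buffer_size` samples, then stack them
    # along a new leading axis so the result has a sample dimension.
    buf = []
    for sample in gen:
        buf.append(sample)
        if len(buf) == buffer_size:
            yield {"X": np.stack([s["X"] for s in buf])}
            buf = []


batch = next(buffer_samples(sample_gen(), 16))
print(batch["X"].shape)  # (16,)
```

The point is that some component has to own the "add a sample axis" step; the scalar samples themselves carry no batch dimension to concatenate along.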
So I'm starting to think this is fundamentally related to #75 ... and that the answer is to add functionality to core.py that defines and/or validates sample-ness and batch-ness. This could be similar to mir_eval's consistent use of the "intervals", "boundaries" and "samples" concepts.
I generally agree, although perhaps let's pass that discussion back over to #75 to keep it in one place.
Ha, I was going to migrate it here, but that's fine.
I will say though, related to this comment, my current thinking is that `buffer_batch`, while agnostic to what you feed it, should be responsible for adding the sample dimension to an array of items. The design of / relationship between `buffer_batch` and `__split_batches` feels wrong, probably partly because of how we've defined (or rather, haven't defined) these concepts.
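To make that concrete, here is a hedged sketch of what it could look like for the buffering routine, rather than the data sampler, to own the sample dimension (function names are illustrative, not pescador's actual implementation):

```python
import numpy as np


def buffer_stream(stream, buffer_size):
    # Illustrative buffering routine: it, not the sampler, adds the leading
    # sample axis, so incoming items of any rank can be batched uniformly.
    buf = []
    for data in stream:
        # Promote each field with a new leading axis before accumulating.
        buf.append({k: np.asarray(v)[np.newaxis] for k, v in data.items()})
        if len(buf) == buffer_size:
            keys = buf[0].keys()
            yield {k: np.concatenate([d[k] for d in buf]) for k in keys}
            buf = []


def gen():
    # Rank-1 samples with no pre-allocated batch axis.
    while True:
        yield {"X": np.random.random((5,))}


batch = next(buffer_stream(gen(), 8))
print(batch["X"].shape)  # (8, 5)
```

Under this design, samplers can yield bare observations and the buffer is the single place where batch semantics are imposed.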
closed with #88
I'm having trouble buffering samples into batches, and I think (think?) that it's due to a change in who is responsible for incrementing the rank of a tensor for batching (the buffer routine or the data sampler).
I don't think we currently have a good answer for this. The keras example produces tensors with a pre-allocated singleton dimension for the sample index. I think this is because the `concatenate` call in `__split_batches` used to be `array`. I have more thoughts but must run now.