pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

Best-practices for incrementing tensor rank for batch buffering #76

Closed by ejhumphrey 7 years ago

ejhumphrey commented 7 years ago

I'm having trouble buffering samples into batches, and I think (think?) that it's due to a change in who is responsible for incrementing the rank of a tensor for batching (the buffer routine or the data sampler).

I don't think we currently have a good answer for this. The keras example produces tensors with a pre-allocated singleton dimension for the sample index. I think this is because the concatenate call in __split_batches used to be a plain array call (which adds the leading dimension implicitly).
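
For concreteness, here's a minimal sketch of what I mean by the pre-allocated singleton dimension (illustration only, not pescador's actual buffering code): if each sample already carries a leading axis of length 1, batching is just a concatenate along axis 0.

import numpy as np

# Each "sample" is yielded with a leading singleton axis for the sample index.
samples = [{"X": np.random.random((1, 5))} for _ in range(4)]

# A buffer can then batch by concatenating along that axis.
batch = {"X": np.concatenate([s["X"] for s in samples], axis=0)}
# batch["X"].shape == (4, 5)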

I have more thoughts but must run now.

cjacoby commented 7 years ago

Yup, I agree, and it's true.

I ran into this while making some examples. (That one, for instance). My solution up to now has been to say in the docs "make sure you yield your samples with the sample dimension first!", putting the responsibility on the user.

A problem here is that streamers (even non-buffered ones!) are allowed to produce multiple "samples" in a "batch", i.e.:

import numpy as np
import pescador

def sample_gen(n_samples=10):
    while True:
        yield {"X": np.random.random((n_samples, 1))}

streamer = pescador.Streamer(sample_gen, 100)

for batch in streamer.generate():
    # batch['X'].shape == (100, 1)
    break

Additionally, nowhere have we specified that streamers must produce samples with an initial sample-index dimension. This is valid under the current design:

def sample_gen():
    while True:
        yield {"X": np.random.random()}

streamer = pescador.Streamer(sample_gen)

for sample in streamer.generate():
    # sample['X'] is a scalar here; there is no sample dimension.
    # do something with the sample
    break

However, you could not Buffer this streamer correctly. To my memory, we have no defined way of handling this right now.
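
To make the failure concrete (a sketch, not pescador's buffering code): naively concatenating scalar samples blows up, so the buffer would have to add the sample dimension itself.

import numpy as np

scalars = [np.asarray(np.random.random()) for _ in range(4)]

# Naive batching fails: zero-dimensional arrays cannot be concatenated.
# np.concatenate(scalars)  # raises ValueError

# The buffer would have to add the sample axis itself, e.g.:
batch = np.stack(scalars)  # shape (4,)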

ejhumphrey commented 7 years ago

so I'm starting to think this is fundamentally related to #75 ... and that the answer is to add functionality to core.py that defines and/or validates sample-ness and batch-ness. This could be similar to mir_eval's consistent use of the "intervals", "boundaries" and "samples" concepts.
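
Roughly what I have in mind (names and signature hypothetical, just sketching the idea, not an existing pescador API):

import numpy as np

def validate_batch(batch):
    # Hypothetical check that a dict of arrays is "batch-like": every value
    # is an ndarray with a leading sample axis, and all lengths agree.
    if not isinstance(batch, dict) or not batch:
        raise ValueError("A batch must be a non-empty dict of arrays")
    lengths = set()
    for key, value in batch.items():
        if not isinstance(value, np.ndarray) or value.ndim < 1:
            raise ValueError("Field '{}' must be an ndarray with a sample axis".format(key))
        lengths.add(value.shape[0])
    if len(lengths) != 1:
        raise ValueError("All fields must agree on the number of samples")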

cjacoby commented 7 years ago

I generally agree, although perhaps let's pass that discussion back over to #75 to keep it in one place.

ejhumphrey commented 7 years ago

Ha, I was going to migrate it here, but that's fine.

I will say, though, related to this comment: my current thinking is that buffer_batch, while agnostic to what you feed it, should be responsible for adding the sample dimension to an array of items. The design of, and relationship between, buffer_batch and __split_batches feels wrong, probably in part because of how we've defined (or rather, haven't defined) these concepts.
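
Concretely, the behavior I'm picturing (a sketch only, with a hypothetical helper name, not the current buffer_batch implementation) is that the buffer stacks whatever items it receives and adds the sample axis itself, so it works for scalars and arrays alike:

import numpy as np

def buffer_samples(samples):
    # Hypothetical sketch: collect per-sample dicts and add the sample axis
    # in the buffer, so generators don't have to yield a leading singleton dim.
    keys = samples[0].keys()
    return {key: np.stack([np.asarray(s[key]) for s in samples]) for key in keys}

# Scalars buffer to shape (n,), samples of shape (d,) buffer to (n, d), etc.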

ejhumphrey commented 7 years ago

Closed with #88.