hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

Question regarding batch_size and video_chunks_length #102

Open mkanakis opened 4 years ago

mkanakis commented 4 years ago

Hi,

First of all great work! I really enjoyed the paper. And thanks for providing the implementation details publicly!

I am trying to reuse your implementation in a slightly different context as a student project and I came here reading the issues because I have some questions as well.

You mention here: #62 that video chunks are bounded in the range [20, 100]. Does that mean you train with a variable batch size? Also, the batch_size in the code is defined as 100.

I have two concerns about this:

  1. Isn't it the case that when you train with fewer samples than the specified batch_size, you are actually putting more weight on those particular states?
  2. Reading here, and I quote:

The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function.

Do you think that in your case the batch size affected how well the model generalizes? I was also reading somewhere that

Training with large minibatches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use minibatches larger than 32. https://arxiv.org/abs/1804.07612 - Yann LeCun

That is quite a bit smaller than the 100 defined in the code now.

Thanks in advance, Marios.

hongzimao commented 4 years ago

Hi Marios,

Thanks for your interest in our work! Just to make sure we are talking about the same thing: IIRC, each policy gradient step takes in num_parallel_workers * num_mdp_steps (state, action, reward) samples. The num_mdp_steps is bounded within [20, 100] by the video_chunks you mentioned.
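
To make the arithmetic concrete, here is a minimal sketch of that bound; NUM_PARALLEL_WORKERS = 16 is a hypothetical value chosen for illustration, not necessarily the exact setting in the repo:

```python
NUM_PARALLEL_WORKERS = 16   # hypothetical value, for illustration only
MIN_VIDEO_CHUNKS = 20       # lower bound on chunks per video (from #62)
MAX_VIDEO_CHUNKS = 100      # upper bound on chunks per video

min_batch = NUM_PARALLEL_WORKERS * MIN_VIDEO_CHUNKS   # 320 samples
max_batch = NUM_PARALLEL_WORKERS * MAX_VIDEO_CHUNKS   # 1600 samples
print(f"samples per gradient step: {min_batch}..{max_batch}")
```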

That said, the "batch size" here would definitely be larger than 32, the number above which "friends won't let friends" go :) Two takes from me: (1) we didn't tune those parameters extensively, and systematically tuning them might give better results -- others have reported better generalization and other performance improvements many times. (2) I don't quite see a problem with a large batch size, other than slower training. A large batch typically gives a more accurate gradient estimate, so for the same number of gradient steps it should generally result in more stable training. On the other hand, if we compare training performance given the same amount of compute and the same amount of time, different batch sizes tell a different story, and you might want to tune the parameter empirically to see the outcome.
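
As a toy illustration of point (2) -- a simple squared loss, not pensieve's actual objective -- the standard deviation of a minibatch gradient estimate shrinks roughly as 1/sqrt(batch_size):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 1.0  # current parameter value

def batch_grad(batch_size):
    """Minibatch gradient of the mean loss (w - x)^2 with x ~ N(0, 1)."""
    x = rng.normal(size=batch_size)
    return np.mean(2.0 * (w - x))   # true expected gradient is 2.0

for b in (32, 100, 1600):
    grads = np.array([batch_grad(b) for _ in range(2000)])
    print(f"batch={b:5d}  grad-estimate std={grads.std():.3f}")  # shrinks ~ 1/sqrt(b)
```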

Hope these help!

mkanakis commented 4 years ago

Thanks for the quick reply @hongzimao,

As I understand it, you left fine-tuning and tweaking to future work, which is completely fine. I can also see your take on (2), and I guess it depends on the perspective.

However, what is still not quite clear to me is the following:

You did train on multiple videos, right? Then, I assume from the code in agent.py, for example, that each video represents an epoch.

So in that case, the bounded video_chunks per video could be, for example, any value in the [20, 100] range. In all of those cases where fewer than 100 chunks are collected, the end_of_video has triggered an update step with batch_size < 100.

What I am trying to get at is that every training step, or epoch, has had a variable batch_size rather than a constant, fixed size.

Is that correct? Or am I missing something? And if it is correct, do you know if it affects training in any way?

My question stems from the fact that I have never seen anything similar before in training.

Kind regards, Marios.

hongzimao commented 4 years ago

Your understanding is correct. Different workers can contribute experience data of different lengths. To compute the gradient, what we operationally did was pad (so everything has the same shape and forms a rectangular batch) -- the reward and value prediction data were initialized as zero vectors, so the advantage term for the missing indices is 0 and the corresponding gradients are zeroed out.
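
A minimal sketch of that padding idea, using hypothetical helper names rather than the exact pensieve code, just to show why zero-padding zeroes out the advantage at the missing indices:

```python
import numpy as np

MAX_LEN = 100   # upper bound on video chunks per episode
GAMMA = 0.99    # discount factor (illustrative)

def pad(sequences, max_len=MAX_LEN):
    """Zero-pad a list of 1-D arrays of varying length into a rectangular batch."""
    out = np.zeros((len(sequences), max_len))
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = seq
    return out

def discounted_returns(rewards):
    """Backward pass; trailing zero padding contributes nothing to earlier steps."""
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[0])
    for t in range(rewards.shape[1] - 1, -1, -1):
        running = rewards[:, t] + GAMMA * running
        returns[:, t] = running
    return returns

# rewards/values collected by three workers with different episode lengths
rewards = pad([np.ones(64), np.ones(100), np.ones(37)])
values = pad([np.zeros(64), np.zeros(100), np.zeros(37)])   # dummy critic outputs

advantages = discounted_returns(rewards) - values
# At padded indices both the returns and the values are 0, so the advantage --
# and therefore the policy-gradient contribution -- is exactly 0 there.
```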

This adds some noise to the loss computation at each step. Although we didn't do this, we should have computed an average loss over only the data that exists (i.e., divide the loss by 64 if the sequence length is 64, by 100 for length 100, etc.). This would make the loss calculation unbiased regardless of the input size. However, variable lengths can still make the variance of the loss differ across steps, which might affect the stability of training.
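
A sketch of that length-normalized loss, with illustrative names (generic masking code, not the repo's implementation):

```python
import numpy as np

def masked_mean_loss(per_step_loss, lengths):
    """Average each trajectory's loss over only the steps that actually exist.

    per_step_loss: (num_workers, max_len) array, zero at padded indices.
    lengths: actual episode lengths, e.g. [64, 100, 37].
    """
    lengths = np.asarray(lengths)
    max_len = per_step_loss.shape[1]
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    per_traj = (per_step_loss * mask).sum(axis=1) / lengths  # divide by 64, 100, ...
    return per_traj.mean()
```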

It would be interesting to come up with training methods (e.g., adapting the learning rate, or the momentum and other optimizer parameters) that combat the variance introduced by variable-length training data. In fact, the variance can also stem from different data types (e.g., network traces with different bandwidth), even when the experience lengths are the same. Taking this kind of auxiliary information into account may make training more stable. It would be an interesting research direction to pursue, I think.

mkanakis commented 4 years ago

Thanks for replying,

So indeed that confirms my initial idea, glad to know!

Kind regards, Marios.