0dB opened 1 year ago
I'm not necessarily an expert but I have some intuition for why this might happen: Let's say you have a very small batch size. This means there are only very few samples in your batch. Now you basically start optimizing per batch. But since there is only little entropy in each batch, there is high variance and hence your training procedure will not converge nicely. Does that make sense?
PS: This image somewhat summarizes it:
Hm, from looking at the code, I think that in nanoGPT `batch_size` is actually the number of batches and not the size of the batches; instead, `block_size` seems to be the size of the batches (how much data is in a batch). If I am right, the explanation you give would make sense for `block_size`. This is a bit confusing, I think. Do you read the function `get_batch()` in `train.py` like that too?
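For reference, here is a simplified sketch of how I read `get_batch()` in `train.py` (the name `get_batch_sketch` and the explicit arguments are mine; the real function reads `train_data`/`val_data` from the surrounding script and also handles device placement and memory pinning). The stacking suggests that `batch_size` is the number of chunks drawn per call and `block_size` the length of each chunk:

```python
import numpy as np
import torch

def get_batch_sketch(data, batch_size, block_size):
    """Simplified paraphrase of get_batch(); data is a 1-D numpy array of token ids."""
    # pick batch_size random starting offsets into the token stream
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x: batch_size chunks of block_size tokens each -> shape [batch_size, block_size]
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix.tolist()])
    # y: the same chunks shifted right by one token (the next-token targets)
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix.tolist()])
    return x, y
```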
You might want to watch https://youtu.be/kCc8FmEb1nY?t=867. This might clarify a few things.
Right, that is the part of the videos I am referring to in my original question. Andrej says the blocks (batches) are "completely independent" and "don't talk to each other", but this is not the experience I have and I am wondering what is going on…
"The different batches don't talk to each other" means that the model parameters are optimized per batch.
- `block_size` ^= context_length ^= length of a training chunk
- `batch_size` ^= number of blocks (a.k.a. chunks) per training batch
Revisiting the example in the video: if we have `block_size = 8` and `batch_size = 4`, this actually makes up 8 training samples per block and 32 = 4 * 8 training samples per batch. The reason for this is that the transformer, even without batching, will process `block_size` samples in parallel (i.e., all the substrings `block[0:i]` for `i in range(1, 9)`, with targets `block[i]`).
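To make the counting concrete, here is a small standalone sketch (along the lines of the example in the video, with made-up token ids) of how a single block of `block_size = 8` tokens yields 8 (context, target) training samples:

```python
import torch

block_size = 8
# one chunk of block_size + 1 token ids (made-up values, purely for illustration)
block = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])

x = block[:block_size]       # inputs
y = block[1:block_size + 1]  # targets, shifted by one position
for i in range(1, block_size + 1):
    context = x[:i]
    target = y[i - 1]
    print(f"when input is {context.tolist()} the target is {target.item()}")
```

With `batch_size = 4` such blocks per batch, that gives the 32 samples per batch mentioned above.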
(Note that there is also the more advanced method of micro-batching (a.k.a. averaging over multiple batches) if you are running on multiple GPUs, but let's not look at that for now for the sake of simplicity.)
Let me get back to you on that after I do some more code checking. Just one quick remark on what you wrote, and one more quote from Andrej:

> The reason for this is that the transformer even without batching will process block_size samples in parallel (i.e., all the substrings block[0:i] for i in range(1,9), with targets block[i]).
Right, but what you are mentioning here is the T dimension (context size, i.e. size of the "window", i.e. size of one chunk, as you wrote), but I think the answer lies somewhere in how the B dimension (number of blocks per batch) is handled. And there is already one clear case where the batches are not independent: as Andrej explains in the video and as can be seen in the function `estimate_loss()`, the loss is averaged over all batches. So that is one thing where the blocks are connected after all, but I don't know yet if that can explain the effect I am seeing.
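For reference, `estimate_loss()` in `train.py` looks roughly like this (my simplification, with the mixed-precision context manager left out; `eval_iters`, `get_batch`, and `model` come from the surrounding script):

```python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)      # one [B, T] batch per iteration
            logits, loss = model(X, Y)   # scalar loss, already averaged over B * T
            losses[k] = loss.item()
        out[split] = losses.mean()       # then averaged again over eval_iters batches
    model.train()
    return out
```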
And, another tangent on my original question: Andrej says in the video that the batches (blocks) are only there so that the GPU is kept busy, so to me that would imply that convergence should not depend on whether I run 32 blocks at once in one batch, or one block per batch (so, in this example, the same 32 blocks run in 32 batches, one per batch). Would you say so too?
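One way to probe that reasoning is with a toy comparison (plain SGD and a tiny made-up model, nothing to do with nanoGPT): one optimizer step on a batch of 32 samples versus 32 steps of one sample each:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(4, 2)
model_b = torch.nn.Linear(4, 2)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

x = torch.randn(32, 4)
y = torch.randint(0, 2, (32,))

# (a) one step on the whole batch; the loss is the mean over all 32 samples
loss = F.cross_entropy(model_a(x), y)
opt_a.zero_grad()
loss.backward()
opt_a.step()

# (b) 32 steps, one sample per step; the parameters change between steps
for i in range(32):
    loss = F.cross_entropy(model_b(x[i:i + 1]), y[i:i + 1])
    opt_b.zero_grad()
    loss.backward()
    opt_b.step()

# the resulting parameters generally differ, so the two schedules
# need not behave the same (even more so with Adam or a LR schedule)
print(torch.allclose(model_a.weight, model_b.weight))
```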
But I indeed see a difference regarding convergence when I change `batch_size` (i.e. the number of blocks per batch). I ran a few dozen tests over the course of a few weeks, all running for hours, on a 5 MB Shakespeare data set (all the plays, not the 1 MB data set Andrej used), and while tuning the hyperparameters I noticed this effect and zeroed in on it. I got to the point where changing `batch_size` by one went from overfitting with 16 to underfitting with 17, already clear after about 4000 of the planned 40000 steps.
I will study the code some more.
I think you might wanna look at the point at which the backprop optimization actually changes the parameters. This is done after each batch.
(Also note that batch != block.)
I am still baffled. I still only see one factor, which is that the loss is calculated over the B × T dimensions (`batch_size` * `block_size`), but changing B by one should not have the effect that it does (in my case going from 16 to 17 and then going from overfitting to massively underfitting), I would think, because when picking a slightly bigger B I am essentially also just reducing noise and shouldn't get a "weaker model". Could this have to do with the fixed seed, and that the one extra block that is then included randomly adds features that accidentally destroy things? But why would that happen for a whole bunch of batches in a row? That is hard to believe.
This is the code that I peeled out, all the function calls that seem to be relevant:
In `train.py`:

```python
X, Y = get_batch('train') # X and Y are both of shape [B, T]

# forward pass. loss is scalar, logits are [B, T, C], C is vocab size, for me currently [16, 512, 50304]
# so loss is calculated over [B, T]; see other comment
# effectively calling GPT.forward() from model.py
logits, loss = model(X, Y)

# set the gradients. For my setup (no GradScaler since bfloat16 and not float16, and mps instead of cuda),
# this is effectively loss.backward()
scaler.scale(loss).backward()

# apply the gradients. The next two steps are effectively optimizer.step().
# All parameters that have a gradient set are updated.
scaler.step(optimizer)
scaler.update()

# reset the gradients so PyTorch does not accumulate them
optimizer.zero_grad(set_to_none=True)
```
In `model.py`, in `GPT.forward()`:

```python
# view on logits is [B * T, C], in my case [8192, 50304]
# loss is determined over B * T, which was already clear from the result in train.py
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
```
(BTW in a previous post I pointed out the wrong code concerning the loss being evaluated over all blocks in a batch. That code was for the `eval_interval`, where the model is in eval mode and evaluates over all `eval_iters` batches. But indeed the loss that I am looking at is also determined over all of [B, T].)
Edits: I just looked up the documentation for `cross_entropy()` and see that it defaults to `mean`, so no issue there either. And I wonder what `ignore_index=-1` does, since the default is `-100` (see also #297). BTW `targets.view(-1).shape` is `[B * T]`.
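To double-check the `reduction='mean'` default and the `ignore_index` behaviour, here is a small standalone experiment (toy shapes, not the real ones):

```python
import torch
import torch.nn.functional as F

# toy case: B * T = 4 positions, C = 5 classes
logits = torch.randn(4, 5)
targets = torch.tensor([1, 3, 0, -1])  # last position flagged with ignore_index

# reduction defaults to 'mean': a scalar averaged over the non-ignored positions
loss = F.cross_entropy(logits, targets, ignore_index=-1)

# the same value computed by hand from the per-position losses
per_pos = F.cross_entropy(logits, targets, ignore_index=-1, reduction='none')
manual = per_pos[targets != -1].mean()
print(loss.item(), manual.item())  # the two should match
```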
Wow, in this case I am also pretty much out of explanations. Of course you can try to run with a different seed. Or maybe batch size has to be a multiple of something? I don't know.
@0dB is there any explanation that you found for batch size affecting convergence? For me, in general, it is not clear what batch size changes except for faster computation on the GPU (MPS in my case).
It is my understanding from the videos that batch size should have no influence on convergence. But I have cases where increasing the batch size will lead to underfitting, or where decreasing the batch size will lead to overfitting. Is there an explanation for this? Am I getting something wrong? (AFAIK batch normalization could have this effect, but not layer normalization.)