I'm not sure this works in all cases; I'll have to think about it closely. The potential issue is that, because of pipeline parallelism, all micro-batches pipelined as part of a single training step need to be exactly the same-sized tensors (a Deepspeed requirement, since it reuses buffers and such). The DistributedBatchSampler builds global batches sized so that it returns the indices needed by each data-parallel instance, and then the code further slices those into the micro-batches. There's a multiplier passed in so that these global batches can always be evenly divided.
I always assumed that fixing this properly would require padding the incomplete batch with some duplicated examples to ensure that things are always a fixed size. Let me go through your other PRs first then I'll come back to this one.
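To make the constraint concrete, here's a minimal sketch (hypothetical names, not the project's actual sampler code) of slicing a global batch of indices into micro-batches and enforcing the even-division requirement:

```python
# Minimal sketch (hypothetical, not the actual DistributedBatchSampler code):
# a global batch of example indices is sliced into gradient_accumulation_steps
# micro-batches, and the pipeline engine expects every micro-batch to have
# identical shape, so an uneven split is an error.

def slice_into_micro_batches(global_batch_indices, gradient_accumulation_steps):
    """Split a global batch evenly into micro-batches, or fail loudly."""
    total = len(global_batch_indices)
    if total % gradient_accumulation_steps != 0:
        raise ValueError(
            f"global batch of {total} examples cannot be split evenly into "
            f"{gradient_accumulation_steps} micro-batches"
        )
    micro_batch_size = total // gradient_accumulation_steps
    return [
        global_batch_indices[i : i + micro_batch_size]
        for i in range(0, total, micro_batch_size)
    ]

# Complete global batch: 4 examples, GAS=2 -> two micro-batches of 2.
print(slice_into_micro_batches([0, 1, 2, 3], 2))  # [[0, 1], [2, 3]]
# Incomplete last global batch: 3 examples, GAS=2 -> cannot split evenly.
# slice_into_micro_batches([4, 5, 6], 2)          # raises ValueError
```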
Hm, I see! For what it's worth, my training session is not using the max tokens for each batch at all, and it seems to be working. E.g.
before GAS splitting, batch size: 1, total tokens: 1664
[2024-05-10 11:39:52,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=1299, skipped=0, lr=[5e-05], mom=[(0.9, 0.99)]
steps: 1299 loss: 1.6742 iter time (s): 16.656 samples/sec: 0.060
before GAS splitting, batch size: 2, total tokens: 4992
[2024-05-10 11:40:30,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=0, lr=[5e-05], mom=[(0.9, 0.99)]
steps: 1300 loss: 1.9591 iter time (s): 36.801 samples/sec: 0.027
before GAS splitting, batch size: 1, total tokens: 1664
[2024-05-10 11:40:47,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=1301, skipped=0, lr=[5e-05], mom=[(0.9, 0.99)]
steps: 1301 loss: 2.2441 iter time (s): 16.674 samples/sec: 0.060
before GAS splitting, batch size: 1, total tokens: 3712
Maybe I'm not reading what you're saying correctly. (It sounds like you're saying total_tokens must be 4096 at all times.)
All examples in the global batch (i.e., the effective batch on which the single training step is taken) are padded so they are the same length. It's okay if that length is some arbitrary number less than the sequence length.
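For illustration, a minimal sketch of that batch-level padding, using torch.nn.utils.rnn.pad_sequence as a stand-in rather than this repo's actual collation code:

```python
# Sketch of per-global-batch padding (illustrative only, not this repo's code):
# every example in the global batch is padded to the longest example in that
# batch, which may be well under the configured maximum sequence length.
import torch
from torch.nn.utils.rnn import pad_sequence

examples = [torch.ones(5, dtype=torch.long),
            torch.ones(13, dtype=torch.long),
            torch.ones(9, dtype=torch.long)]
batch = pad_sequence(examples, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([3, 13]) -- padded to the batch max, not to 4096
```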
It's working for you because you have gradient_accumulation_steps=1. It might even work with gradient_accumulation_steps > 1 as long as you have no pipeline parallelism, only data parallelism. The problem is this: say you are using pipeline parallelism, gradient_accumulation_steps=2, and a global batch size that is normally 4. That gets divided in half, into two micro-batches of batch_size 2 each. Now if there is an incomplete global batch of length 3, you will get one micro-batch of size 2 and another of size 1 (assuming some other code that makes size assumptions hasn't already crashed by that point). At that point Deepspeed definitely crashes, since the micro-batches you are pipelining aren't the same size.
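A toy illustration of that failure mode (not qlora-pipe's actual code), showing how an incomplete global batch produces micro-batches with mismatched batch dimensions:

```python
# With gradient_accumulation_steps=2 and a normal global batch of 4, an
# incomplete last global batch of 3 examples splits into micro-batches of
# unequal batch size, which preallocated pipeline buffers cannot accept.
import torch

seq_len = 8
incomplete_global_batch = torch.zeros(3, seq_len)        # last, incomplete batch
micro_batches = torch.split(incomplete_global_batch, 2)  # chunks of batch_size 2
print([mb.shape for mb in micro_batches])
# [torch.Size([2, 8]), torch.Size([1, 8])]  <- shapes differ across micro-batches
```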
Can we add empty batches to fill it out and satisfy Deepspeed's requirements? It seems wasteful to throw away those samples just because they won't align.
Edit: not sure what training on an empty context does to a model, though. Also, it should be noted that the problem this solves (aside from discarding samples) is the case where there are not enough samples to generate even ONE eval batch, which results in out-of-bounds errors because global_batch is [] and its 0th entry gets referenced.
The "global" batch is sliced into micro batches. For a single training step, all of those micro batches (numbering gradient_accumulation_steps) have to be the same sized tensor, both in batch and sequence dimension. So, either you are throwing away the last incomplete global batch, or else filling that batch with something so that the tensor is the correct size.
It might be possible to mask off the loss for whatever you pad that tensor with. But the easiest thing to do, I think, will be to simply pad the last incomplete batch with duplicated examples randomly taken from the dataset. It's not ideal, but it's not too terribly incorrect to do that. For training, a very small number of examples would be trained on twice. For eval, a larger fraction might be evaluated twice, since the eval set will typically be much smaller. But I think this is an okay tradeoff, as it doesn't throw away any data and will allow evaluation to not crash even for arbitrarily small eval datasets. I just need to find the cleanest way to do this and test it with all the different configuration options like multiple GPUs, gradient accumulation steps, etc. I can probably have it done sometime this weekend.
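A rough sketch of that padding idea (a hypothetical helper, not the actual implementation): top up an incomplete global batch with randomly chosen duplicate indices so every global batch ends up the same size.

```python
# Hypothetical helper, not the actual implementation: extend an incomplete
# batch of dataset indices to the required size with random duplicates.
import random

def pad_incomplete_batch(batch_indices, target_size, all_indices, rng=random):
    """Return batch_indices extended to target_size with random duplicates."""
    if len(batch_indices) >= target_size:
        return list(batch_indices)
    padding = rng.choices(all_indices, k=target_size - len(batch_indices))
    return list(batch_indices) + padding

# Example: an incomplete eval batch of 3 padded up to the required 4.
print(pad_incomplete_batch([40, 41, 42], target_size=4, all_indices=list(range(50))))
```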
All right. I've reduced the code change to only add the incomplete batch if the global batch is zero length. I believe this addresses your concerns and will plug #2. Regardless, the code will crash without it, so it's at least a net positive, even if it doesn't solve all edge cases.
Edit: padding with random examples from the dataset sounds ideal, yeah. As you point out, the one caveat is very small eval sets, where the same sample might appear repeatedly (3+ times), which would degrade the quality of the evaluations, for obvious reasons. Another idea might be to tweak the batch size so that the final entry has the right number of samples in it, but as noted, that does not work for tiny eval sets.
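For what it's worth, here is a hedged sketch of the behavior described above (not the actual PR diff): the trailing incomplete global batch is kept only when it would otherwise be the only batch, so a tiny eval set still yields one batch instead of indexing into an empty list.

```python
# Hedged sketch of the described guard (not the actual PR diff): drop the
# incomplete tail batch as before, except when it is the only batch at all.

def build_global_batches(indices, global_batch_size):
    batches = [
        indices[i : i + global_batch_size]
        for i in range(0, len(indices), global_batch_size)
    ]
    if len(batches) > 1 and len(batches[-1]) < global_batch_size:
        batches = batches[:-1]  # drop the incomplete tail, as before
    return batches              # but never return an empty list

print(build_global_batches(list(range(10)), 4))  # two complete batches; tail [8, 9] dropped
print(build_global_batches(list(range(3)), 4))   # [[0, 1, 2]] kept: it's the only batch
```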
Fixes #2. I think.