nlpyang / PreSumm

Code for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders"
MIT License

❓ Confused about `batch_size` #8

Closed astariul closed 5 years ago

astariul commented 5 years ago

I'm having difficulty wrapping my head around the `batch_size` parameter.

What exactly is the `batch_size` parameter?

It's not the real batch size (i.e. how many samples are processed at once).
So what is it exactly? And how can I choose the real batch size from this argument?

astariul commented 5 years ago

As I understand it so far:

At each step, the Trainer class runs `accum_count` mini-batches. Each mini-batch contains X samples, where X varies so that:

X * max(len(x) for x in the mini-batch) <= batch_size

Please let me know if I understood this right!


I think some comments on this function:

https://github.com/nlpyang/PreSumm/blob/fa69433dffa61e3dffc13bc299f7f949fcf305da/src/models/data_loader.py#L112-L124

might be helpful!
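
For illustration, here is a minimal sketch of the kind of length-based dynamic batching I mean (purely illustrative, not the actual code from `data_loader.py`; `examples` and `batch_size` are placeholder names):

```python
# Illustrative sketch only -- not PreSumm's actual batching code.
# Groups examples so that (#examples in batch) * (length of the longest
# example in the batch) stays within the `batch_size` budget, i.e. the
# inequality written above.
def dynamic_batches(examples, batch_size):
    batch, max_len = [], 0
    for ex in examples:
        candidate_len = max(max_len, len(ex))
        if batch and (len(batch) + 1) * candidate_len > batch_size:
            yield batch                      # current batch is full
            batch, max_len = [ex], len(ex)   # start a new batch with this example
        else:
            batch.append(ex)
            max_len = candidate_len
    if batch:
        yield batch
```

The number of examples per batch then depends on how long they are: many short examples or a few long ones, which is why the "real" batch size is not fixed.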

nlpyang commented 5 years ago

For extractive, `batch_size` is the maximum number of sentences in the source document. For abstractive, `batch_size` is the maximum number of tokens in the target summary.

It is designed to use the memory more effectively.
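
To make that distinction concrete, a hedged sketch of how an example's "cost" against the `batch_size` budget could be counted in each mode (the field names `src_sents` and `tgt_tokens` are assumptions for illustration, not the repo's actual data structures):

```python
# Illustrative only: what an example "costs" against the batch_size budget
# differs between the two settings described above.
def example_cost(example, task):
    if task == "ext":
        # extractive: budget counts sentences in the source document
        return len(example["src_sents"])
    else:
        # abstractive: budget counts tokens in the target summary
        return len(example["tgt_tokens"])
```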

jdh3577 commented 5 years ago

Then, what exactly is the real (in the original sense) batch size?

astariul commented 5 years ago

Batch size (by its traditional definition) is not a fixed number here. This is designed to use the memory much more efficiently than a fixed batch size would.

nlpyang commented 5 years ago

@jdh3577 As I said, this is dynamic during training.

Shashi456 commented 4 years ago

@nlpyang would decreasing the batch size to 512 in Extractive summarization affect performance?

gaozhiguang commented 3 years ago

Hi, does the `batch_size` here have something to do with the number of GPUs? The training is distributed, so how does the model update its parameters? Do all GPUs merge their gradients and then update?

gaozhiguang commented 3 years ago

Maybe this is the relevant code:

```python
if self.grad_accum_count > 1:
    if self.n_gpu > 1:
        grads = [p.grad.data for p in self.model.parameters()
                 if p.requires_grad and p.grad is not None]
        distributed.all_reduce_and_rescale_tensors(grads, float(1))
    for o in self.optims:
        o.step()
```
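
For reference, a minimal sketch of the general pattern that snippet follows, written with plain `torch.distributed` rather than the repo's own `distributed` helper (illustrative only, not PreSumm's actual implementation):

```python
import torch.distributed as dist

# Illustrative sketch: every rank sums its gradients with the other ranks'
# gradients, then each rank takes the same optimizer step.
# PreSumm's all_reduce_and_rescale_tensors plays this role (it is called
# with a rescale factor of float(1) in the snippet above).
def all_reduce_gradients(model, world_size):
    for p in model.parameters():
        if p.requires_grad and p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)  # merge gradients across GPUs
            p.grad.data.div_(world_size)                        # rescale the summed gradients to an average
```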

pretidav commented 1 year ago

@nlpyang could you please shed some light on the meaning of this parameter? It clearly isn't the number of documents in the batch, but something related to the number of word-pieces multiplied by a funny factor of 300. Is the latter a typo or a magic number inserted on purpose? Thanks