I'm running this code on a machine with 2x Tesla P100 GPUs and PyTorch 1.5, and the training time per sample stays the same no matter how large I make the batch size.
For example, with a batch size of 4 each iteration takes 1 second, and with a batch size of 16 each iteration takes 4 seconds. Shouldn't it still take about 1 second if the batch is processed in parallel?
Is there something wrong with the multiprocessing part?
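For reference, the pattern below is a minimal sketch of how I'm measuring the per-iteration time (the model and data here are placeholders, not my actual training code); I'm calling `torch.cuda.synchronize()` before starting and stopping the clock so the measurement includes the GPU work rather than just the kernel launches:

```python
import time
import torch
import torch.nn as nn

# Placeholder model and data, just to show the timing pattern.
# nn.DataParallel should split each batch across the two GPUs.
model = nn.DataParallel(nn.Linear(1024, 1024)).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for bs in (4, 16):
    x = torch.randn(bs, 1024, device="cuda")
    y = torch.randn(bs, 1024, device="cuda")

    torch.cuda.synchronize()  # finish any pending GPU work first
    start = time.time()

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    print(f"bs={bs}: {time.time() - start:.3f}s per iteration")
```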