kermitt2 / delft

a Deep Learning Framework for Text https://delft.readthedocs.io/
Apache License 2.0

Training with multiple GPUs #164

Closed lfoppiano closed 9 months ago

lfoppiano commented 1 year ago

This PR adds support for multi-GPU training. It has been tested on multiple GPUs on the same node (4 x 16 GB), allowing a larger batch size.

I wanted to implement it in the trainer, but the processing related to data preparation needs to run under with strategy.scope():.
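For reference, a minimal sketch of the general tf.distribute.MirroredStrategy pattern (illustrative only, not the actual DeLFT trainer code): model construction, compilation and the data-related preparation are placed inside strategy.scope() so that variables are mirrored across all visible GPUs.

```python
# Minimal sketch of the multi-GPU pattern with tf.distribute.MirroredStrategy.
# The toy model below is illustrative and not the actual DeLFT architecture.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model creation/compilation (and, here, the data preparation) happen
    # inside the scope so that variables are replicated on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=128),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(...) is then called as usual; the batch size passed to fit()
# is the global batch size, split across the replicas.
```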

I implemented it only for sequence labelling; once it's reviewed, I will update the classification as well.

lfoppiano commented 1 year ago

I've extended the support to the other sequence labelling scripts. Overall, I'm not sure how useful this feature is for increasing the batch_size, because performance does not improve when increasing it during fine-tuning 😭 On the other hand, it's definitely nice to have when testing big BERT models, because it allows keeping the same parameters (e.g. batch_size=20), which would not be possible without multi-GPU.

For example, in one of my previous tests, batteryonlybert https://huggingface.co/batterydata/batteryonlybert-cased had better results, but that was due to using batch_size=10 to avoid the OOM occurring with batch_size=20. Fine-tuning with --multi-gpu and batch_size=20 resulted in lower scores.
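For what it's worth, with MirroredStrategy the batch size is the global one and is split across replicas, which is why the same batch_size=20 can fit in memory once several GPUs are used. A hedged sketch of the arithmetic, using the numbers from the tests above purely for illustration:

```python
# Hedged sketch: global vs. per-replica batch size under MirroredStrategy.
# The numbers mirror the discussion above and are purely illustrative.
global_batch_size = 20  # batch size kept identical to the single-GPU setup
num_replicas = 4        # e.g. 4 x 16 GB GPUs on the same node

per_replica_batch_size = global_batch_size // num_replicas
print(per_replica_batch_size)  # 5 samples per GPU, hence no OOM at batch_size=20
```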

kermitt2 commented 1 year ago

Thank you @lfoppiano ! I was not able to test with a multi-GPU setting, so I just tested with a normal single GPU, which works fine as expected.

I think this is useful, as you say, for larger models (keeping the same batch size), but also for prediction, because we can increase the batch size and process texts more rapidly.

kermitt2 commented 1 year ago

Doing more tests: training is fine, but there is a failure when writing a model with the --multi-gpu option while having a single GPU:

python3 delft/applications/grobidTagger.py date train_eval --architecture BidLSTM_CRF --embedding glove-840B --multi-gpu

....

_________________________________________________________________
    f1 (micro): 95.78
                  precision    recall  f1-score   support

           <day>     0.9091    0.9524    0.9302        42
         <month>     0.9344    0.9661    0.9500        59
          <year>     1.0000    0.9688    0.9841        64

all (micro avg.)     0.9521    0.9636    0.9578       165

model config file saved
preprocessor save
model saved
Exception ignored in: <function Pool.__del__ at 0x7f3fa42ccb80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Without --multi-gpu this error does not appear and the model is saved.

Of course, using --multi-gpu when having a single GPU might not be a very consistent user action!
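A possible guard for that case, as a hedged sketch (not something this PR necessarily does): only create the MirroredStrategy when more than one GPU is visible, and fall back to the default strategy otherwise.

```python
# Hedged sketch (not necessarily what this PR does): enable MirroredStrategy
# only when several GPUs are visible, otherwise keep the default strategy.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if len(gpus) > 1:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.get_strategy()  # default, single-device strategy

with strategy.scope():
    # build and compile the model here as usual
    ...
```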

lfoppiano commented 1 year ago

@kermitt2 thanks for testing it. I will add the option for inference too.

lfoppiano commented 1 year ago

The OSError: [Errno 9] Bad file descriptor seems to be due to https://github.com/tensorflow/tensorflow/issues/50487 and should be fixed in e169867f6911bb934147077beb0f1dcab4ca7a19.
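For context, that warning comes from a multiprocessing Pool being garbage-collected only at interpreter shutdown, when its file descriptors are already closed. The generic workaround, shown below as a hedged sketch (not necessarily what the commit above does), is to close and join the pool explicitly instead of relying on Pool.__del__.

```python
# Hedged sketch of the generic workaround for the Pool.__del__ warning:
# terminate and join the pool explicitly, so that cleanup does not happen
# during interpreter shutdown.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:   # __exit__ terminates the worker processes
        results = pool.map(square, range(10))
    pool.join()                       # wait for the workers before exiting
    print(results)
```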