Closed lfoppiano closed 9 months ago
I've extended the support to the other sequence labelling scripts. Overall, I'm not sure how useful this feature is for increasing the batch_size, because performance does not improve when increasing it during fine-tuning 😠 On the other hand, it's definitely nice to have when testing big BERT models, because it allows keeping the same parameters (e.g. `batch_size=20`) that would not be possible without multi-GPU.
For example, in one of my previous tests, batteryonlybert (https://huggingface.co/batterydata/batteryonlybert-cased) had better results, but that was due to using `batch_size=10` to overcome the OOM happening with `batch_size=20`. Fine-tuning with `--multi-gpu` and `batch_size=20` resulted in lower scores.
Thank you @lfoppiano! I was not able to test with a multi-GPU setting, so I just tested with a normal single GPU, which works fine as expected.
I think this is useful, as you say, for larger models (keeping the same batch size), but also for prediction, because we can increase the batch size and process texts more rapidly.
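The prediction speed-up comes from feeding the model larger batches at once. A minimal sketch of the idea (the helper name is hypothetical, not DELFT's actual API):

```python
from typing import Iterator, List


def batch_texts(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of at most `batch_size` texts.

    With more GPU memory available (e.g. multi-GPU), `batch_size` can be
    raised so each model call processes more texts, increasing throughput.
    """
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]
```

Each yielded batch would then be passed to a single model prediction call instead of predicting text by text.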
Doing more tests: training is fine, but there is a failure when writing a model with the `--multi-gpu` option and only one GPU:
```
python3 delft/applications/grobidTagger.py date train_eval --architecture BidLSTM_CRF --embedding glove-840B --multi-gpu
....
_________________________________________________________________
f1 (micro): 95.78
                  precision    recall  f1-score   support
           <day>     0.9091    0.9524    0.9302        42
         <month>     0.9344    0.9661    0.9500        59
          <year>     1.0000    0.9688    0.9841        64
all (micro avg.)     0.9521    0.9636    0.9578       165

model config file saved
preprocessor save
model saved
Exception ignored in: <function Pool.__del__ at 0x7f3fa42ccb80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
```
Without `--multi-gpu` this error does not appear and the model is saved. Of course, using `--multi-gpu` with a single GPU might not be a very consistent user action!
@kermitt2 thanks for testing it. I will add the option for inference too.
The `OSError: [Errno 9] Bad file descriptor` seems to be due to https://github.com/tensorflow/tensorflow/issues/50487 and should be fixed in e169867f6911bb934147077beb0f1dcab4ca7a19.
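The error in that issue comes from a `multiprocessing.Pool` being garbage-collected at interpreter shutdown, when its file descriptors are already closed. A minimal stdlib sketch of the general workaround (close the pool explicitly instead of relying on `Pool.__del__`); the `make_pool` helper is hypothetical, not the code of this PR:

```python
import atexit
from multiprocessing import Pool


def make_pool(processes: int = 2) -> Pool:
    """Create a worker pool that is closed in an orderly way at exit.

    Registering pool.close with atexit ensures the pool shuts down before
    the interpreter tears down its file descriptors, avoiding the
    "OSError: [Errno 9] Bad file descriptor" raised from Pool.__del__.
    """
    pool = Pool(processes=processes)
    atexit.register(pool.close)
    return pool


if __name__ == "__main__":
    pool = make_pool()
    print(pool.map(abs, [-1, -2, 3]))  # prints [1, 2, 3]
```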
This PR adds support for multi-GPU. It has been tested on multiple GPUs on the same node (4 x 16 GB), allowing a larger batch size.
I wanted to implement it in the trainer, but the data preparation processing needs to be under `with strategy.scope():`. I implemented it only for sequence labelling; once it's reviewed, I will update the classification as well.
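The pattern above can be sketched with `tf.distribute.MirroredStrategy`; everything except the strategy API itself (`build_model`, `prepare_dataset`, and the helper below) is hypothetical, not DELFT's actual code:

```python
def global_batch_size(per_gpu_batch_size: int, num_replicas: int) -> int:
    """Effective batch size when a mirrored strategy splits each batch
    across replicas: 4 GPUs allow 4x the single-GPU batch size while
    keeping the same memory footprint per GPU."""
    return per_gpu_batch_size * num_replicas


# Usage sketch (requires TensorFlow and, to be useful, more than one GPU):
#
#   import tensorflow as tf
#   strategy = tf.distribute.MirroredStrategy()
#   batch_size = global_batch_size(20, strategy.num_replicas_in_sync)
#   with strategy.scope():
#       model = build_model()                   # hypothetical model builder
#       dataset = prepare_dataset(batch_size)   # data prep under the scope
#   model.fit(dataset)
```

This is why the change could not live in the trainer alone: both model construction and the data preparation have to happen inside `strategy.scope()` for the variables and datasets to be distributed correctly.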