hpi-dhc / xmen

✖️MEN - A Modular Toolkit for Cross-Lingual Medical Entity Normalization

OutOfMemoryError while training the cross-encoder #22

Closed · kunalr97 closed this 11 months ago

kunalr97 commented 11 months ago
# Imports assumed from xmen's cross-encoder reranking module
from xmen.reranking.cross_encoder import CrossEncoderReranker, CrossEncoderTrainingArgs

train_args = CrossEncoderTrainingArgs(num_train_epochs=5)

rr = CrossEncoderReranker()
output_dir = f'../outputs/{label2dict[label]}_index/cross_encoder_training/'

rr.fit(
    train_dataset=train,
    val_dataset=val,
    output_dir=output_dir,
    training_args=train_args,
    show_progress_bar=False,
)

When I try to train the cross-encoder on the BRONCO dataset to predict ICD codes for the diagnosis entities, I get this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 15.77 GiB total capacity; 14.34 GiB already allocated; 379.12 MiB free; 15.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried running the following, but it does not seem to help. There are also no other processes running on the GPU.

import torch
torch.cuda.empty_cache()

Thanks in advance for your help.

phlobo commented 11 months ago

Hello!

The cross-encoder is indeed quite memory-intensive (I tested everything with 48 GB of GPU memory). Two things that might work:

1) I'm not sure whether all memory allocated by SapBERT is actually cleared by empty_cache(). Instead, you might want to save the candidate dataset to disk and restart the process / notebook to make sure CUDA memory is entirely freed up (a rough sketch follows after this list).

2) You can reduce the memory footprint of the cross-encoder by reducing the number of candidates subject to re-ranking (which equals the batch size) to something like 16 instead of 64.
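
For the first option, here is a rough sketch of the save-and-restart workflow. It assumes the candidate dataset is a Hugging Face datasets object (as in the xmen examples) and uses a made-up output path:

# Sketch only: `candidates` being a Hugging Face datasets object and the
# output path are assumptions; adjust both to your setup.
from datasets import load_from_disk

# First session: after candidate generation with SapBERT, persist the result
candidates.save_to_disk('../outputs/bronco_candidates')

# ...restart the process / notebook so all CUDA memory is released...

# New session: reload the candidates without re-running SapBERT
candidates = load_from_disk('../outputs/bronco_candidates')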

phlobo commented 11 months ago

Another thing that might work (though I have not tested the performance) would be to use a smaller BERT model, e.g.,

train_args = CrossEncoderTrainingArgs(model_name="distilbert-base-multilingual-cased")

kunalr97 commented 11 months ago

Hi,
Thanks for your quick response. I will try this and hope that it works. Where exactly do I need to do this?

2. You can reduce the memory footprint of the cross-encoder by reducing the number of candidates subject to re-ranking (which equals the batch size) to something like 16 instead of 64.

Thanks in advance

phlobo commented 11 months ago

There are multiple steps at which you can reduce the number of candidates. However, if you follow this notebook (https://github.com/hpi-dhc/xmen/blob/main/examples/02_BRONCO.ipynb), then setting K_RERANKING = 16 just before calling CrossEncoderReranker.prepare_data should do the trick.
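
As a minimal sketch of where that change goes in the notebook flow (only the K_RERANKING assignment changes; the prepare_data call itself stays exactly as in the notebook):

K_RERANKING = 16  # re-rank only the top 16 candidates; this also becomes the cross-encoder batch size

# ...then call CrossEncoderReranker.prepare_data(...) exactly as in the
# notebook, so that only K_RERANKING candidates per mention are handed on
# to the cross-encoder.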

Note: I assume that this will cost you a bit of recall@1, but it might actually increase precision. To get precision, recall, and F1 scores at the end, use evaluate instead of evaluate_at_k.
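
A hedged example of what those final evaluation calls might look like; the import path and argument order are assumptions based on how the example notebooks use these functions:

# Assumption: evaluate / evaluate_at_k live in xmen.evaluation and take the
# ground-truth dataset and the (re-ranked) predictions in this order.
from xmen.evaluation import evaluate, evaluate_at_k

print(evaluate_at_k(ground_truth, reranked_predictions))  # recall@k
print(evaluate(ground_truth, reranked_predictions))       # precision, recall, and F1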

kunalr97 commented 11 months ago

Thanks a lot! I don't get that error now.

phlobo commented 11 months ago

Thank you for pointing this issue out. I have linked this thread in the README.