vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"
MIT License

Questions about training #9

Closed fajarmuslim closed 2 years ago

fajarmuslim commented 2 years ago

Currently, when running this source code, I get a CUDA out-of-memory error, since a single GPU only has 32GB of memory.

On the other hand, I have access to a server with 8 GPUs (each with 32GB of memory). Can I run this training experiment in parallel mode?

If so, how can I achieve that?

Thanks in advance.

vdobrovolskii commented 2 years ago

Hi! The current code has no multi-GPU support, as I was training it on a single GPU. You can, however, manually reassign the encoder to a different GPU.

To do this: here, change config.device to something else. I would add a new key to the config, but you can simply hardcode something like "cuda:1" there. In this function, change all mentions of self.config.device to whatever you used in the previous step. Finally, in the same function, move the output back to self.config.device: return out[subword_mask_tensor].to(self.config.device)
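As a rough illustration of the idea (a sketch with placeholder model and variable names, not the repo's exact code), keeping the encoder on a second GPU and sending its output back to the main device looks like this:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Sketch only: keep the memory-hungry transformer encoder on a second GPU
    # while the rest of the model stays on the main device (config.device).
    bert_device = torch.device("cuda:1")   # the hardcoded second GPU
    main_device = torch.device("cuda:0")   # whatever config.device is set to

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    encoder = AutoModel.from_pretrained("roberta-large").to(bert_device)

    def encode(text: str) -> torch.Tensor:
        # Run the encoder on its own GPU...
        batch = tokenizer(text, return_tensors="pt").to(bert_device)
        out = encoder(**batch).last_hidden_state
        # ...then move the result back so the downstream modules can use it
        return out.to(main_device)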

Hope this helps. Please let me know how this worked out for you.

fajarmuslim commented 2 years ago

Thank you for the answer. I tried the smaller model (roberta-base) to train on a single GPU. It has been working for 2 epochs so far and is still running. My goal is to build a model for my language (Indonesian), and currently we only have a BERT-base model, so I think a single GPU is enough for training.

If I need to run in parallel mode in the future, it could help decrease training time, but for now a single GPU is enough to train BERT-base for Indonesian.

Once again, thank you for the support. I will close this question.

fajarmuslim commented 2 years ago

Hi Vlad,

I have already tried training wl-coref on my own dataset, but I am facing an issue where s_loss reaches inf. I don't know why this happens. First, I tried assigning avg_spans to 1 whenever it is 0, to prevent division by zero at this point:

    s_loss = (self._span_criterion(res.span_scores[:, :, 0], res.span_y[0])
              + self._span_criterion(res.span_scores[:, :, 1], res.span_y[1])) / avg_spans / 2
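For reference, the guard I tried looks roughly like this (just a sketch; avg_spans is the same variable as in the snippet above):

    # Sketch of the attempted fix (not the repo's exact code): clamp the
    # normalizer so the span loss is never divided by zero.
    if avg_spans == 0:
        avg_spans = 1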

But it did not change anything.

Here is a screenshot of the logs when training on my Indonesian data: (screenshot of training logs attached)

After that, it raises the following error: (screenshot of the error attached)

How can I solve this? What is your opinion?

Thank you.

vdobrovolskii commented 2 years ago

Hi! It looks like there is an error in your dataset. To debug it properly, you will need to run on CPU. If it takes too long before any sign of the error appears, I would recommend disabling the gradients of the encoder. Most probably, after switching to CPU you will get an error message that you will know how to work with. Otherwise, post it here :)
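If it helps, the two suggestions above look roughly like this on a bare encoder (a sketch only; in wl-coref you would apply the same steps to the model's own encoder, whose attribute names may differ):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Sketch only: run on CPU for readable stack traces and freeze the encoder
    # so no gradients are computed for it while debugging.
    device = torch.device("cpu")
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    encoder = AutoModel.from_pretrained("roberta-base").to(device)

    for param in encoder.parameters():
        param.requires_grad_(False)   # disable encoder gradients to speed up CPU debugging

    # With the encoder frozen, backprop only flows through the layers on top of it.
    head = torch.nn.Linear(encoder.config.hidden_size, 1)   # stand-in for the trainable part
    batch = tokenizer("Hello world", return_tensors="pt").to(device)
    loss = head(encoder(**batch).last_hidden_state).mean()  # stand-in for the real loss
    loss.backward()                                          # skips the frozen encoder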

fajarmuslim commented 2 years ago

Thanks for the recommendation. I will continue debugging for now.