Closed fajarmuslim closed 2 years ago
Hi! The current code has no multi-GPU support, as I was training it on a single GPU. You can, however, manually reassign the encoder to a different GPU.
For this:
Here, change config.device to something else. I would add a new key to the config, but you can simply hardcode something like "cuda:1" here.
In this function, change all mentions of self.config.device
to whatever you used in the previous step.
Finally, in the same function, move the output to self.config.device:
return out[subword_mask_tensor].to(self.config.device)
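The steps above can be sketched with a toy model. This is only an illustration of the device plumbing, not the actual wl-coref code: the tiny `nn.Linear` encoder stands in for BERT, and the attribute and variable names are made up. The `cuda:1` fallback to CPU just lets the sketch run on any machine.

```python
import torch
from torch import nn

# Would be config.device (e.g. "cuda:0") in the real code.
MAIN_DEVICE = "cpu"
# Hardcoded second device for the encoder, as suggested above;
# falls back to the main device when two GPUs are not available.
ENCODER_DEVICE = "cuda:1" if torch.cuda.device_count() >= 2 else MAIN_DEVICE


class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8).to(ENCODER_DEVICE)  # stand-in for BERT
        self.head = nn.Linear(8, 2).to(MAIN_DEVICE)        # rest of the model

    def forward(self, subwords, subword_mask):
        # Inputs go to the encoder's device (the _bertify changes)...
        out = self.encoder(subwords.to(ENCODER_DEVICE))
        # ...and the selected subword states come back to the main device.
        out = out[subword_mask.to(ENCODER_DEVICE)].to(MAIN_DEVICE)
        return self.head(out)


model = TwoDeviceModel()
scores = model(torch.randn(3, 8), torch.tensor([True, False, True]))
print(scores.shape)  # torch.Size([2, 2])
```

The only invariant that matters is that every tensor entering the encoder lives on the encoder's device, and the encoder's output is moved back before the rest of the model touches it.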
Hope this helps. Please let me know how this worked out for you.
Thank you for the answer. I am trying to use the smaller model (roberta-base) to train on a single GPU. It has worked for 2 epochs so far and is still running. My goal is to build a model for my language (Indonesian). Currently, we only have a BERT-base model, so I think a single GPU is enough for training.
If parallel mode is needed in the future, it could help decrease training time, but for now a single GPU is enough to train BERT-base for Indonesian.
Once again, thank you for the support. I will close this question...
Hi Vlad,
I have been trying to train wl-coref on my own dataset, but I am facing an issue where s_loss reaches inf. I don't know why this happens. First, I tried assigning avg_spans to 1 when it is 0, to prevent division by zero at this point:
s_loss = (self._span_criterion(res.span_scores[:, :, 0], res.span_y[0])
+ self._span_criterion(res.span_scores[:, :, 1], res.span_y[1])) / avg_spans / 2
but it doesn't change anything.
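For reference, the guard idea can be reproduced with toy tensors. The shapes and the `CrossEntropyLoss` setup below are illustrative only, not the exact wl-coref ones; the point is that clamping avg_spans before the division keeps the span loss finite even when a document contributes no spans:

```python
import torch

# Illustrative stand-in for self._span_criterion; reduction and shapes
# are assumptions, not the exact wl-coref configuration.
span_criterion = torch.nn.CrossEntropyLoss(reduction="sum")

span_scores = torch.randn(4, 5, 2)  # [spans, words, start/end] (toy shapes)
span_y = (torch.tensor([1, 3, 0, 2]), torch.tensor([2, 4, 1, 3]))
avg_spans = 0.0                     # the pathological case from the logs

avg_spans = max(avg_spans, 1.0)     # guard before dividing
s_loss = (span_criterion(span_scores[:, :, 0], span_y[0])
          + span_criterion(span_scores[:, :, 1], span_y[1])) / avg_spans / 2
print(torch.isfinite(s_loss).item())  # True
```

If s_loss still reaches inf with this guard in place, the inf is coming from the loss terms themselves rather than from the division, which points at the training targets (i.e. the dataset) instead.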
Here I have attached a screenshot of the logs when training on my Indonesian data.
After that, it raises an error as follows:
How can I solve this? What is your opinion?
Thank you
Hi! Looks like there is an error in your dataset. To properly debug it, you will need to run it on CPU. If this takes too long before there is any sign of the error, I would recommend disabling the gradients of the encoder. Most probably, after switching to CPU you will get an error that you will know how to work with. Otherwise, send it here :)
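Disabling the encoder's gradients is a one-loop change. A minimal sketch, with a toy `nn.Linear` standing in for BERT (in wl-coref you would freeze the actual encoder attribute, whose name depends on your checkout):

```python
import torch
from torch import nn

# Toy stand-ins: "encoder" plays the role of BERT, "head" the rest of the model.
encoder = nn.Linear(16, 16)
head = nn.Linear(16, 1)

# Freeze the encoder so the debugging run on CPU skips its backward pass.
for param in encoder.parameters():
    param.requires_grad_(False)

x = torch.randn(4, 16)
loss = head(encoder(x)).sum()
loss.backward()

print(encoder.weight.grad is None)   # True: encoder received no gradients
print(head.weight.grad is not None)  # True: the rest of the model still trains
```

This keeps the forward pass (and therefore the dataset processing that triggers the error) intact while cutting most of the backward-pass cost.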
Thanks for the recommendation. I will continue debugging now...
Currently, when running this source code, I get a CUDA out-of-memory error, since a single GPU has only 32 GB of memory.
On the other hand, I have access to a server with 8 GPUs (each with 32 GB of memory). Can I run this training experiment in parallel mode?
If so, how can I achieve that?
Thanks in advance..