Closed taghreed34 closed 2 years ago
A single batch taking an hour sounds very unusual. Is GPU properly enabled? But it seems too long even with CPU... I suspect that it has something to do with the setup of Google Colab GPU but have no idea about what is going on there.
@Ryou0634 Yes, the GPU is enabled, or the training task wouldn't have started in the first place; it would have raised an error. In any case, I changed the runtime type to GPU before running the notebook. Am I misunderstanding something?
OK, could you share the notebook or the code snippet if possible?
I tried running the training on Google Colab myself, and with my code and data I observed no problems.
I recommend installing the dependencies via poetry, like this:

```
!pip install -q --pre poetry
!poetry --version
!git clone https://github.com/studio-ousia/luke.git
cd luke
!poetry install
```

Just using `pip` can install the wrong versions of the libraries.
This is my notebook for reference (place the data as necessary). https://colab.research.google.com/drive/1r6YJaMHRgfOCVhQ8vDABvcYct1-Od7KH?usp=sharing
It still says "no module named ..." for several modules (wikipedia2vec, ujson, allennlp, allennlp_models, and google.cloud.storage.retry) after using poetry. @Ryou0634
Does `poetry install` seem successful? If so, how about running `poetry run allennlp train ...`? (add `poetry run` before your training command)
Not for all packages; some of them are stuck at "pending", "downloading", or around 40%. @Ryou0634
I tried installing the missing modules with `poetry run pip3 install ...`, then ran the training command with `poetry run` at the beginning. Now the problem is CUDA out of memory, which doesn't make sense.
Yes, I am not so familiar with Google Colab, but I think it assigns you a different machine every session. That may be why you first got the very slow training and are now getting OOM.
As the training process is quite computationally intensive, I recommend running it in a more stable and powerful environment if possible.
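If you hit OOM again before moving to a more powerful machine, one common workaround with AllenNLP is to shrink the per-step batch size and compensate with gradient accumulation. Below is a sketch of the relevant trainer overrides; the concrete values, and whether they match the keys used in LUKE's own training configs, are assumptions on my part:

```jsonnet
{
  "data_loader": {
    // Hypothetical value: smaller per-step batch to lower peak GPU memory
    "batch_size": 1
  },
  "trainer": {
    // Accumulate gradients over several steps so the effective
    // batch size stays at 1 x 8 = 8 while memory use drops.
    "num_gradient_accumulation_steps": 8
  }
}
```

With `allennlp train`, a fragment like this can also be passed at the command line via the `-o`/`--overrides` option instead of editing the config file.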
Okay I'll do that as soon as possible.
Thank you so much for your time and effort.
Hello there,
I'm trying to fine-tune LUKE-large for an NER task using CoNLL-format data from multiple sources, including CoNLL-2003 itself, which I have combined together. The training is being done on a Google Colab GPU, but it takes very long to finish a single batch: it took about 2 hours to train on just 2 batches with a batch size of 2. Is this expected? And if not, why does this happen?
Thanks in advance