Closed: leileilin closed this issue 2 years ago.
Here is the error I met:

```
Epoch 1: bc/cnn/00/cnn_0001 c_loss: 2.11580 s_loss: 0.57502: 14% 394/2802 [01:04<04:54, 8.18docs/s]
```

It seems the training process stopped. Can you tell me why? Thanks.
The environment I use is newer: torch 1.8.0, cudatoolkit 11.1, transformers 4.6.1, tokenizers 0.10.1.
So you mean it just stopped at document 394 and it won't advance? How long did you wait? Can you show me the stack trace that is output after a keyboard interrupt when it gets stuck?
Yes, I waited for quite a long time and it did not advance.
```
Traceback (most recent call last):
  File "/cephfs/linlei/work/wl-coref/run.py", line 86, in
    _to_id(token)
```
I think maybe the version differences caused that?
Looks like the loop here never exits... One reason for that might be (that's a wild guess) that there's a sentence that is longer than the `bert_window_size`, so the loop variable `end` never advances. What `bert_window_size` are you using? Can you post here the `sent_id` key of the `doc` that the model gets stuck on?
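For reference, here is a tiny self-contained demo of what I mean (a simplified sketch, not the actual code in the repo): when a whole window falls inside one sentence, the "move back to a sentence end" step walks `end` all the way back to `start`, so `length` is 0 and `start` never advances.

```python
# Simplified, self-contained demo of the hang (assumed simplification of the
# batching loop; `batch_size` stands in for the usable window size).
subwords = list("abcdefghij")   # 10 subwords
word_id = list(range(10))       # one subword per word
sent_id = [0] * 10              # all words belong to the same sentence
doc = {"word_id": word_id, "sent_id": sent_id}
batch_size = 5                  # smaller than the sentence

start, end = 0, 0
for _ in range(3):              # the real while-loop would never terminate
    end = min(end + batch_size, len(subwords))
    if end < len(subwords):
        cur = doc["sent_id"][doc["word_id"][end]]
        while end and doc["sent_id"][doc["word_id"][end - 1]] == cur:
            end -= 1            # walks all the way back to start
    length = end - start
    print(start, end, length)   # prints "0 0 0" on every iteration
    start += length
```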
To fix this you might check if after line 32 `end` is still in the same sentence, and if yes, cut right there. Something like this (I haven't tested this code, check if it works yourself, please):
```python
while end < len(subwords):
    start_sent_id = doc["sent_id"][doc["word_id"][start]]
    end = min(end + batch_size, len(subwords))

    # Move back till we hit a sentence end
    if end < len(subwords) and start_sent_id != doc["sent_id"][doc["word_id"][end - 1]]:
        sent_id = doc["sent_id"][doc["word_id"][end]]
        while end and doc["sent_id"][doc["word_id"][end - 1]] == sent_id:
            end -= 1

    length = end - start
    batch = [tok.cls_token] + subwords[start:end] + [tok.sep_token]
    batch_ids = [-1] + list(range(start, end)) + [-1]

    # Padding to desired length
    # -1 means the token is a special token
    batch += [tok.pad_token] * (batch_size - length)
    batch_ids += [-1] * (batch_size - length)

    subwords_batches.append([tok.convert_tokens_to_ids(token)
                             for token in batch])
    start += length
```
Here I use a window_size of 128. When the sentence length exceeds window_size, isn't it divided into several parts? That's my understanding after reading the code.
doc_id: wb/a2e/00/a2e_0025 sent_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39]
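For reference, a quick way to find the longest sentence in that list (a small untested check, where `sent_id` is the list posted above):

```python
from collections import Counter

# Words per sentence; note the window also has to fit the subwords of these
# words plus the special tokens, so the subword count is even larger.
longest_sent, n_words = Counter(sent_id).most_common(1)[0]
print(longest_sent, n_words)
```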
The dataset I use is the English one that you use too.
I know what's wrong. One sentence in the document is longer than the window_size I set, resulting in an endless loop. Do you have any good suggestions?
That's what I thought. Have you tried the piece of code I sent above?
Yeah, I got it through debugging.
And I found that there is another problem: your code does not effectively control the batch size to limit GPU memory usage.
There's that, yes. During training, the model always processes just one document at a time, however big or small it is.
But a document is segmented by window_size into multiple segments. Could the forward pass be run on a fixed number of segments at a time, so that it could work on a GPU with small memory?
I am sure it is possible. Your pull request will be most welcome.
Will you implement this feature next? I tried to use a for loop to forward-propagate a fixed number of segments, but the memory usage is the same. Here is my simple code:
```python
subwords_batches_tensor_batches = subwords_batches_tensor.size(0)
batch_size = 2
cat_out = torch.tensor([], device=self.config.device)
for batch in range(0, subwords_batches_tensor_batches, batch_size):
    per_subwords_batches_tensor = subwords_batches_tensor[batch : batch + batch_size]
    per_subwords_batches = subwords_batches[batch : batch + batch_size]
    per_attention_mask = (per_subwords_batches != self.tokenizer.pad_token_id)
    per_subword_mask_tensor = subword_mask_tensor[batch : batch + batch_size]

    out, _ = self.bert(
        per_subwords_batches_tensor,
        attention_mask=torch.tensor(
            per_attention_mask, device=self.config.device
        )
    )
    del _
    cat_out = torch.cat((cat_out, out[per_subword_mask_tensor]), dim=0)
```
What you are doing may only reduce memory consumption during inference (which is pretty low anyway). During training, all the intermediate states before calling `backward()` are stored in GPU memory, so what you might want to do is simply cut all the training documents into smaller `Doc` objects before training and use that dataset to train. Otherwise you'll need to rewrite the whole training logic.
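A rough, untested sketch of what cutting a document could look like (the key names are assumptions based on the jsonlines format, and span-level fields such as the coreference clusters would still need to be re-indexed):

```python
def split_doc(doc, max_sents=10):
    """Split one Doc dict into smaller ones at sentence boundaries."""
    parts = []
    n_sents = doc["sent_id"][-1] + 1
    for first in range(0, n_sents, max_sents):
        last = first + max_sents
        idx = [i for i, s in enumerate(doc["sent_id"]) if first <= s < last]
        if not idx:
            continue
        start, end = idx[0], idx[-1] + 1
        part = dict(doc)
        part["document_id"] = f'{doc["document_id"]}_part{first // max_sents}'
        part["cased_words"] = doc["cased_words"][start:end]
        part["sent_id"] = [s - first for s in doc["sent_id"][start:end]]
        # Span-based fields (e.g. the cluster annotations) must be shifted by
        # `start` and filtered so they stay inside the slice.
        parts.append(part)
    return parts
```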
In fact, neither of the two methods you mentioned is easy to implement.
And there is another key problem: if the abnormal situation mentioned above is not handled, it will seriously affect practicality. I don't know whether to split such a sentence into several sub-sentences according to the set size or to discard it directly?
Can you paraphrase that a bit? I am not sure I understand what you mean
I mean that when you encounter the special situation mentioned above, that is, the length of a sentence exceeds window_size, your code can't handle it. What I can think of is to truncate or discard such a sentence. Do you have any better suggestions?
I see. I suggest you just cut in the middle of the sentence like in the piece of code I attached. This way you will not lose any data, even though the representations will be less reliable.
Thanks, I solved it by truncating.
Another question: what does `bert-base-chinese_chinese_train_head.jsonlines.pickle` stand for?
That's a cache file; it's safe to delete it.