Closed: leileilin closed this issue 2 years ago.
Here is the error I met:

```
Epoch 1: bc/cnn/00/cnn_0001 c_loss: 2.11580 s_loss: 0.57502: 14% 394/2802 [01:04<04:54, 8.18docs/s]
```

It seems the training process stopped. Can you tell me why? Thanks.
The environment I use is newer: torch 1.8.0, cudatoolkit 11.1, transformers 4.6.1, tokenizers 0.10.1.
So you mean it just stopped at document 394 and it won't advance? How long did you wait? Can you show me the stack trace that is output after a keyboard interrupt when it gets stuck?
Yes, I waited for quite a long time and it did not advance.
```
Traceback (most recent call last):
  File "/cephfs/linlei/work/wl-coref/run.py", line 86, in
    _to_id(token)
```
I think maybe the version differences caused that?
Looks like the loop here never exits... One reason for that might be (that's a wild guess) that there's a sentence that is longer than the `bert_window_size`, so the loop variable `end` never advances. What `bert_window_size` are you using? Can you post here the `sent_id` key of the `doc` that the model gets stuck on?
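For reference, here is a tiny self-contained demo of what I mean (a simplified sketch, not the actual code in the repo): when a whole window falls inside one sentence, the "move back to a sentence end" step walks `end` all the way back to `start`, so `length` is 0 and `start` never advances.

```python
# Simplified, self-contained demo of the hang (assumed simplification of the
# batching loop; `batch_size` stands in for the usable window size).
subwords = list("abcdefghij")   # 10 subwords
word_id = list(range(10))       # one subword per word
sent_id = [0] * 10              # all words belong to the same sentence
doc = {"word_id": word_id, "sent_id": sent_id}
batch_size = 5                  # smaller than the sentence

start, end = 0, 0
for _ in range(3):              # the real while-loop would never terminate
    end = min(end + batch_size, len(subwords))
    if end < len(subwords):
        cur = doc["sent_id"][doc["word_id"][end]]
        while end and doc["sent_id"][doc["word_id"][end - 1]] == cur:
            end -= 1            # walks all the way back to start
    length = end - start
    print(start, end, length)   # prints "0 0 0" on every iteration
    start += length
```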
To fix this you might check if after line 32 `end` is still in the same sentence, and if yes, cut right there. Something like this (I haven't tested this code, check if it works yourself, please):
```python
while end < len(subwords):
    start_sent_id = doc["sent_id"][doc["word_id"][start]]
    end = min(end + batch_size, len(subwords))

    # Move back till we hit a sentence end
    if end < len(subwords) and start_sent_id != doc["sent_id"][doc["word_id"][end - 1]]:
        sent_id = doc["sent_id"][doc["word_id"][end]]
        while end and doc["sent_id"][doc["word_id"][end - 1]] == sent_id:
            end -= 1

    length = end - start
    batch = [tok.cls_token] + subwords[start:end] + [tok.sep_token]
    batch_ids = [-1] + list(range(start, end)) + [-1]

    # Padding to desired length
    # -1 means the token is a special token
    batch += [tok.pad_token] * (batch_size - length)
    batch_ids += [-1] * (batch_size - length)

    subwords_batches.append([tok.convert_tokens_to_ids(token)
                             for token in batch])
    start += length
```
Here I use a window_size of 128. When the sentence length exceeds window_size, isn't it divided into several parts? That's my understanding after reading the code.
doc_id: wb/a2e/00/a2e_0025 sent_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39]
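For reference, a quick way to find the longest sentence in that list (a small untested check, where `sent_id` is the list posted above):

```python
from collections import Counter

# Words per sentence; note the window also has to fit the subwords of these
# words plus the special tokens, so the subword count is even larger.
longest_sent, n_words = Counter(sent_id).most_common(1)[0]
print(longest_sent, n_words)
```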
The dataset I use is the English one that you use too.
I know what's wrong. One sentence in the document is longer than the window_size I set, resulting in an endless loop. Do you have any good suggestions?
That's what I thought. Have you tried the piece of code I sent above?
Yeah, I got it through debugging.
And I found that there is another problem: your code does not effectively control the batch size to limit GPU memory usage.
There's that, yes. During training, the model always processes just one document at a time, however big or small it is.
But a document is segmented by window_size into multiple segments. Could the forward pass be run on a fixed number of segments at a time, so that it could work on a GPU with small memory?
I am sure it is possible. Your pull request will be most welcome.
Will you implement this feature next? I tried to use a for loop to forward-propagate a fixed number of segments, but the memory usage is the same. Here is my simple code:
```python
subwords_batches_tensor_batches = subwords_batches_tensor.size(0)
batch_size = 2
cat_out = torch.tensor([], device=self.config.device)
for batch in range(0, subwords_batches_tensor_batches, batch_size):
    per_subwords_batches_tensor = subwords_batches_tensor[batch : batch + batch_size]
    per_subwords_batches = subwords_batches[batch : batch + batch_size]
    per_attention_mask = (per_subwords_batches != self.tokenizer.pad_token_id)
    per_subword_mask_tensor = subword_mask_tensor[batch : batch + batch_size]

    out, _ = self.bert(
        per_subwords_batches_tensor,
        attention_mask=torch.tensor(
            per_attention_mask, device=self.config.device
        )
    )
    del _
    cat_out = torch.cat((cat_out, out[per_subword_mask_tensor]), dim=0)
```
What you are doing may only reduce memory consumption during inference (which is pretty low anyway). During training, all the intermediate states before calling `backward()` are stored in GPU memory, so what you might want to do is simply cut all the training documents into smaller `Doc` objects before training and use that dataset to train. Otherwise you'll need to rewrite the whole training logic.
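A rough, untested sketch of what cutting a document could look like (the key names are assumptions based on the jsonlines format, and span-level fields such as the coreference clusters would still need to be re-indexed):

```python
def split_doc(doc, max_sents=10):
    """Split one Doc dict into smaller ones at sentence boundaries."""
    parts = []
    n_sents = doc["sent_id"][-1] + 1
    for first in range(0, n_sents, max_sents):
        last = first + max_sents
        idx = [i for i, s in enumerate(doc["sent_id"]) if first <= s < last]
        if not idx:
            continue
        start, end = idx[0], idx[-1] + 1
        part = dict(doc)
        part["document_id"] = f'{doc["document_id"]}_part{first // max_sents}'
        part["cased_words"] = doc["cased_words"][start:end]
        part["sent_id"] = [s - first for s in doc["sent_id"][start:end]]
        # Span-based fields (e.g. the cluster annotations) must be shifted by
        # `start` and filtered so they stay inside the slice.
        parts.append(part)
    return parts
```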
In fact, neither of the two methods you mentioned is easy to implement.
And there is another key problem: if the abnormal situation mentioned above is not handled, it will seriously affect practicality. I don't know whether to split such a sentence into several sub-sentences according to the set size or to discard it directly?
Can you paraphrase that a bit? I am not sure I understand what you mean
I mean that when you encounter the special situation mentioned above, that is, the length of a sentence exceeds window_size, your code can't handle it. What I can think of is to truncate or discard such a sentence. Do you have any better suggestions?
I see. I suggest you just cut in the middle of the sentence like in the piece of code I attached. This way you will not lose any data, even though the representations will be less reliable.
Thanks, I solved it by truncating.
Another question: what does `bert-base-chinese_chinese_train_head.jsonlines.pickle` stand for?
That's a cache file; it's safe to delete it.