neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Question on crf layer, why loop through batch before crf layer? #37

Open lkqnaruto opened 2 years ago

lkqnaruto commented 2 years ago

I checked out the code, where you wrote:

 # 1- the CRF package assumes the mask tensor cannot have interleaved
# zeros and ones. In other words, the mask should start with True
# values, transition to False at some moment and never transition
# back to True. That can only happen for simple padded sequences.
# 2- The first column of mask tensor should be all True, and we
# cannot guarantee that because we have to mask all non-first
# subtokens of the WordPiece tokenization.

Can you explain that a little bit? I'm still confused about what you mean here. What does "interleaved zeros and ones" mean?

Thank you

fabiocapsouza commented 2 years ago

Hi @lkqnaruto ,

By interleaved zeros and ones, I meant a mask like [0, 1, 0, 1, 1, 0, 0, 0, 1, ...] instead of [1, 1, 1, 1, 0, 0, 0]. Because we are using WordPiece, which is a subword tokenization, the word-continuation tokens (those that start with ##) do not have an associated tag prediction for the NER task; otherwise, words that are tokenized into 2+ tokens would have multiple predictions.

For instance, suppose we have these tokens: tokens = ["[CLS]", "Al", "##bert", "Ein", "##stein", ...]. The mask would be mask = [0, 1, 0, 1, 0, ...], which is incompatible with the CRF package. So we have to index the sequence using the mask and pass only ["Al", "Ein", ...] to the CRF.
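For illustration only (not the repository's code), a minimal sketch of building such a first-subtoken mask from WordPiece tokens could look like this:

 tokens = ["[CLS]", "Al", "##bert", "Ein", "##stein", "[SEP]"]

 # True only at the first subtoken of each word: special tokens and
 # "##" continuation pieces are masked out.
 first_subtoken_mask = [
     tok not in ("[CLS]", "[SEP]") and not tok.startswith("##")
     for tok in tokens
 ]
 print(first_subtoken_mask)  # [False, True, False, True, False, False]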

The mask is different for each sequence in the batch, and the masks have different lengths (sums of 1's), so this masking is not trivial to do without an explicit for loop.
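As a rough sketch of that loop (assuming the pytorch-crf package; the function and variable names below are illustrative and differ from the repository's actual code):

 import torch
 from torchcrf import CRF  # pytorch-crf

 num_tags = 9  # e.g. size of the NER tag set (illustrative value)
 crf = CRF(num_tags, batch_first=True)

 def crf_negative_log_likelihood(emissions, tags, first_subtoken_mask):
     # emissions: (batch, seq_len, num_tags) scores from the token classifier
     # tags: (batch, seq_len) gold tag ids
     # first_subtoken_mask: (batch, seq_len) bool, True only at first subtokens
     loss = emissions.new_zeros(())
     for emis, tag, mask in zip(emissions, tags, first_subtoken_mask):
         # Select only the first-subtoken positions; the number of selected
         # positions differs per sequence, hence one sequence at a time.
         emis_sel = emis[mask].unsqueeze(0)    # (1, num_words, num_tags)
         tag_sel = tag[mask].unsqueeze(0)      # (1, num_words)
         loss = loss - crf(emis_sel, tag_sel)  # CRF returns the log-likelihood
     return loss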

lkqnaruto commented 2 years ago

Thank you for the reply. A follow-up question: why do we have to loop over each sequence of the batch before the CRF? I think the CRF package can handle batch-wise calculation.

ViktorooReps commented 2 years ago

Hi @fabiocapsouza,

I'm experimenting with different ways of handling subwords for the CRF layer. Why did you choose to just take the first subtoken? Wouldn't some sort of pooling of the subword representations work better?

I would greatly appreciate it if you could share your thoughts on the matter!

fabiocapsouza commented 2 years ago

Hi @ViktorooReps , I used the first subtoken because that is how BERT does it for NER, so it is the simplest way to add a CRF on top of it. Yeah, maybe some sort of pooling could be better, even though the subword representations are already contextual. It would be a nice experiment.
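For concreteness, a minimal sketch of the two options discussed here, first-subtoken selection versus mean pooling of subword vectors (the word_ids mapping and the function name are assumptions for illustration, not part of the repository):

 import torch

 def pool_word_representations(hidden_states, word_ids, mode="first"):
     # hidden_states: (seq_len, hidden_size) BERT outputs for one sequence
     # word_ids: (seq_len,) tensor mapping each subtoken to its word index,
     #           with -1 for special tokens such as [CLS] and [SEP]
     # mode: "first" keeps the first subtoken's vector; "mean" averages
     #       all subtokens of the word
     pooled = []
     for w in range(int(word_ids.max().item()) + 1):
         positions = (word_ids == w).nonzero(as_tuple=True)[0]
         if mode == "first":
             pooled.append(hidden_states[positions[0]])
         else:
             pooled.append(hidden_states[positions].mean(dim=0))
     return torch.stack(pooled)  # (num_words, hidden_size)

Either output could then be passed to the CRF with an all-True mask, since there is now exactly one vector per word.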