kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.
https://pytorch-crf.readthedocs.io
MIT License

Should I manually set the transition probability to zero where O->I in NER task? #90

Closed lkqnaruto closed 2 years ago

lkqnaruto commented 2 years ago

Hi

Thank you for the amazing work! I'm currently using the pytorch-crf package for an NER task. For transitions that are invalid in NER, such as O->I, should I manually set the corresponding entries in the transition probability matrix to zero? I went through the pytorch-crf code and didn't see any such handling.

Thanks in advance!

kmkurn commented 2 years ago

Hi, thanks for using the library. Yes, you should set it manually if you want to ensure that such invalid transitions never occur when decoding. In practice the model should be able to learn the constraints during training, but manually setting the transition scores is so easy that it doesn't hurt to do it.

One thing though: the transitions parameter stores transition scores, which are in log space. So you should set invalid transitions to a large negative number (say -1e9), not zero.
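
For concreteness, a minimal sketch of what this could look like; the tag set and its ordering here are made up for illustration:

```python
import torch
from torchcrf import CRF

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical BIO tag set
tag2idx = {t: i for i, t in enumerate(tags)}

crf = CRF(len(tags), batch_first=True)

# transitions[i, j] holds the score (log space) of moving from tag i to
# tag j, so an invalid transition gets a large negative score, not zero.
with torch.no_grad():
    for tag, idx in tag2idx.items():
        if tag.startswith("I-"):
            crf.transitions[tag2idx["O"], idx] = -1e9
```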

lkqnaruto commented 2 years ago

> Hi, thanks for using the library. Yes, you should set it manually if you want to ensure that such invalid transitions never occur when decoding. In practice the model should be able to learn the constraints during training, but manually setting the transition scores is so easy that it doesn't hurt to do it.
>
> One thing though: the transitions parameter stores transition scores, which are in log space. So you should set invalid transitions to a large negative number (say -1e9), not zero.

Thank you for the quick reply.

Yes, I was thinking that throughout training the model can actually learn such constraints (e.g. "O->I") even if we don't explicitly set a large negative number for the corresponding entries in the transition matrix, right?

But if we do that, do you think this can further improve the model's performance?

kmkurn commented 2 years ago

I'm not sure as it kinda depends on many factors. I usually just do it because it's so easy and can only improve performance 😃

lkqnaruto commented 2 years ago

> I'm not sure as it kinda depends on many factors. I usually just do it because it's so easy and can only improve performance 😃

Should I also set a large negative number in the start transition and end transition parameters?

kmkurn commented 2 years ago

Assuming you're using the BIO scheme, I would set the start transition score for I-X tags to -1e9 as well, since starting with an inside tag is also invalid, and keep the end transition scores unchanged.
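
Continuing the sketch from above (same hypothetical tag set), that could look like:

```python
import torch
from torchcrf import CRF

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical BIO tag set
crf = CRF(len(tags), batch_first=True)

# Under BIO, a sequence can't start with an inside tag, so push the start
# transition scores for I-X tags down; end transitions are left unchanged.
with torch.no_grad():
    for idx, tag in enumerate(tags):
        if tag.startswith("I-"):
            crf.start_transitions[idx] = -1e9
```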

lkqnaruto commented 2 years ago

> Assuming you're using the BIO scheme, I would set the start transition score for I-X tags to -1e9 as well, since starting with an inside tag is also invalid, and keep the end transition scores unchanged.

Thank you for the reply; yes, I use the BIO scheme. What does I-X mean? Does the X correspond to the tag of a subtoken if I use a BERT-based NER model?

Actually, there is one more question I'd really like your advice on.

Method 1: For instance, suppose we have these tokens: tokens = ["[CLS]", "Al", "##bert", "Ein", "##stein", ...]

I index the sequence using the mask and pass only ["Al", "Ein", ...] to the CRF. Basically, I filter out the [PAD], [CLS], and [SEP] tokens and the subtokens ("##bert"), and only feed tokens like ["Al", "Ein", ...] to the CRF.

Method 2: Suppose we have the same tokens: tokens = ["[CLS]", "Al", "##bert", "Ein", "##stein", ...]. Here I don't filter anything out before the CRF; instead, I feed everything into it. Within the CRF, [PAD] can be handled by the mask, and for the subtokens I set their corresponding entries in the transition matrix to a very large negative value.

Which method do you think is more reasonable? I was thinking that the two methods are essentially the same; however, method 1 is probably more computationally efficient, since the CRF only has to process shorter sequences (smaller emission tensors) throughout training.

Any suggestions?

kmkurn commented 2 years ago

> What does I-X mean?

I just meant any inside tag, e.g. I-PER, I-LOC, etc.

> Which method do you think is more reasonable?

I personally use method 1. I'm not sure how to set the transition matrix to make method 2 work though.
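
A rough sketch of method 1, assuming emissions of shape (batch, seq_len, num_tags) and a boolean word_start_mask that is true only at the first subtoken of each word (both names are assumptions, not part of the library):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torchcrf import CRF

num_tags = 5
crf = CRF(num_tags, batch_first=True)

def word_level_inputs(emissions, word_start_mask):
    # Keep only the emission rows of first subtokens, then re-pad so the
    # batch is rectangular again, and build the matching length mask.
    kept = [e[m] for e, m in zip(emissions, word_start_mask)]
    lengths = torch.tensor([k.size(0) for k in kept])
    padded = pad_sequence(kept, batch_first=True)  # (batch, max_words, num_tags)
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return padded, mask

emissions = torch.randn(2, 8, num_tags)  # e.g. BERT logits per subtoken
word_start_mask = torch.tensor([[0, 1, 0, 1, 0, 0, 0, 0],
                                [0, 1, 1, 1, 0, 1, 0, 0]], dtype=torch.bool)
padded, mask = word_level_inputs(emissions, word_start_mask)
best_paths = crf.decode(padded, mask=mask)  # word-level tag sequences
```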

takipipo commented 2 years ago

> Method 1: [...] I index the sequence using the mask and pass only ["Al", "Ein", ...] to the CRF. Basically, I filter out the [PAD], [CLS], and [SEP] tokens and the subtokens ("##bert"), and only feed tokens like ["Al", "Ein", ...] to the CRF.

Regarding method 1, you mentioned using the mask to mask out the [CLS] token. I did the same thing but got this error: `mask of the first timestep must all be on`. My input_mask is [0, 1, 1, 1, ..., 1, 0, 0, ...]. As I understand it, the first timestep of the mask cannot be 0. So what is the correct way of masking [CLS]? Do I have to slice the emissions before passing them to the CRF?

kmkurn commented 2 years ago

@takipipo The mask that the CRF class accepts is a length mask, so you shouldn't use it for masking anything else. The code does a number of checks to ensure that, one of them being that the mask is all on at the first timestep (the assumption being that a zero-length input doesn't make sense). So to drop the [CLS] token, slicing the emissions is what I'd do. Hopefully this answers your question.
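
For example, something along these lines (a sketch with made-up shapes, assuming batch_first=True and a mask that is 1 only on real tokens):

```python
import torch
from torchcrf import CRF

num_tags = 5
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 6, num_tags)  # (batch, seq_len, num_tags)
# 1 on real tokens; [CLS] at position 0 and trailing [SEP]/[PAD] are 0
token_mask = torch.tensor([[0, 1, 1, 1, 0, 0],
                           [0, 1, 1, 1, 1, 0]], dtype=torch.bool)

# Slice off position 0 ([CLS]) so the length mask starts all-on
best_paths = crf.decode(emissions[:, 1:, :], mask=token_mask[:, 1:])
```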

takipipo commented 2 years ago

@kmkurn Thanks for the reply. If I'm not mistaken, slicing the emissions makes the torch.Tensor shapes inconsistent, which leads to an error. Do you have any solution?

kmkurn commented 2 years ago

@takipipo Can you elaborate on what it would be inconsistent with?