kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.
https://pytorch-crf.readthedocs.io
MIT License
935 stars 151 forks

Index out of bound error #41

Closed mrkdian closed 2 years ago

mrkdian commented 4 years ago
import torch
from torchcrf import CRF

crf = CRF(5, batch_first=True)
score = torch.randn(1, 3, 5)
target = torch.tensor([
    [1, 2, -100],
], dtype=torch.long)
mask = torch.tensor([
    [1, 1, 0]
], dtype=torch.uint8)

print(crf(score, target, mask=mask))  # raises an index out of bounds error

target = torch.empty_like(target).copy_(target)
target[target == -100] = 0
print(crf(score, target, mask=mask))  # workaround: replace -100 with a valid index

The forward computation still uses the tag values at masked positions as indices, even though those positions are masked out. -100 is useful for the downstream pipeline, and it is ugly to copy the target tensor every time, so could this be handled inside the CRF?
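For reference, the workaround above can be wrapped in a small helper so the copy is not repeated at every call site (my own sketch, reusing the names from the snippet above):

import torch

def sanitize_tags(target: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    # Return a copy of `target` with `ignore_index` replaced by a valid tag index (0).
    # The mask passed to the CRF already excludes these positions, so the
    # replacement value never affects the score.
    target = target.clone()
    target[target == ignore_index] = 0
    return target

print(crf(score, sanitize_tags(target), mask=mask))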

kmkurn commented 4 years ago

Hi, the tags tensor is meant to store indices, so there can't be a negative value. I believe raising an index out of bounds error is actually a good thing here, so I don't think this is an issue to be fixed, sorry. You're free to fork and adapt the code to your needs, though.

JalinWang commented 2 years ago

Hi, the tags tensor is meant to store indices, so there can't be a negative value. I believe raising an index out of bounds error is actually a good thing here, so I don't think this is an issue to be fixed, sorry. You're free to fork and adapt the code to your needs, though.

However, sometimes we pad the label sequence with an additional token, e.g. [PAD] in transformers.BertTokenizer, which makes some elements of the tags tensor exceed num_tags-1. I think the behavior of the library should not depend on how users choose to pad.

My suggested fix is simple: mask the tags with tags * mask before indexing. That's the idea; the actual implementation could apply tags * mask before mask = mask.float().

The original code is as follows.

for i in range(1, seq_length):
    # Transition score to next tag, only added if next timestep is valid (mask == 1)
    # shape: (batch_size,)
    score += self.transitions[tags[i - 1], tags[i]] * mask[i]

    # Emission score for next tag, only added if next timestep is valid (mask == 1)
    # shape: (batch_size,)
    score += emissions[i, torch.arange(batch_size), tags[i]] * mask[i]
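Roughly, the change would look like this (a sketch only, using the same variables as in the scoring loop quoted above; not a tested patch):

# Zero out padded positions in tags before they are used as indices, so any
# out-of-range padding value (e.g. -100 or a [PAD] id >= num_tags) becomes 0.
# Those timesteps are multiplied by mask[i] anyway, so valid scores are unchanged.
tags = tags * mask            # do this while mask is still an integer/bool tensor
mask = mask.float()

for i in range(1, seq_length):
    # Transition score to next tag, only added if next timestep is valid (mask == 1)
    score += self.transitions[tags[i - 1], tags[i]] * mask[i]

    # Emission score for next tag, only added if next timestep is valid (mask == 1)
    score += emissions[i, torch.arange(batch_size), tags[i]] * mask[i]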

If you agree, I can make a PR then.

kmkurn commented 2 years ago

Hi @JalinWang. From my understanding, one would tokenize the input words but not the labels/tags. You don't have to set the padding tag index equal to the padding token index defined by the tokenizer; just use any index between 0 and num_tags-1 and pass the correct mask to forward/decode, and the padding will be ignored. Your proposed fix illustrates this, as it essentially sets the padding tag index to 0.
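For example (a minimal sketch based on the snippet from the first post), padding the tag sequence with 0 and letting the mask handle it:

import torch
from torchcrf import CRF

num_tags = 5
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(1, 3, num_tags)
# Pad the tag sequence with any valid index (here 0); the mask marks the last
# timestep as padding, so the value used there never affects the result.
tags = torch.tensor([[1, 2, 0]], dtype=torch.long)
mask = torch.tensor([[1, 1, 0]], dtype=torch.uint8)

print(crf(emissions, tags, mask=mask))  # log-likelihood, no index error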

Thanks for the PR offer, but I prefer not to change the code. While tag indices outside 0 to num_tags-1 can correspond to padding tags, they can also indicate a bug in how you process/numericalise your tags, and the latter is much harder to detect without any check. With this, I'm closing the issue.

shaked571 commented 2 years ago

I've had this issue for a day now... I tried to fix it myself and things got messy. It wasn't intuitive, and it took me a while to understand where this -100 came from. I would at least add a check with an informative message that warns the user.

I was also surprised I didn't see any mention of it until I searched for padding in the closed issues...
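Something like this user-side check (just a sketch of what I mean, not part of pytorch-crf) would have saved me the day of debugging:

import torch

def check_tags(tags: torch.Tensor, num_tags: int) -> None:
    # Fail early with a readable message instead of an opaque index error
    # deep inside the CRF forward pass.
    lo, hi = int(tags.min()), int(tags.max())
    if lo < 0 or hi >= num_tags:
        raise ValueError(
            f"tag values must be in [0, {num_tags - 1}], got [{lo}, {hi}]; "
            "did you pad with -100 or a tokenizer [PAD] id?"
        )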