kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.
https://pytorch-crf.readthedocs.io
MIT License

Cross Entropy as a loss function #60

Closed ValerioNeriGit closed 2 years ago

ValerioNeriGit commented 4 years ago

Hi,

I would like to use Cross Entropy as a loss function so I wrote this code:


### __init__
    self.softmax = nn.LogSoftmax(dim=2)
    self.crf = CRF(hparams.num_classes, batch_first=True)

### forward
    if is_train:
        output = self.softmax(output)
        slot_loss = -1 * self.crf(output, y, mask=mask)  # negative log-likelihood
        return slot_loss
    return self.crf.decode(output)

By the definition of cross entropy this should work, but I'm getting losses on very different scales: something like 0.96 when using just PyTorch's cross-entropy loss (without the CRF) and something like 150.8 when using the code above.

Furthermore, I'm getting slightly worse performance when using the CRF compared to not using it, around a 1% difference, whereas an earlier architecture of the same network on the same dataset gave a significant improvement with the CRF.

Is there something wrong with my code?

Thank you

kmkurn commented 4 years ago

You should remove the softmax layer. The CRF layer is intended to replace the softmax (which treats the tags at each timestep as independent, whereas the CRF layer doesn't). About the scale, try passing reduction='token_mean' (see #55). As for the lower performance: generally, if your labels do have dependence, a linear-chain CRF can perform better, but depending on the data/task it may not. A difference of 1% seems somewhat reasonable (if it were 10%, something would be fishy).
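
Something like this, adapting your snippet (a minimal sketch; the raw emission scores go into the CRF unnormalized):

### __init__
    self.crf = CRF(hparams.num_classes, batch_first=True)  # no LogSoftmax layer

### forward
    if is_train:
        # output holds the raw emission scores; token_mean averages the NLL over non-masked tokens
        slot_loss = -1 * self.crf(output, y, mask=mask, reduction='token_mean')
        return slot_loss
    return self.crf.decode(output)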

ValerioNeriGit commented 4 years ago

I'm doing a Named Entity Recognition task, so the labels do depend on each other; furthermore, a CRF layer is used in several architectures proposed for this task.

With token_mean the loss has the expected magnitude, so thank you for that. I tried those reductions, but they consistently obtain a lower score than the code I wrote above or the version without the CRF. For example, for the same number of epochs I obtain:

- 81% with reduction token_mean
- 89% with the code above
- 89% without CRF

With a few more epochs, the one with token_mean reaches 85%.

I used the LogSoftmax layer to try to "convert" the negative log-likelihood of the CRF into cross entropy, similar to how cross entropy is defined in PyTorch's source code.
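
That is, roughly this equivalence (a minimal, CRF-independent sketch with hypothetical shapes):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10)           # (batch, num_classes)
    targets = torch.randint(0, 10, (4,))  # gold class indices

    # cross entropy == log-softmax followed by negative log-likelihood
    ce = F.cross_entropy(logits, targets)
    nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
    assert torch.allclose(ce, nll)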

Does placing that softmax layer defeat the purpose of the CRF?