kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.
https://pytorch-crf.readthedocs.io
MIT License
935 stars 151 forks

issue with imbalanced data #58

Closed: emadeldeen24 closed this issue 4 years ago

emadeldeen24 commented 4 years ago

Firstly, thank you for sharing the code and making it easy to use. I'm using a CRF to classify EEG data, as the labels are sequential and have dependencies between them.

However, the labels are imbalanced and the CRF seems to just produce the label of the majority class. Oversampling is not appropriate in this case, so I wonder if you have a solution or suggestion for this issue.

Thanks.

kmkurn commented 4 years ago

Hi, no worries.

I think the issue of imbalanced data also happens in part-of-speech tagging where most words are nouns. I'm not sure if people handle this issue in POS tagging because in reality, nouns are indeed very common. What I would suggest is:

  1. use a more powerful model for the emission score, because I think that's where you can truly improve, as opposed to the transition probabilities, which are just categorical distributions, or
  2. use a more complex transition distribution, in which case you must use another library, e.g. pytorch-struct, that offers you more freedom in specifying the emission/transition score.
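
To illustrate the first suggestion, here is a minimal pure-Python sketch of Viterbi decoding for a two-tag linear-chain CRF. It is not pytorch-crf code: the `viterbi` helper and the names `transitions`, `weak` and `strong` are hypothetical, and all scores are made-up numbers. The assumption is that transitions learned from imbalanced data favor staying in the majority tag 0. With low-margin emissions the decoder then outputs only tag 0, while sharper emissions let the minority tag 1 through.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for one example.

    emissions[t][tag]      : score of `tag` at time step t
    transitions[prev][tag] : score of moving from `prev` to `tag`
    """
    num_tags = len(transitions)
    # best[tag] = best score of any path that ends in `tag` at the current step
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = [], []
        for tag in range(num_tags):
            # Best previous tag for reaching `tag` at this step
            prev = max(range(num_tags), key=lambda p: best[p] + transitions[p][tag])
            step_best.append(best[prev] + transitions[prev][tag] + scores[tag])
            step_back.append(prev)
        best = step_best
        backpointers.append(step_back)
    # Start from the best final tag and follow the back-pointers
    tag = max(range(num_tags), key=lambda t: best[t])
    path = [tag]
    for step_back in reversed(backpointers):
        tag = step_back[tag]
        path.append(tag)
    return path[::-1]


# Made-up transition scores: staying in the majority tag 0 is rewarded,
# switching to the minority tag 1 is penalised.
transitions = [[0.5, -1.0],
               [-0.5, 0.0]]

# Weak emission model: it prefers tag 1 at steps 2 and 3, but only slightly.
weak = [[0.6, 0.0], [0.5, 0.0], [0.0, 0.3], [0.0, 0.3], [0.5, 0.0]]

# Stronger emission model: same preferences, much larger margins.
strong = [[2.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 3.0], [2.0, 0.0]]

print(viterbi(weak, transitions))    # [0, 0, 0, 0, 0]
print(viterbi(strong, transitions))  # [0, 0, 1, 1, 0]
```

The transition scores are the same in both runs and only the emission margins change, so in this toy setting the minority tag is recovered purely by making the emission scores more confident.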