issue with imbalanced data

kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.

MIT License

935 stars, 151 forks

Hi, no worries.

I think the issue of imbalanced data also happens in part-of-speech tagging where most words are nouns. I'm not sure if people handle this issue in POS tagging because in reality, nouns are indeed very common. What I would suggest is:

- Use a more powerful model for the emission scores, because I think that's where you can truly improve, as opposed to the transition probabilities, which are just categorical distributions; or
- Use a more complex transition distribution, in which case you must use another library, e.g. pytorch-struct, that offers you more freedom in specifying the emission/transition scores.
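
To illustrate the first suggestion, here is a minimal pure-Python sketch of Viterbi decoding for a linear-chain CRF. It is not pytorch-crf's implementation, and all scores below are made-up numbers chosen for illustration. The transition matrix is held fixed; only the emission scores change, which is roughly what a stronger emission model would give you for a token that really carries the rare tag.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag when the current tag is j.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical setup: tag 0 is the common tag, tag 1 is the rare tag.
# The transitions mildly prefer staying in the same tag.
transitions = [[0.5, -0.5],
               [-0.5, 0.5]]

# A weak emission model barely separates the tags at position 1 ...
weak_emissions = [[2.0, 0.0], [0.6, 0.4], [1.5, 0.0]]
# ... while a stronger one is confident that position 1 is the rare tag.
strong_emissions = [[2.0, 0.0], [-1.0, 2.0], [1.5, 0.0]]

weak_path = viterbi(weak_emissions, transitions)
strong_path = viterbi(strong_emissions, transitions)
print(weak_path)    # [0, 0, 0]
print(strong_path)  # [0, 1, 0]
```

With the same transitions, the rare tag is only recovered once the emission scores are confident enough to outweigh the cost of switching tags, which is why I'd look at the emission model first.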

issue with imbalanced data #58