scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

Add support for word embedding like features which are list of floats #39

Open napsternxg opened 8 years ago

napsternxg commented 8 years ago

The current API doesn't support features that are lists of floats, e.g. word embeddings. The current workaround is to write something like {"f0": 1.5, "f1": 1.6, "f2": -1.4} for a 3-dimensional embedding, which puts an extra burden on the user.

I propose a wrapper feature that lets users pass the word embedding list as a dictionary value, e.g. {"f": FloatFeatures([1.5, 1.6, -1.4])}. Internally, this would convert the floats into a representation consistent with the CRFSuite ItemSequence, using a consistent naming convention like "f:0", "f:1", "f:2".
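A minimal sketch of what the proposed wrapper could look like, written in plain Python so it runs without python-crfsuite installed. `FloatFeatures` is the name proposed above; it is not part of the python-crfsuite API, and `expand_features` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, Sequence, Union


class FloatFeatures:
    """Hypothetical wrapper marking a list of floats as one dense feature."""

    def __init__(self, values: Sequence[float]) -> None:
        self.values = [float(v) for v in values]


FeatureValue = Union[float, bool, str, FloatFeatures]


def expand_features(features: Dict[str, FeatureValue]) -> Dict[str, FeatureValue]:
    """Flatten FloatFeatures values into "name:index" keys; keep other entries."""
    flat: Dict[str, FeatureValue] = {}
    for name, value in features.items():
        if isinstance(value, FloatFeatures):
            # One key per embedding dimension, following the "f:0", "f:1" convention.
            for i, v in enumerate(value.values):
                flat[f"{name}:{i}"] = v
        else:
            flat[name] = value
    return flat


token = {"bias": 1.0, "f": FloatFeatures([1.5, 1.6, -1.4])}
flat_token = expand_features(token)
print(flat_token)
```

The flattened dict contains only string keys with float values, so it has the same shape as the manual {"f0": 1.5, ...} workaround and could be used wherever that dict is used today.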

napsternxg commented 8 years ago

@kmike and @tpeng do you want to have a look at it?

EmilStenstrom commented 6 years ago

Using word embeddings improves accuracy a lot. Having a supported way to include them in python-crfsuite would be wonderful.

muhnashX commented 5 years ago

@napsternxg any updates on feeding float vectors as features? I have the same situation: I want to use GloVe embeddings for an NER task using a CRF.

napsternxg commented 5 years ago

@muhnash0 I basically implemented the approach proposed in my comment manually. It was quite easy.

DomHudson commented 4 years ago

I don't think the proposed approach will work. CRFsuite does not support continuous features, so each unique key/value combination will be a unique feature. You have to discretize the continuous features with a technique like https://arxiv.org/abs/1711.01068

kmike commented 4 years ago

@DomHudson crfsuite does support continuous features
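To illustrate the distinction, here is a simplified pure-Python sketch, assuming the mapping described in the python-crfsuite documentation: a string value is folded into the attribute name with weight 1.0, while a numeric value keeps the key as the attribute name and becomes its weight. `to_attributes` is a hypothetical helper written for this example, not the library's implementation.

```python
from typing import Dict, Union


def to_attributes(features: Dict[str, Union[str, bool, float]]) -> Dict[str, float]:
    """Illustrative mapping from a feature dict to (attribute name -> weight)."""
    attrs: Dict[str, float] = {}
    for key, value in features.items():
        if isinstance(value, str):
            # Categorical value: each key/value pair is its own attribute.
            attrs[f"{key}:{value}"] = 1.0
        else:
            # Numeric (or bool) value: one attribute, the value scales it.
            attrs[key] = float(value)
    return attrs


a = to_attributes({"word": "cat", "f:0": 1.5})
b = to_attributes({"word": "dog", "f:0": -0.3})

# Attribute names the two tokens have in common.
shared = sorted(set(a) & set(b))
print(shared)
```

Under this assumption, the two tokens share the attribute "f:0" with different weights, whereas "word:cat" and "word:dog" are distinct attributes. Only string values multiply the number of features; float values do not.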

napsternxg commented 4 years ago

The approach I suggested is used in this tool I built:

https://github.com/napsternxg/TwitterNER