3kumar closed this issue 7 years ago
Hey @3kumar,
I can see three reasons for the slow training:
The default algorithm in python-crfsuite is L-BFGS; for this algorithm the benchmark table shows that the example task took ~5-10 minutes. Given your different features, output label set and dataset size, training for your task should be 10 to 100 times slower (a ballpark estimate); that looks consistent with what you're observing.
@kmike :
Thanks for your reply, and for pointing out the possible reasons.
On the link (http://www.chokkan.org/software/crfsuite/benchmark.html) with heading: "Training speed (for one iteration)" the last line says ".. The numbers of features for sparse and dense models are 450,000-600,000 and ca. 7,500,000, respectively."
My number of features is far lower than the ones specified here. Can you please suggest how I can optimize it?
@3kumar at the link "dense" or "sparse" means (somewhat confusingly) whether model parameters are stored in dense or sparse vectors/matrices; features are always sparse there.
To make training faster you could reduce the embedding dimension (e.g. to 100 instead of 500) and/or reduce the output label set size. SGD can be a bit faster to train than L-BFGS. I'm not sure there is anything else you can do to speed up the training.
There are also other toolkits that allow parallelized training; e.g. wapiti can use all cores when training with the L-BFGS algorithm, which can give a nice speedup on large servers/workstations (in my experience its single-core speed is slower, though). There is a Python wrapper: https://github.com/adsva/python-wapiti. Wapiti doesn't support float-valued features, so you would have to "split" each embedding dimension into several boolean features, e.g. 0 <= v < 0.1, 0.1 <= v < 0.2, etc.
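The binning idea above can be sketched as a small helper. The function and feature names here are hypothetical, chosen for illustration; adjust the bin width to whatever granularity your data needs:

```python
def bin_embedding(vector, dim_prefix="emb", bin_width=0.1):
    """Convert a float-valued embedding into boolean (present/absent)
    features by assigning each dimension's value to a fixed-width bin,
    e.g. value 0.23 in dimension 1 -> 'emb1:bin2' (0.2 <= v < 0.3)."""
    features = []
    for i, v in enumerate(vector):
        bin_index = int(v // bin_width)  # floor division picks the bin
        features.append("%s%d:bin%d" % (dim_prefix, i, bin_index))
    return features

print(bin_embedding([0.05, 0.23, -0.17]))
# ['emb0:bin0', 'emb1:bin2', 'emb2:bin-2']
```

Each original float dimension becomes one boolean feature per token, so the feature count grows only linearly with the embedding dimension, at the cost of losing the exact values.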
I am a bit late to the party, but here are my experiences. I have tested many different ways to use word features for NER tasks. The best performance (for my problems) came from running Brown clustering on a large corpus related to your problem, then adding the cluster information for every word (symbol) in your training data as a feature. This way every word results in just one extra feature (see a variant with more features below).
For example, suppose you are building an NER model and would like to tag the PERSON entity. You run Brown clustering on a huge text corpus and compute the Brown cluster for each word. Imagine the word "John" gets the Brown cluster "1011100". Whenever you see the word "John" in your training or testing data, just append that feature, e.g. "brown=1011100", to its features list.
Since the features are inherently hierarchical and it may be difficult to set the right number of clusters, I've found it beneficial to map a word to multiple features, from the coarsest cluster to the finest. The word "John" can then be mapped to three features like:
brown=10 brown=1011 brown=1011100
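The prefix mapping above can be sketched as follows; the prefix lengths (2, 4, full string) are illustrative, not a fixed recipe:

```python
def brown_features(cluster_bits, prefix_lengths=(2, 4, 7)):
    """Map a Brown cluster bit-string to hierarchical features,
    from the coarsest prefix to the finest. Prefixes longer than
    the cluster string are skipped."""
    return ["brown=" + cluster_bits[:n]
            for n in prefix_lengths
            if n <= len(cluster_bits)]

print(brown_features("1011100"))
# ['brown=10', 'brown=1011', 'brown=1011100']
```

These features would be appended to each token's feature list alongside the usual word/shape features before passing the sequences to the CRF trainer.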
For NER (I've built lots of such models), I've seen significant performance improvements from Brown cluster features. For my problems they were better and more practical than word2vec, GloVe and other variants.
@tpeng : I have 20K sentences with an average length of 30 words/sentence; each word is represented by 500 dimensions, and I have 100 output labels. Training takes 1 hour on these sentences.
Why is it taking so long? The crfsuite benchmark page (http://www.chokkan.org/software/crfsuite/benchmark.html) claims that training is fast even with many features.
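A back-of-envelope count suggests why this setup is heavier than the benchmark. With dense float features, every token touches every embedding dimension on every iteration. The numbers below come from the question; the counts are a rough sketch, not CRFsuite's exact cost model:

```python
sentences = 20_000
words_per_sentence = 30
embedding_dim = 500
labels = 100

tokens = sentences * words_per_sentence        # total tokens in the corpus
state_feature_evals = tokens * embedding_dim   # feature lookups per training pass
transition_params = labels * labels            # label-to-label transition weights

print(tokens)               # 600000
print(state_feature_evals)  # 300000000
print(transition_params)    # 10000
```

Roughly 300 million feature evaluations per pass, repeated for every L-BFGS iteration, plus a 100x100 transition matrix in the dynamic-programming step, is substantially more work per iteration than the benchmark's sparse boolean-feature setting.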