scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

Training takes too long #40

Closed 3kumar closed 7 years ago

3kumar commented 8 years ago

@tpeng : I have 20K sentences with an average length of 30 words/sentence, and each word is represented by 500 dimensions. I have 100 output labels. Training takes 1 hour on these sentences.

Why is it taking so long? The CRFsuite benchmark (http://www.chokkan.org/software/crfsuite/benchmark.html) claims that training is fast even with a very large number of features.

kmike commented 8 years ago

Hey @3kumar,

I can see three reasons for the slow training:

  1. Word embedding features are dense - all 500 dimensions are non-zero for every word. The CRFsuite results in the docs are for sparse features - even though there are 7,385,312 possible features, only a tiny fraction of them are active for each token. There should be only 10-50 active features per token in the CRFsuite example, whereas in your case it is at least 500. Training time scales linearly with the average number of active features (see the sketch after this list).
  2. Another possible reason is the size of your output label set. Training speed has an O(N^2) term in the number of output labels; the task in the docs had 22 output labels, so with 100 labels computing transition probabilities takes about (100/22)^2 ≈ 20x more time in your task.
  3. The dataset in the CRFsuite example looks about 2x smaller; training time should scale linearly with dataset size.
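To make the first point concrete, here is a rough sketch (feature names and values are made up, not from your data) of what a dense embedding token looks like next to a sparse categorical token in the python-crfsuite input format:

```python
# Dense word-embedding features: all 500 dimensions are non-zero, so CRFsuite
# touches ~500 weights per token on every iteration (0.1 is a placeholder value).
dense_token = {"dim%d" % i: 0.1 for i in range(500)}

# Sparse categorical features (as in the CRFsuite benchmark): only the handful
# of features that actually fire for this token are present.
sparse_token = [
    "word.lower=john",
    "word.istitle",
    "postag=NNP",
    "suffix3=ohn",
]
```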

The default algorithm in python-crfsuite is L-BFGS; the benchmark table shows the example task taking ~5-10 minutes with this algorithm. Given the different features, output label set and dataset size, training for your task should be 10 to 100 times slower (a ballpark estimate); that looks consistent with what you're observing.

3kumar commented 8 years ago

@kmike :

Thanks for your reply. And pointing out the possible reasons.

On the linked page (http://www.chokkan.org/software/crfsuite/benchmark.html), under the heading "Training speed (for one iteration)", the last line says: "... The numbers of features for sparse and dense models are 450,000-600,000 and ca. 7,500,000, respectively."

My number of features is far below the one specified there. Can you please suggest how I can optimize it?

kmike commented 8 years ago

@3kumar at that link "dense" and "sparse" mean (somewhat confusingly) whether model parameters are stored in dense or sparse vectors/matrices; the features themselves are always sparse there.

kmike commented 8 years ago

To make training faster you may reduce the embedding dimension (e.g. to 100 instead of 500) and/or reduce the output label set. SGD can also be a bit faster to train than L-BFGS. I'm not sure you can do anything else to speed up the training.
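For instance, switching the training algorithm is a one-line change; a minimal sketch (the toy sequences below just stand in for your real data):

```python
import pycrfsuite

# Only the algorithm choice changes: 'l2sgd' instead of the default 'lbfgs'.
# Features may be dicts of name -> float, as with embedding dimensions.
X_train = [[{"dim0": 0.1, "dim1": -0.3}, {"dim0": 0.7, "dim1": 0.2}]]
y_train = [["O", "PERSON"]]

trainer = pycrfsuite.Trainer(algorithm="l2sgd", verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Algorithm-specific parameters can be inspected with get_params() and tuned.
trainer.set_params({"max_iterations": 50})
trainer.train("model.crfsuite")
```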

There are also other toolkits which allow parallelizing training; e.g. wapiti can use all cores when training with the L-BFGS algorithm, which can give a nice speedup on large servers/workstations (in my experience its single-core speed is slower though). There is a Python wrapper: https://github.com/adsva/python-wapiti. Wapiti doesn't support float-valued features, so you would have to "split" each embedding dimension into several boolean features, e.g. 0 <= v < 0.1, 0.1 <= v < 0.2, etc.
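A rough sketch of that splitting idea (the bucket width and feature names are illustrative, not anything wapiti prescribes):

```python
# Map a float embedding value to a single boolean bucket feature.
def bucketize(dim_index, value, step=0.1):
    low = (value // step) * step
    return "dim%d_in_%.1f_%.1f" % (dim_index, low, low + step)

# e.g. dimension 0 with value 0.27 -> 'dim0_in_0.2_0.3'
features = [bucketize(i, v) for i, v in enumerate([0.27, -0.05, 0.91])]
```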

usptact commented 8 years ago

I am a bit late to the party, but here are my experiences. I have tested many different ways to use word features for NER tasks. The best performance (for my problems) came from running Brown clustering on some large corpus related to the problem, then adding the cluster information for every word (symbol) in the training data as a feature. This way every word results in just one extra feature (see a variant with more features below).

For example, suppose you are building a NER model and would like to tag the PERSON entity. You run Brown clustering on a huge text corpus and compute a Brown cluster for each word. Imagine the word "John" gets the Brown cluster "1011100". Whenever you see the word "John" in your training or testing data, just append this feature, e.g. "brown=1011100", to its feature list.

Since the clusters are inherently hierarchical and it may be difficult to set the right number of clusters, I've found it beneficial to map a word to multiple features - from the coarsest cluster to the finest. The word "John" can then be mapped to three features: brown=10 brown=1011 brown=1011100
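A minimal sketch of such a mapping, assuming you already have a word-to-bitstring table from a Brown clustering run (the table and prefix lengths below are made up for illustration):

```python
# Hypothetical word -> Brown cluster bit-string table.
brown_clusters = {"John": "1011100", "Mary": "1011101", "runs": "0110"}

def brown_features(word, prefix_lengths=(2, 4, 7)):
    """Coarse-to-fine Brown cluster prefix features for one word."""
    path = brown_clusters.get(word)
    if path is None:
        return []
    return ["brown=%s" % path[:n] for n in prefix_lengths if len(path) >= n]

# brown_features("John") -> ['brown=10', 'brown=1011', 'brown=1011100']
```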

For NER (I've built lots of NER models), I've seen significant performance improvements from Brown cluster features. These features were better and more practical than word2vec, GloVe and other variants.