jmschrei / yahmm

Yet Another Hidden Markov Model repository.
MIT License

summary statistics for training #22

Closed jmschrei closed 10 years ago

jmschrei commented 10 years ago

The current training involves creating a matrix with all observed symbols on one axis and the character-generating states on the other. This can be extremely slow with very long sequences, many sequences, or both. It could be made faster if, for every sequence, summary statistics were calculated for each state, with each observation weighted by how likely that state is to have generated it. Then, once all sequences are analyzed, the distributions would be trained on these summary statistics. The catch is that each distribution class now has to have a summary-statistics method that returns the summary statistic, and a separate train method that takes in a matrix of summary statistics and trains on them. This can't be implemented until all distributions have an appropriate summary-statistic method. I'm going to take a look at this next week.
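
Roughly the kind of reduction I have in mind, as a toy NumPy illustration (not yahmm code; the per-symbol state weights here are random placeholders standing in for forward-backward responsibilities):

```python
import numpy as np

np.random.seed(0)
symbols = np.random.normal(10.0, 1.0, size=1_000_000)       # one long sequence
resp = np.random.dirichlet([1.0, 1.0], size=symbols.size)   # toy per-symbol weights for 2 states

# Instead of keeping a full symbols-by-states weight matrix around for the
# whole training pass, each state is folded down to three running sums.
stats = np.zeros((2, 3))   # per state: [sum w, sum w*x, sum w*x^2]
for k in range(2):
    w = resp[:, k]
    stats[k] = [w.sum(), (w * symbols).sum(), (w * symbols ** 2).sum()]

print(stats)   # enough to later recover a weighted mean and variance per state
```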

jmschrei commented 10 years ago

I began implementing this by giving distributions a summarize and a from_summaries method. The summarize method reduces an array of items and an array of weights down to the summary statistics needed for training, and stores them in a new summaries attribute. summarize is expected to be called on multiple sets of data before from_summaries is called, which reduces those accumulated summary statistics into a new distribution.
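
A minimal sketch of the pattern using a toy class rather than the actual yahmm implementation (everything except the summarize/from_summaries/summaries names is made up for illustration):

```python
import numpy as np

class ToyNormal:
    """Illustrative stand-in for a normal distribution with summary training."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu, self.sigma = mu, sigma
        self.summaries = np.zeros(3)   # [sum of w, sum of w*x, sum of w*x^2]

    def summarize(self, items, weights):
        x = np.asarray(items, dtype=float)
        w = np.asarray(weights, dtype=float)
        self.summaries += [w.sum(), (w * x).sum(), (w * x * x).sum()]

    def from_summaries(self):
        n, sx, sxx = self.summaries
        self.mu = sx / n
        self.sigma = np.sqrt(sxx / n - self.mu ** 2)   # weighted MLE variance
        self.summaries[:] = 0                          # reset for the next round

# summarize is called once per batch of data, then reduced at the end:
d = ToyNormal()
for chunk in np.split(np.random.normal(10, 1, 9000), 3):
    d.summarize(chunk, np.ones(chunk.size))
d.from_summaries()
print(d.mu, d.sigma)
```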

I successfully implemented this for NormalDistribution; on a test case, training that used to take slightly over 10 seconds now takes 6.5 seconds. I didn't quantify the reduction in memory.

However, the end parameters were *slightly* different. Using the traditional from_sample method the distribution is NormalDistribution( 10.0031358892, 1.001060889 ); using the summary method I get NormalDistribution( 10.0031358893, 1.00106088899 ). I'm going to assume this tiny difference is insignificant. @adamnovak let me know if you disagree.

adamnovak commented 10 years ago

Looks insignificant to me. Probably just comes from changing the order in which you're summing things up.
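
For example, the classic case where regrouping the same additions moves the last digit:

```python
print(0.1 + 0.2 + 0.3)   # 0.6000000000000001
print(0.3 + 0.2 + 0.1)   # 0.6
```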

jmschrei commented 10 years ago

This now works for all distributions, and is always faster than the alternative in a timing test involving 20-100 sequences of 1000-10000 symbols each (not to say it isn't faster in other cases, just that I haven't done extensive timing yet). The largest speed gain is in NormalDistribution, which is around 53% faster; the smallest is in GammaDistribution, at around 5%, likely due to the iterative process by which that distribution is updated.
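
The timing comparison is roughly shaped like this (toy data rather than the real test set, and the from_sample call pattern is assumed from this thread, so treat it as a sketch):

```python
import timeit
import numpy as np
from yahmm import NormalDistribution   # assumed import path

np.random.seed(1)
sequences = [np.random.normal(10, 1, np.random.randint(1000, 10000)) for _ in range(50)]
weights = [np.ones(len(s)) for s in sequences]

def batch():
    # full-batch: concatenate everything and train in one shot
    d = NormalDistribution(0, 1)
    d.from_sample(np.concatenate(sequences), np.concatenate(weights))

def summarized():
    # summary statistics: fold in one sequence at a time, then reduce
    d = NormalDistribution(0, 1)
    for s, w in zip(sequences, weights):
        d.summarize(s, w)
    d.from_summaries()

print(timeit.timeit(batch, number=10))
print(timeit.timeit(summarized, number=10))
```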

I'm leaving in the Distribution.from_sample method because it is useful. I also need to figure out what to do with Kernel densities, other than the trivial implementation. Maybe something involving splines.

This has greatly expanded the various distribution classes, and yahmm.pyx is now 3,956 lines of code. I looked into whether it would be trivial to move the distributions into their own file, but I'm not sure how to import from one Cython file into another, short of replicating the code in __init__.py.
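
For reference, the standard Cython route for this appears to be a .pxd declaration file plus cimport, with each .pyx compiled as its own extension in setup.py; a hypothetical split (file names and attributes below are placeholders, not yahmm's actual classes):

```cython
# distributions.pxd -- C-level declarations that other modules can cimport
cdef class Distribution:
    cdef public str name
    cdef public list parameters

# distributions.pyx -- the implementation (attributes are declared only in the .pxd)
cdef class Distribution:
    def __init__(self, name, parameters):
        self.name = name
        self.parameters = parameters

# yahmm.pyx -- use the class from the sibling module at the C level
from distributions cimport Distribution
```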

jmschrei commented 10 years ago

DiscreteDistribution is still giving me some trouble. The code is ignoring the large number of small weights, causing a tiny change in the probabilities. For example, training the two-state HMM for CG island modelling on 20 sequences of variable length between 4000 and 4100 nucleotides:

Using from_sample:

DiscreteDistribution({'A': 0.047921607580379277, 'C': 0.45340901865545663, 'T': 0.04776502803470474, 'G': 0.45090434572945498})
DiscreteDistribution({'A': 0.25493418679201219, 'C': 0.24624926464109514, 'T': 0.25426667029062183, 'G': 0.24454987827626631})

Using from_summaries:

DiscreteDistribution({'A': 0.048453983670075404, 'C': 0.45221035118601211, 'T': 0.047794236257524748, 'G': 0.45154142888638926})
DiscreteDistribution({'A': 0.25489777981982931, 'C': 0.24666792598018389, 'T': 0.25385080975970814, 'G': 0.24458348444027372})

I'm not sure what the most appropriate way to continue is. If I were to sort by weight, that wouldn't exactly work across separate summaries.

jmschrei commented 10 years ago

I should also mention that the from_summaries training showed a slightly larger training improvement than the from_sample method.

jmschrei commented 10 years ago

This issue has been resolved. I was normalizing the weights on a per-sequence basis, when they need to be normalized over the entire sample. This misled me when I was writing tests, because I was not using weights and so got the same result either way, which is the expected behavior in that case.
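
A toy illustration of the difference (not yahmm code): normalizing each sequence's weights to sum to one erases how much total weight the state actually received from that sequence.

```python
import numpy as np
from collections import Counter

# Two sequences: the state is strongly responsible for the first, barely for the second.
seqs    = [list("AAAC"), list("CCCG")]
weights = [np.full(4, 8.0), np.full(4, 1.0)]

def discrete_estimate(seqs, weights, per_sequence):
    counts = Counter()
    for s, w in zip(seqs, weights):
        if per_sequence:
            w = w / w.sum()            # the bug: every sequence ends up counting equally
        for symbol, wi in zip(s, w):
            counts[symbol] += wi
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

print(discrete_estimate(seqs, weights, per_sequence=True))   # skewed toward the lightly-weighted sequence
print(discrete_estimate(seqs, weights, per_sequence=False))  # weights effectively normalized over the whole sample
```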

Summary statistic training has been implemented and tested for every distribution. The nosetests have been updated to include verification of this on every distribution, with some being more in-depth than others.
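
The per-distribution checks are roughly of this shape (the parameters attribute and the from_sample call pattern are assumptions on my part, so adjust to the real API):

```python
import numpy as np
from yahmm import NormalDistribution   # assumed import path

def test_normal_summaries_match_batch():
    data = np.random.normal(10, 1, 5000)
    weights = np.ones(data.size)

    batch = NormalDistribution(0, 1)
    batch.from_sample(data, weights)            # full-batch training

    summed = NormalDistribution(0, 1)
    for chunk in np.split(data, 5):             # same data, fed in five pieces
        summed.summarize(chunk, np.ones(chunk.size))
    summed.from_summaries()

    np.testing.assert_array_almost_equal(batch.parameters, summed.parameters)
```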

I am going to run a few speed tests comparing this training against full-batch training on varying sample sizes. If summary statistics are always just as good or better while producing the same result, I am going to remove full batch as a training option.