implement hidden markov models in Cython from scratch (so can remove the GHMM standard package from Sage)

williamstein commented 14 years ago

NOTE: After applying this patch(es), the following standard spkg can be completely deleted: ghmm-20080813.p0.spkg

This patch provides a complete reimplementation of Hidden Markov Models from scratch in Sage. It is now doctested and the tests should pass on all standard Sage platforms. Note that that HMM code in sage-4.3.4 and earlier is NOT tested at all, can easily segfault in all kinds of cases, leaks huge amounts of memory, etc. The new code shouldn't have those defects. Plus we get to delete a standard spkg. Moreover, this new code is much easier to modify and experiment with. Performance is generally similar, sometimes better, and sometimes worse, than the old code.

CC: @jasongrout @sagetrac-mhampton

Component: statistics

Author: William Stein

Reviewer: Marshall Hampton, Tom Boothby, Jason Grout

Merged: sage-4.4.alpha0

Issue created by migration from https://trac.sagemath.org/ticket/8547

williamstein commented 14 years ago

Description changed:

--- 
+++ 
@@ -1 +1,5 @@
+NOTE: After applying this patch(es), the following standard spkg can be completely deleted:
+    ghmm-20080813.p0.spkg

+This patch provides a complete reimplementation of Hidden Markov Models from scratch in Sage.  It is now doctested and the tests should pass on all standard Sage platforms.   Note that that HMM code in sage-4.3.4 and earlier is *NOT* tested at all, can easily segfault in all kinds of cases, leaks huge amounts of memory, etc.  The new code shouldn't have those defects.  Plus we get to delete a standard spkg.   Moreover, this new code is much easier to modify and experiment with.  Performance is generally similar, sometimes better, and sometimes worse, than the old code. 
+

williamstein commented 14 years ago

apply this. Then look at devel/sage/sage/stats/intlist.pyx, devel/sage/sage/stats/hmm (which is entirely new code), and look at the few minor bug fixes to finance/time_series.pyx

williamstein commented 14 years ago

comment:2

Attachment: trac_8547.patch.gz

williamstein commented 14 years ago

comment:3

This is now ready for review. Some notes:

Don't bother looking at the diff, except for time_series.pyx and module_list.py. Just look at at the files in stats/hmm, which are all new, and stats/intlist.pyx.
There are a number of very small API changes (for the better). This technically violates our DeprecationPolicy. However, frankly, the current code was -- upon inspection -- so bad, that there is no way anybody was actually using it for anything serious. (As was evidenced by the response on sage-devel.) And the API is only a tiny bit changed now.
I do have a long list of possible followup projects. However, I wanted to make a first solid release that at least has enough to replace the existing HMM functionality in Sage, before adding lots of new stuff.

boothby commented 14 years ago

comment:7

This patch touches matrix2.pyx which results in failures of a number of doctests. Example:

sage -t  devel/sage/sage/matrix/matrix2.pyx
**********************************************************************
File "/mnt/usb1/scratch/boothby/sage-4.3.4/devel/sage-main/sage/matrix/matrix2.pyx", line 3142:
    sage: A.restrict(W2, check=True)
Expected:
    Traceback (most recent call last):
    ...
    ArithmeticError: subspace is not invariant under matrix
Got:
    Traceback (most recent call last):
      File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_48[10]>", line 1, in <module>
        A.restrict(W2, check=True)###line 3142:
    sage: A.restrict(W2, check=True)
      File "matrix2.pyx", line 3167, in sage.matrix.matrix2.Matrix.restrict (sage/matrix/matrix2.c:19470)
        raise ArithmeticError, "subspace is not invariant under matrix (%s)"%msg
    ArithmeticError: subspace is not invariant under matrix (vector is not in free module)
**********************************************************************

That should be easy to fix.

williamstein commented 14 years ago

this completely replaces the previous patch

boothby commented 14 years ago

comment:8

Attachment: trac-8547-take2.patch.gz

Ok, doctests run clean now. I've tested the code out and read it over, and it works for me. I'm a little cautious giving this a positive review because I know very little about this area of math. However, given that it's replacing something that barely worked and used memory like toilet paper, I'm willing to throw caution to the wind if mhampton or jason are willing to sign off on the idea.

5d2aaf09-c963-473a-bf79-1f742a72700f commented 14 years ago

comment:9

I've been wanting to review this but I just haven't had the time. I might this weekend, but its unclear.

jasongrout commented 14 years ago

comment:10

Some notes from reading through the code looking at stylistic issues. I really wish we had line-by-line commenting like rietveld for this sort of thing.

Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel was that every function (including cdef functions) should have doctests (but, of course, these doctests would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function). Some functions (_baum_welch_gamma, for example) don't even have docstrings.
I'm curious why IntList.!getitem! does not use the sage.misc.misc_c.normalize_index function to deal with slices. How much of a performance penalty is there? Can this "python semantics" part be extracted out to be a more general cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement !getitem!? Also, as a future enhancement, it doesn't seem that much harder for !setitem! to also support slices. At the very least, the doctests for normalize_index probably ought to be run on the !getitem! function, as they exercise a number of corner cases for the python semantics.
IntList.sum() does not have a doctest for the overflow case

williamstein commented 14 years ago

comment:11

Replying to @jasongrout:

Some notes from reading through the code looking at stylistic issues. I really wish we had line-by-line commenting like rietveld for this sort of thing.

Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel was that every function (including cdef functions) should have doctests (but, of course, these doctests would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function).

We definitely do not require doctests for cdef'd methods. The requirement is that "sage -coverage" returns a score of 100%. This is clearly stated in the patch reviewers guide. I did seriously consider them in this case, but literally every single one of these doctests would just be a literal exact copy of a doctest from elsewhere (!) but with # indirect doctest next to it. There's just no value in that. I could also make the methods cpdef, but that does incur a performance penalty -- in this case, it would be huge (which is totally unacceptable).

Some functions (_baum_welch_gamma, for example) don't even have docstrings.

I do think that all cdef'd methods should have docstrings, and will add docstrings to any that don't have them (in a part 2 patch).

I'm curious why IntList.!getitem! does not use the sage.misc.misc_c.normalize_index function to deal with slices. How much of a performance penalty is there?

I'll switch to using normalize_index, since I'm not concerned with performance for list indexing, since everywhere that performance matters, I'm directly accessing the memory buffer (which is the only way to really compete with tightly coded C libraries). So my updated patch will change to use normalize_index, unless there is a serious problem with doing so.

Can this "python semantics" part be extracted out to be a more general cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement !getitem!?

Perhaps. I'm certainly not doing so for this patch.

Also, as a future enhancement, it doesn't seem that much harder for !setitem! to also support slices. At the very least, the doctests for normalize_index probably ought to be run on the !getitem! function, as they exercise a number of corner cases for the python semantics.

Fortunately this won't be necessary since I'm switching to it.

IntList.sum() does not have a doctest for the overflow case

I can add that. I like testing all corner cases.

Thanks for your helpful review of style...

William

5d2aaf09-c963-473a-bf79-1f742a72700f commented 14 years ago

comment:12

I got one numerical noise doctest failure on OS X 10.5.8:

sage -t devel/sage-t2/sage/stats/hmm/util.pyx

File "/Users/mh/sagestuff/sage-4.3.5/devel/sage-t2/sage/stats/hmm/util.pyx", line 55: sage: T.sum() Expected: 1.0
Got: 0.99999999999999978

I used this code in class yesterday (see http://wiki.sagemath.org/interact/stats#HiddenMarkovModel.3ATheOccasionallyDishonestCasino) and for that limited purpose it seemed fine.

It would be nice to have the forward and backward algorithms more exposed, but I don't think that needs to go into this ticket unless you feel like it.

Apart from the noise issue I think can give this a positive review.

williamstein commented 14 years ago

comment:13

I've attached a part 2 patch. I made sure all cdef's methods have docstrings and also that doctests are properly sphinx formated (some weren't since they were copied from the old code).
Jason said above that IntList.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
I read the source code for sage.misc.misc_c.normalize_index and cannot bring myself to use that in this situation. That function actually returns a Python list of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:

sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5),10^5)')   # slow because constructions a big python list
125 loops, best of 3: 2.17 ms per loop
sage: a = stats.IntList([1..10^5])
sage: timeit('a[1:10^5]')                       # slice is just a memcpy
625 loops, best of 3: 48.4 µs per loop
sage: 2.17/0.0484
44.8347107438017

Here's an example with a step:

sage: a = stats.IntList([1..10^5])
sage: timeit('a[1:10^5:2]')
625 loops, best of 3: 92.2 µs per loop
sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5,2),10^5)')
625 loops, best of 3: 772 µs per loop

and that 772 microseconds is before we do the actual iteration through the returned list of python ints, convert them to c ints, copy stuff around in memory, etc.

This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.

-- William

williamstein commented 14 years ago

part 2; apply this and the previous patch

jasongrout commented 14 years ago

comment:14

Attachment: trac_8547-take2-part2.patch.gz

Replying to @williamstein:

Jason said above that IntList.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.

I meant that the doctest looks like this:

Note that there can be overflow, since the entries are C ints:: 
    sage: a = stats.IntList([2^30,2^30]); a 
    [1073741824, 1073741824]

That's it. There's no test there; you're just creating a list, not summing anything.

I read the source code for sage.misc.misc_c.normalize_index and cannot bring myself to use that in this situation. That function actually returns a Python list of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:

This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.

I agree.

williamstein commented 14 years ago

comment:15

Replying to @jasongrout:

Replying to @williamstein:

Jason said above that IntList.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.

I meant that the doctest looks like this:
Note that there can be overflow, since the entries are C ints:: 
    sage: a = stats.IntList([2^30,2^30]); a 
    [1073741824, 1073741824] 
That's it. There's no test there; you're just creating a list, not summing anything.

Thanks for the clarification -- I was being dense; I've put up a part 3 that addresses this.

So, can I get a positive review now?

williamstein commented 14 years ago

Attachment: trac_8547-take2-part3.patch.gz

5d2aaf09-c963-473a-bf79-1f742a72700f commented 14 years ago

comment:16

OK, positive review. I think this and basic_stats should be added to the reference manual, but that can be another ticket I guess.

jhpalmieri commented 14 years ago

Author: William Stein

jhpalmieri commented 14 years ago

Reviewer: Marshall Hampton, Tom Boothby, Jason Grout

jhpalmieri commented 14 years ago

Merged: sage-4.4.alpha0

jhpalmieri commented 14 years ago

comment:18

Merged in 4.4.alpha0:

trac-8547-take2.patch
trac_8547-take2-part2.patch
trac_8547-take2-part3.patch

sagemath / sage

implement hidden markov models in Cython from scratch (so can remove the GHMM standard package from Sage) #8547