grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/
Other
96 stars 19 forks source link

Bogus conversion #2

Closed FlorianWilhelm closed 7 years ago

FlorianWilhelm commented 7 years ago

Looking at the source code I found something peculiar:

def push(self, value):
        """Add `value` to the Statistics summary."""
        values = float(value)
        self._min = value if self._min is None else min(self._min, value)
        self._max = value if self._max is None else max(self._max, value)
        delta = value - self._eta
        delta_n = delta / (self._count + 1)
        delta_n2 = delta_n * delta_n
        term = delta * delta_n * self._count
        self._count += 1

You convert value to float but save it in values (tailing s!). In the rest of your function you are using only value. Is this an error?

grantjenks commented 7 years ago

Yeah, that's a mistake. Introduced here: https://github.com/grantjenks/python_runstats/commit/06ec42c09e7be8382abcb8b52431dcb0c7717c64 The conversion shouldn't be necessary but I added it out of paranoia. The calculation of delta should coerce to float because self._eta is initialized as 0.0.

grantjenks commented 7 years ago

Fixed at 6d1be9d and deployed at 0.6.0

FlorianWilhelm commented 7 years ago

@grantjenks Thanks for the fast fix. Since your package is only working with float it should also be possible to convert it with cython quite easily right? What do you think about it? It should improve the performance quite a lot. What do you also think about a way to set the parameters of the Statistics object with externally provided count, eta etc. parameters like in Scikit-Learn's set_params and get_params methods. That way one could fetch and store these parameters to later on initialize a new Statistics object. Another idea would be to use @property for methods like mean, variance etc.

grantjenks commented 7 years ago

Re: cython - Sure, I'm open to it. I think native types would make it much faster. Do you have experience distributing packages that support Cython-izing?

Re: get_params and set_params - Why not just use pickle and copy?

Re: @property for mean, variance, etc. - Most of these summary statistics require a calculation. Why hide that behind a property to save two characters?

FlorianWilhelm commented 7 years ago

cython: Haven't done it so far myself but I know some people who did it. As far as I know if setup.py is configured so that the sources are build you can just use python setup.py bdist_wheel (having wheel installed) to generate binary sources that include the binary blobs.

get/set_params: A pickle is something large and really complicated :-) The whole state of your object is defined by only 7 float variables, so why save more than that? Let's say I use your library to generate statistics from some data stream up to a certain point in time, midnight. I want to store that information in a database and after a few hours I want to continue from midnight and adding new data that came in. If I would just save count, eta... etc in the database I could reinitialize the Statistics object and could go on. Of course I can do that right now too but accessing private member variables is not nice.

property: Okay, that is purely a matter of taste. Actually the main computation is already done in the push method. There is only a small and really fast calculations to get the actual statistics afterwards. To convey that I like properties more. If it really was a long running computation I would agree with using methods instead. But as I said, that surely is a matter of taste.

grantjenks commented 7 years ago

Re: cython - done. Deployed at version 1.0. There's an interesting permutation of install scenarios. I released the package from my macbook using "sdist bdist_wheel --universal" so that included the binary I built locally and source. On PyPI you'll see the Mac OSX binary alongside the source. So when you install, there are several options:

I added tests/benchmark.py to observe the Cython-impact. I also added types. Example benchmark output:

$ python benchmark.py 
core.Statistics: 0.0214369297028
fast.Statistics: 0.000623941421509
  Stats Speedup: 33.36x faster
core.Regression: 0.0528829097748
fast.Regression: 0.00235199928284
   Regr Speedup: 21.48x faster

So Statistics objects are 30x faster and Regression objects are 20x faster.

Re: get_params/set_params - pull request welcome. In fact, Cython broke pickle support so all the more reason.

grantjenks commented 7 years ago

@FlorianWilhelm reply in issues #3, #4 or #5. The Bogus Conversion is fixed.