Closed FlorianWilhelm closed 7 years ago
Yeah, that's a mistake. Introduced here: https://github.com/grantjenks/python_runstats/commit/06ec42c09e7be8382abcb8b52431dcb0c7717c64 The conversion shouldn't be necessary but I added it out of paranoia. The calculation of delta
should coerce to float because self._eta
is initialized as 0.0
.
Fixed at 6d1be9d and deployed at 0.6.0
@grantjenks Thanks for the fast fix. Since your package is only working with float it should also be possible to convert it with cython quite easily right? What do you think about it? It should improve the performance quite a lot.
What do you also think about a way to set the parameters of the Statistics
object with externally provided count
, eta
etc. parameters like in Scikit-Learn's set_params
and get_params
methods. That way one could fetch and store these parameters to later on initialize a new Statistics
object.
Another idea would be to use @property
for methods like mean
, variance
etc.
Re: cython - Sure, I'm open to it. I think native types would make it much faster. Do you have experience distributing packages that support Cython-izing?
Re: get_params and set_params - Why not just use pickle and copy?
Re: @property for mean
, variance
, etc. - Most of these summary statistics require a calculation. Why hide that behind a property to save two characters?
cython: Haven't done it so far myself but I know some people who did it. As far as I know if setup.py is configured so that the sources are build you can just use python setup.py bdist_wheel
(having wheel installed) to generate binary sources that include the binary blobs.
get/set_params: A pickle is something large and really complicated :-) The whole state of your object is defined by only 7 float variables, so why save more than that? Let's say I use your library to generate statistics from some data stream up to a certain point in time, midnight. I want to store that information in a database and after a few hours I want to continue from midnight and adding new data that came in. If I would just save count
, eta
... etc in the database I could reinitialize the Statistics
object and could go on. Of course I can do that right now too but accessing private member variables is not nice.
property: Okay, that is purely a matter of taste. Actually the main computation is already done in the push
method. There is only a small and really fast calculations to get the actual statistics afterwards. To convey that I like properties more. If it really was a long running computation I would agree with using methods instead. But as I said, that surely is a matter of taste.
Re: cython - done. Deployed at version 1.0. There's an interesting permutation of install scenarios. I released the package from my macbook using "sdist bdist_wheel --universal" so that included the binary I built locally and source. On PyPI you'll see the Mac OSX binary alongside the source. So when you install, there are several options:
I added tests/benchmark.py
to observe the Cython-impact. I also added types. Example benchmark output:
$ python benchmark.py
core.Statistics: 0.0214369297028
fast.Statistics: 0.000623941421509
Stats Speedup: 33.36x faster
core.Regression: 0.0528829097748
fast.Regression: 0.00235199928284
Regr Speedup: 21.48x faster
So Statistics
objects are 30x faster and Regression objects are 20x faster.
Re: get_params/set_params - pull request welcome. In fact, Cython broke pickle support so all the more reason.
@FlorianWilhelm reply in issues #3, #4 or #5. The Bogus Conversion is fixed.
Looking at the source code I found something peculiar:
You convert
value
to float but save it invalues
(tailing s!). In the rest of your function you are using onlyvalue
. Is this an error?