CarstVaartjes opened this issue 10 years ago
thanks @CarstVaartjes
klib is shipped with pandas, so it should be straightforward to update I think; it's just a couple of files.
You are welcome to take a shot at it, otherwise I'll try to get to it for 0.15.1.
We'll take a shot tomorrow afternoon; there were quite a few changes, and I think Wes also added quite a bit of his own inline code to the current khash.c, but with some diff comparisons we hope to be okay. And the performance increase is significant, so it would be nice to make it in time for 0.15.0.
ok, if you could get it done in the next day or 2 and everything passes, then we could do it (run a vbench as well)
@jreback we would love to make it happen; unfortunately, with my current knowledge of C, Python and pandas I cannot get python setup.py build_ext --inplace
to execute correctly with the new version of klib.
The first error out of many others:
In file included from pandas/parser.c:360:
In file included from pandas/src/parser/tokenizer.h:36:
pandas/src/klib/khash.h:187:21: error: redefinition of '__ac_HASH_UPPER'
static const double __ac_HASH_UPPER = 0.77;
In case you would like to have a look, you can find:
Everything we made is based on the pandas team's work; any tips or suggestions are welcome.
I am sorry to say that this won't be possible in the coming days.
Thanks
@FrancescElies ok, no problem. Maybe we can do this for 0.15.1.
something that might help is seeing what was originally changed, e.g.
git blame pandas/src/klib/khash.h
shows what/when things were changed (e.g. like a diff from the original file).
So I am not sure what @wesm changed, but maybe you can figure it out from this.
@jreback thanks for the info. Actually, this is exactly the way we adapted klib 0.2.8, which already seems to work as an independent package, but when I bring these changes into pandas there seem to be problems with other files (like parser.c), files that we don't really use in the other project.
I suppose I need to take a deeper look at the other files in pandas to understand the interconnections between them before I can make it work properly.
Thanks again for your quick answer.
@FrancescElies np. I think it's appropriate to keep it as an inline (vendored) package, as we don't want to add an explicit dependency (the license is compatible and it works well that way).
That said, please have a closer look! I imagine the perf boost IS worth it.
and FYI, this library is also vastly faster than almost anything else,
e.g. see commentary here: http://article.gmane.org/gmane.comp.python.numeric.general/58724/match=unique+performance
essentially numpy should use klib (or something similar) and use hash tables to do unique.
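To make that remark concrete, here is a rough illustration of the difference being referred to (illustrative only; whether numpy adopts this is a separate question, and exact timings depend heavily on the data):

import timeit
import numpy as np
import pandas as pd

np.random.seed(0)
values = np.random.randint(0, 1000, size=1000000)

# np.unique sorts the input (O(n log n)) and returns sorted uniques,
# while pd.unique goes through a klib hash table (roughly O(n)) and
# returns the uniques in first-seen order
print("np.unique:", timeit.timeit(lambda: np.unique(values), number=10))
print("pd.unique:", timeit.timeit(lambda: pd.unique(values), number=10))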
I might have time to look into this at the weekend as this definitely looks interesting, but I can't guarantee anything.
go @immerrr !
Very interesting, it is good to know that quadratic probing has a practical impact in many cases.
I can try to look at the compilation issues if you can't sort them out, let me know.
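For context, klib 0.2.8 switched (as far as I can tell from its changelog) to triangular "quadratic" probing over a power-of-two table, i.e. offsets 1, 3, 6, 10, ... from the home bucket. A toy sketch of the bucket sequence, not the khash code itself:

def quadratic_probe(h, mask, n=8):
    # visit buckets h, h+1, h+3, h+6, ... (wrapped by mask); with a
    # power-of-two table this sequence eventually touches every bucket
    i, step, seq = h & mask, 0, []
    for _ in range(n):
        seq.append(i)
        step += 1
        i = (i + step) & mask
    return seq

print(quadratic_probe(5, 63))   # [5, 6, 8, 11, 15, 20, 26, 33]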
I've integrated the code from 0.2.8 in #8547, but for some reason the vbench results are not as astounding: the new version is systematically faster, especially in all-unique cases, but the increase is nowhere near 60% (https://gist.github.com/immerrr/98a02338df8cbabb3a86).
I did refactor the benchmarks somewhat and made string values 10 chars each to be closer to real-life conditions, so that may have influenced the result.
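The string part of such a benchmark can be set up roughly like this (a sketch with hypothetical sizes; rands_array generates random fixed-length strings, as used elsewhere in this thread):

import numpy as np
import pandas as pd

np.random.seed(0)
# 10-character random strings, closer to real-life keys than 2-character ones
strs_unique = pd.util.testing.rands_array(10, 100000)              # all-unique case
strs_dupes = strs_unique.take(np.random.randint(0, 1000, 100000))  # heavily duplicated case

# time pd.factorize(strs_unique) and pd.factorize(strs_dupes) before and
# after the klib upgrade, e.g. with %timeit in IPython, and compare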
My example might have been a bit skewed because of the limited number of distinct values and the short strings, I guess. Also, the big increase was only for strings in my example; the rest was 10-20%, but that's still decent, and it's good to see that it's still 10%ish in your gist :) P.S. great work!
I wonder why your example compares StringHashTable(len(x)).factorize(x) to pd.factorize(x); the latter is a lot more involved (a rough sketch of the extra steps follows the timings below):
In [5]: np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
In [6]: %timeit pd.factorize(strs)
1000 loops, best of 3: 285 µs per loop
In [7]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
...:
10000 loops, best of 3: 171 µs per loop
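For what it's worth, here is a rough, purely illustrative sketch of the extra work pd.factorize does on top of the bare hash table call (missing-value handling via a sentinel, optional sorting and relabelling of the uniques); this is not the actual pandas code:

import numpy as np

def factorize_sketch(values, na_sentinel=-1, sort=False):
    # the hash table part: map each value to the position of its first occurrence
    seen, uniques = {}, []
    labels = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values):
        if v is None or v != v:          # missing values get a sentinel label
            labels[i] = na_sentinel
            continue
        if v not in seen:
            seen[v] = len(uniques)
            uniques.append(v)
        labels[i] = seen[v]
    if sort:                             # optional: sort uniques and relabel
        order = np.argsort(uniques)
        remap = np.empty(len(order), dtype=np.intp)
        remap[order] = np.arange(len(order))
        labels = np.where(labels == na_sentinel, na_sentinel, remap[labels])
        uniques = [uniques[i] for i in order]
    return labels, np.asarray(uniques, dtype=object)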
As for the actual performance, I'm struggling to get quadratic probing on par with the old version: in the short-string scenario it's about 5 times slower, valgrind shows that there's a lot more probing going on, and the strcmp calls are rather expensive:
In [2]: strs = pd.util.testing.rands_array(2, 10000)
In [3]: strs[:5]
Out[3]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)
In [4]: %%timeit
...: ht = pd.hashtable.StringHashTable(len(strs))
...: ht.factorize(strs)
...:
1000 loops, best of 3: 544 µs per loop
In [4]: np.random.seed(0); strs = pd.util.testing.rands_array(2, 10000)
In [5]: strs[:5]
Out[5]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)
In [6]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
...:
100 loops, best of 3: 2.55 ms per loop
OK, I've realized that 3.5k two-character strings that differ only slightly from each other may be a corner case for a hash function as simple as x31 and a probing scheme that favours neighbouring buckets (a rough transcription of the x31 hash follows after the timings below). I've re-run it on short, heavily duplicated strings and the newer version is indeed slightly faster.
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 158 µs per loop
In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 176 µs per loop
In [18]: pd.__version__
Out[18]: '0.15.0-6-g403f38d'
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 155 µs per loop
In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 169 µs per loop
In [18]: pd.__version__
Out[18]: '0.15.0-9-g7d0d41b'
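For context, the x31 string hash mentioned above amounts to roughly the following (a Python transcription for illustration; the real thing is the X31 string hash in khash.h):

import string

def x31_hash(s):
    # h = h*31 + ord(c) for each character, truncated to 32 bits
    h = 0
    for c in s:
        h = ((h << 5) - h + ord(c)) & 0xFFFFFFFF
    return h

# all ~3.8k two-character alphanumeric strings hash into a narrow contiguous
# range, so before any probing even starts they already crowd into one dense
# region of the bucket array
chars = string.ascii_letters + string.digits
hashes = [x31_hash(a + b) for a in chars for b in chars]
print(len(hashes), min(hashes), max(hashes))   # 3844 strings, hashes span only ~2400 values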
@CarstVaartjes was a conclusion ever reached about this possible upgrade?
Hi! Not really, as I missed the later update! It's still an interesting 10% performance increase in a core function, I think. Also, khash is really stable (it's not really a moving target, so an upgrade would really get the best out of it) -> https://github.com/attractivechaos/klib/commits/master/khash.h
This was attempted in https://github.com/pandas-dev/pandas/pull/49197 but ultimately reverted due to a performance degradation with isin. The hashing expert @realead has a separate issue open to convert isin to use a set implementation rather than a hash implementation, in https://github.com/pandas-dev/pandas/issues/39799. Paired together, those changes might get us moving forward on this.
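Conceptually, the point about isin is that it only needs membership tests, so a key-only (set-style) structure suffices; a minimal sketch of that idea, not the pandas implementation:

import numpy as np

def isin_sketch(values, candidates):
    # membership only: no payload per key, unlike the label-producing
    # hash-map path that factorize needs
    lookup = set(candidates)
    return np.fromiter((v in lookup for v in values),
                       dtype=bool, count=len(values))

print(isin_sketch(np.array(['NY', 'IL', 'OH'], dtype=object), ['NY', 'CA']))  # [ True False False]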
@WillAyd we investigated quadratic probing back when merging #36729. If I remember the results correctly: the downside of quadratic probing was that there were some "normal" sequences for which the performance degenerated quite a bit. It was decided to keep the current approach because the probability of degeneration on "normal" sequences is much smaller.
We have some such "normal" sequences in the performance suite, which would lead to degenerate behavior with quadratic probing. This is probably what you observed.
Yeah, I think this comment was a clear example of that: https://github.com/pandas-dev/pandas/pull/49197#issuecomment-1320930369
Hi,
I'm working with two colleagues (@FrancescElies & @JaviBi4) on seeing if we can add groupby to bcolz. What we did first was to see if we could make Pandas' excellent klib implementation more library independent; see https://github.com/CarstVaartjes/khash. A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:
2013-05-02 (0.2.8):
We updated klib to it (which was something of a headache, to be honest), but the quadratic probing seems to be quite a bit faster, which might make it interesting for Pandas to retrofit this into the general Pandas master. See the example:
So a 20%ish improvement for int64 and a 65% improvement for strings in this example. Just thought I'd give you a heads-up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything, so this might help make Pandas even faster).
Edit: added float64 example too (10-15% improvement)
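For anyone wanting to redo this kind of comparison, a minimal sketch (hypothetical sizes and value ranges, not the original example; run it once against the old klib and once against the upgraded one and compare the per-dtype timings):

import timeit
import numpy as np
import pandas as pd

np.random.seed(0)
samples = {
    "int64": np.random.randint(0, 1000, size=1000000),
    "float64": np.random.randint(0, 1000, size=1000000).astype("float64"),
    "strings": pd.util.testing.rands_array(10, 100000),
}

# factorize exercises the klib hash tables for each dtype
for name, values in samples.items():
    t = min(timeit.repeat(lambda: pd.factorize(values), number=10, repeat=3)) / 10
    print("%-8s %.2f ms per factorize" % (name, t * 1e3))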