Closed ianozsvald closed 2 years ago
Hi Ian! Thanks for taking an interest in our little library. I just had a look at your presentation. I'd like to add that in all the cases where the keys are already small, non-negative integers (e.g. ASCII chars), you can usually skip the factorize step completely. Instead, just cast them to int and use them directly as the group index. The group index does not need to contain every number from 0 to x.
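A minimal sketch of that idea in plain NumPy (using `np.bincount` as a stand-in for a grouped sum; the ASCII keys and values are made up for illustration):

```python
import numpy as np

# Keys that are already small non-negative integers (here: ASCII bytes),
# so no factorize step is needed -- the byte values serve as the group index.
keys = np.frombuffer(b"abcaab", dtype=np.uint8)   # [97 98 99 97 97 98]
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Group-sum directly on the raw key values; the index need not be dense,
# the unused slots (0..96 here) simply stay zero.
sums = np.bincount(keys, weights=values)
# sums[ord('a')] -> 10.0, sums[ord('b')] -> 8.0, sums[ord('c')] -> 3.0
```

The sparse result array wastes a little memory on the unused slots, but that is usually far cheaper than a factorization pass over millions of rows.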
Hello. Thanks for your library, it has been on my radar for a while. I just spoke about it at NDR.ai & DevDays conferences in my "Faster Pandas" talk. I'm the co-author of O'Reilly's High Performance Python book (2nd ed contains Pandas tips) and I teach a public course on this topic. Getting numeric Python code to run faster is a big deal for me.
I was playing with your `aggregate` using Numba and it works nicely in my limited testing. I realised that the need to make categories up-front is a bit of a blocker, so I've done some digging. NumPy's `unique` function sorts the data before building the key set, so it is slow (I have no idea why it does a sort). Pandas' `factorize` method doesn't sort and is very fast. With a bit of messing around I came up with a faster solution for some data distributions which, possibly, could be added to your library (or maybe just left here). I'm including a code sample for discussion, if you're interested.
Code:
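The original code sample is not reproduced in this extract. Purely as an illustration of the general idea being discussed (and not the author's actual v3 algorithm), a first-seen-order factorizer that avoids `np.unique`'s sort might look like:

```python
import numpy as np

def factorize_first_seen(arr):
    """Map each value to an integer code in first-seen order, without sorting.

    Illustrative only: this mirrors what pd.factorize does conceptually,
    unlike np.unique, which sorts the data before building the key set.
    """
    codes = np.empty(len(arr), dtype=np.int64)
    mapping = {}
    for i, v in enumerate(arr):
        code = mapping.get(v)
        if code is None:
            code = len(mapping)   # next unused code, in first-seen order
            mapping[v] = code
        codes[i] = code
    return codes, np.array(list(mapping))

data = np.array(["old", "new", "new", "old", "new"])
codes, uniques = factorize_first_seen(data)
# codes -> [0, 1, 1, 0, 1]; uniques -> ["old", "new"] (first-seen order)
```

A loop like this is slow in pure Python but becomes competitive once compiled (e.g. with Numba, using a typed dict), which is presumably the spirit of the snippet discussed here.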
Thoughts?
I think at the least it would be useful advice on the README to tell Pandas users to prefer `pd.factorize`, or, if they've used a `category` encoding, to use `ser.cat.codes`, since those codes have already been pre-computed. If you haven't looked behind a Pandas category, this slide shows the `.cat` attribute: https://speakerdeck.com/ianozsvald/ndr-2021?slide=14

The dataset for that talk is the UK Government House Price dataset, 25M rows of house sales since 1995; the distribution of "is new vs is old" is very clumpy, so my v3 algorithm above outperforms `factorize`. Maybe that helps other folk with similarly clumpy data.
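The `ser.cat.codes` tip can be sketched as follows (a toy series standing in for a clumpy column like "is new vs is old"):

```python
import pandas as pd

# A category-encoded series: the integer codes are computed once at
# construction time and stored alongside the categories.
ser = pd.Series(["old", "new", "new", "old"], dtype="category")

# Reuse the pre-computed codes directly -- no second factorization pass.
codes = ser.cat.codes.to_numpy()

# By contrast, pd.factorize has to (re)build the value -> code mapping:
codes2, uniques = pd.factorize(ser)
```

Both give a dense integer group index suitable for passing straight into `aggregate`, but `.cat.codes` is essentially free when the column is already categorical.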