Wow, great job so far! Sorry for the delay, I was at a conference over the weekend. I'll go through step by step, hope I don't miss anything.
1. dtypes: We should do our best to provide some reasonable defaults, and it has to be the same default for all implementations, so this needs to go into a separate function. On the other hand, I guess it's fair to just take int64/float64 for sum/prod, and let the user choose to downcast this himself, if he knows the result to be smaller than that. Any heuristic will require inspecting the data beforehand: the more accurate the guess, the more costly it is to compute, and the cheaper the guess, the more prone to overflows it will be. I wouldn't want to do that.
For bool and int, I wouldn't bother with a separate nan type. If in doubt, the user can run an anynan over the same data, and use this as a mask afterwards. End of story. In general, the fillvalue must be representable with the selected output dtype. So the fillvalue should also be an input for the shared dtype-selection function.
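Roughly, I'd imagine the shared helper looking something like this minimal sketch (the name and exact rules are illustrative only, not an agreed API):

```python
import numpy as np

def _choose_dtype(func_str, a_dtype, fillvalue):
    if func_str in ('sum', 'prod'):
        # fixed, safe defaults; the user may downcast the result afterwards
        # if he knows it fits into something smaller
        dtype = np.float64 if np.issubdtype(a_dtype, np.floating) else np.int64
    else:
        dtype = a_dtype
    # the fillvalue must be representable in the selected output dtype
    if not np.issubdtype(dtype, np.floating) and not np.isfinite(fillvalue):
        raise ValueError("fillvalue %r is not representable as %s"
                         % (fillvalue, dtype))
    return np.dtype(dtype)
```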
2. Matching vals to idx: Fully agreed. KISS principle.
3. ndim: Yes, should be in the API.
4. Scalar vals: The simple hack sounds reasonable. Not sure I would bother spending much time writing optimized code for that rare case.
5. Broadcasting: KISS principle, let's forget this.
6. Contiguous mode: The downscaled thing mainly does a compression of the outputs, skipping all empty fields, by running unique over accmap first. It could be supported just like the scalar vals, by simply modifying accmap before actually running the main action. It's a nicety, and I won't bother throwing it out.
For the contiguous mode, this was my initial implementation, requiring an already sorted input. After hacking around a bit longer and improving the implementation for unsorted accmaps, it turned out the speed advantages were actually not as great as I had thought, and I could get comparable speeds for arbitrary accmaps. It's basically one pointer indirection that can be skipped when accessing the data; it looks like prefetching etc. works so well that this is not a big cost anymore. As it is mainly a subcase of the unsorted map, it would make sense to throw it out completely.
Renaming incontiguous to something better, I'm totally fine with that. I'm not a native speaker, so I apologize for the naming.
For your other suggestions on this, I'd say KISS principle again. If the user wants some 0-group, simply set some 0-values in the accmap before running accumarray over it.
7. Bounds checking: As the functions in the inner loop just add simple values, the whole process is mainly memory bound, so adding a bounds check is not a great overhead. E.g. nansum and sum run at nearly the same speed, even though nansum has an additional val != val comparison inside. So for the sake of simplicity, and as segfaults are really ugly, I wouldn't offer a possibility to turn bounds checking off. It should be safe to use the library, even if there are some spooky values in your accmap.
So, I'm totally fine adding the sz argument.
8. Input variable naming: vals, idx, accmap, a... I like speaking variable names whenever possible. idx is quite generic, used millions of times in any code; it doesn't describe well what it is meant for. Iterating over it, idx[i]... what do I get? accmap is more descriptive: it's a map of how to accumulate vals, an accumulation map. I'd clearly prefer that over idx. Matlab calls it subscripts, which may be another good option. vals vs a: neither gives any information, so they are equally bad, I'd say. Though I don't have any better suggestion right now, I'm fine with both names. Finding something better couldn't hurt though.
It may be a good idea to change fillvalue to fill_value, as used in numpy e.g. for masked arrays.
For getting scipy weave to run, it's actually a single line on linux, not more than "apt-get install python-scipy", and everything works. I also know what kind of pain in the ass it can be to get toolchains or math libraries installed on windows. I haven't used windows for ages and can't help too much with setting up scipy for it; the documentation there should cover that, I hope. What weave generally does is create C code on the fly, depending on the input data type, and store the compilation results for reuse. A working C compiler is necessary for all this to work, so there is no way to provide binaries that would run without a working toolchain.
I also wouldn't worry at all about the purepy implementation. Any sane person trying to do some math with python will install numpy. If not - not my problem. No need to reinvent the wheel just for a demonstration version of accumarray. If some things don't work there, so be it.
In terms of aliasing, you had the clearer approach. We should have a shared function converting any func keyword input into a unique function name string, with which every implementation then has to deal itself. Your implementation is a good start for that.
Your English is pretty much perfect (I actually spent a year in Moscow as an EFL teacher so I should know..it's certainly a lot lot better than my Russian!).
Ok...
I think we need to flesh out the accumarray_utils.py with the common numpy stuff...but bear in mind that I think it is a nice feature to support a fully non-numpy implementation...there are some environments where it might be a pain to get numpy working, but where it's nice to have some simple accumarray for smallish datasets..anyway the code is pretty short and simple, so I don't think it's a big deal...it's also quite fun to benchmark it against numpy!
1. dtypes: bool - agreed. sum - it's pretty simple to get a safe bound just using len(vals) and the number of bits in vals; for example, 30,000 int16s can be safely summed into an int32, or 60,000 uint16s into a uint32...but, yes, I think we should put all the logic in one utility function and work on it there..with regards to dtype + fillvalue, my _get_minimum_dtype function is a basic first draft.
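Something along these lines is the kind of bound I mean (a rough sketch; the helper name is made up and would presumably fold into _get_minimum_dtype):

```python
import numpy as np

def _safe_sum_dtype(a_dtype, n):
    # worst case: all n values sit at the dtype's extreme and fall into
    # a single group, so the sum needs log2(n * max_abs) bits
    info = np.iinfo(a_dtype)
    worst_case = n * max(abs(int(info.min)), int(info.max))
    for candidate in (np.int16, np.int32, np.int64):
        if worst_case <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    return np.dtype(np.float64)  # even int64 could overflow

_safe_sum_dtype(np.int16, 30000)   # -> int32
```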
6. modes: So, are you suggesting that the only thing we should support is something like this:

```python
if <whatever_the_special_test_is>:
    idx_is_used = np.zeros(sz, dtype=bool)
    idx_is_used[idx] = True
    new_idx = np.cumsum(idx_is_used) - idx_is_used
    idx = new_idx[idx]
    sz = max(idx) + 1
```
If that's what you mean, then could we treat it as a special case of fillvalue, e.g. fillvalue=None means drop all the missing values...although for multidimensional indexing you could still end up with some empty spots..maybe we could just say that it's not supported for multidimensional output. The only other reason not to use fillvalue=None would be for the func='array' or func='sorted' cases, because there the user may actually want to use None as the fillvalue...but I don't think the user will be too annoyed if they can't have that.
Just to clarify: if the above is true, then there is no mode needed at all, right?
8. variable naming: I really didn't like a, but I guess you are right that vals is equally meaningless!...not sure what else to suggest there. But for the indices, how about dest_idx (for destination index - this is in the spirit of C/asm-like naming when you provide output pointers as input args)..I felt like "accumulation map" was a bit confusing - it sounded to me either like the name of a function/lambda or else the output of the accumulation (rather than the input to it).
I'm probably not going to do any more work on this in the next few days, so you can safely commit to the branch I was working on and it shouldn't be too confusing.
9. benchmarking and testing: I'm sure your stuff was good, but I was lazy and ended up basically starting from scratch in a separate file rather than trying to understand everything you were doing. As discussed initially, we should aim to get testing to a pretty high standard.
By "starting from scratch" you mean the test_and_bench_simple module? I agree that testing and benchmarking should be separated. I'm also fine if the testing part with pytest is up to me. I guess it could use some more comments; it's surely not too intuitive what's going on there. For basic understanding I'd recommend reading up on pytest fixtures and parametrization.
accmap_all and accmap_compare are different fixtures. accmap_all simply returns different accum implementations for the tests. Those are the first bunch of tests, testing basic functionality that has to be guaranteed by all implementations. This could also be achieved by parametrizing all these tests separately with all implementations; parametrizing the fixture just does this in a more convenient way.
accmap_compare provides a whole namespace with a bunch of testing ingredients: data with and without nan, depending on what shall be tested, a reference function etc. test_compare then runs all the listed functions in func_list against any combination of implementations provided by the accmap_compare fixture.
fillvalue=None would be fine by me. This is what is done right now in that case:

```python
accmap = np.unique(accmap, return_inverse=True)[1]
```

Btw. please let's be more specific with the thrown errors in the future rework. It's really bad practice to only throw Exception. If I would like to catch any common problem with values or types that might arise, and handle it in a graceful way outside of accumarray, this is not possible right now. Catching any "harmless" error now requires a catch-all statement, which also catches any sort of other serious runtime errors, memory errors, io errors etc., which you're probably not able to handle in a graceful way.
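E.g., purely illustrative:

```python
# instead of:
raise Exception("fillvalue not representable in output dtype")
# raise the most specific builtin, so callers can catch it cleanly:
raise ValueError("fillvalue not representable in output dtype")
```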
that all sounds good to me..
I stand by what I said above - I'm not going to touch anything for a couple of days at least so it should be easy for you to commit straight onto this branch
Ok, I'll see where I get with this over the next days.
I guess the last open thing is a final decision in terms of variable naming. I got the names initially from here: http://wiki.scipy.org/Cookbook/AccumarrayLike . So it's not really my invention. As it's probably the oldest public python accumarray implementation (dating back to 2010) and was promoted on scipy, it likely got some reach. I'd rather stick with that than invent something completely new without good reason. dest_idx seems to me similarly descriptive, but also not really an improvement over accmap. Maybe one more suggestion how to see accmap in a more speaking way: different values in accmap mark and group regions within a, just like different colors mark countries in an ordinary map.
a is, besides being used in the cookbook, quite commonly used throughout numpy and scipy for any array-like input. cumprod, all, copy, zeros_like and most of linalg, just to name a few, use a for that. Sticking with this naming scheme looks like the best idea to me, if there is no clearly better name. If we want to stress the fact that we also support scalars as the second argument, following this logic we might use x.
For some unclear reason I had taken fillvalue without the underscore, even though it's commonly used with one. This should be fixed to fill_value.
About the order= keyword, I guess this should not be added here. The conversion from C to F is cheap and easy, and can be done manually afterwards, if in doubt. This would save one more keyword. If I saw it correctly, there wasn't any logic implemented for it so far. For the input, C/F can be detected manually and acted upon accordingly. Maybe the output should just copy the ordering from the accmap?
impl_dict= shouldn't be part of the public API, as you already wrote in the docstring. So I guess you don't mind prefixing an underscore there.
About the keyword order, I initially would have tended to agree to keep matlab ordering. Thinking about it now, I guess this should be talked about again. I was once told, when I suggested some accumarray thing for numpy, that numpy is not matlab and does not need to repeat its quirks. func seems to me to be clearly the most used keyword argument, nearly worth being a mandatory, positional argument. size is something useful, but probably even less frequently used than fill_value. So while we could keep the order for fill_value and size in matlab style, we may have func as the first keyword argument, just to allow it to be used without explicitly writing func= in front of it. And if we tend not to lean too close to the matlab implementation, it may be more speaking to use size= instead of sz=. All in all, the cookbook version syntax seems to be well thought through...
If we could agree on all this, we would finally end up with:

```python
def accumarray(accmap, a, func='sum', size=None, fill_value=0, dtype=None):
    ...
```

These are the conventions we talked about already, just to sum it up: fill_value=None triggers downscaling. Are you fine with this?
In no particular order...
I suspect that the cookbook accumarray hasn't been that widely adopted, because otherwise it would have been improved upon by now..and there's not that much activity on the SO question. When we get this repo sorted out we should endeavor to get it linked from the cookbook page, so people don't find that page and think it's the best thing out there.
C/F - I'm not sure if we're on the same page here... I have used it in the calls to ravel_multi_index and reshape at the start and end of the function respectively. I thought it would be nice for the user to be able to specify the layout of the ndarray that accumarray is creating from scratch. Admittedly the user can control this "manually" by passing in accmap[::-1,:] in order to flip the order of the output if need be, but that's a bit clumsy. Personally, my feeling is that order is a well understood/easily-ignored kwarg in numpy, so I don't feel like there's anything wrong with providing it here.
With the first two args, I guess you're right that a is used elsewhere in numpy, so maybe I should accept it. But I really don't like accmap! ufunc.at uses indices, which is basically what I am suggesting too. Here's some more options: group_idx, grouping_idx, target_idx, out_idx, at_idx, or maybe a_label, a_group, a_idx, a_index, a_dest_idx ..anything else?
I'm not too fussed about whether or not we match up with Matlab beyond the first two args, as long as we provide basically the same functionality. Python's superior argument passing syntax is basically doing a pretty good job already - my feeling is that different users are likely to use the other kwargs to different extents, so it's not that easy to provide a one-size-fits-all optimal ordering...hence the thinking that maybe just sticking to Matlab's order is not a terrible idea...but as I say, I'm not too fussed.
I'm not sure what you mean by "a[accmap >= size] is treated as nan"...where accmap >= size, the boundary check should throw an exception, surely? If you did treat those values as nans, then where would you put them in the accumulated output?
To clarify the nans-in-bools-and-ints question...it's only relevant for the fill_value. And note that the user might want to choose +-Inf as the fill value as well as nan. My feeling on this may have changed a bit...perhaps we should upgrade to at least float32 if the user insists that they want non-finite values in their output array...both for bool and int. By default the fill_value is 0, so it would be the user's fault if they force the change in the dtype. We could always implement it as a second step at the end, where you convert from int/bool to float and set the fill_value.
And what did you mean by the last point - "no implicit matching of a to accmap"?
C/F is not a big deal to support in pure numpy operations, as numpy takes care of all the iteration issues related to it. Having to take care of this in C is quite an annoyance, which I wouldn't like to deal with without good reason. Everyone who really cares about the order can easily change it afterwards imho. I once had a bug in the C version, not correctly recognizing F arrays and producing wrong results, so that's why I said autodetection, and acting upon it, must be guaranteed. Nothing to do for the numpy version here.
Let's pick group_idx then. Seems like a nicely descriptive choice.
What I meant with "treat like nan" was: ignore these values. If the size of the output is already given with size= and not determined by the biggest value of group_idx, there is no place in the output to accumulate anything for them. It's like cropping the output. It's the current behaviour, nothing special; I just wanted to mention it here.
If dtype is specified, I guess we agree that specifying an invalid fill_value should simply raise a ValueError. If the user specifies dtype=int and fill_value=-inf, what else to do with it?
For more vague situations - fill_value=-inf for a bool operation, no dtype specified - there are points for both sides. I don't like too much implicit behaviour. Who knows where the fill_value came from; some calculation that went wrong? A bool operation simply doesn't produce nan (or any other float) as an output. Neither should accumarray, if not explicitly told to do so. Same with sum/prod on ints and float values as fill_value. If the user insists on specifying +Inf as a fill_value, I'd expect at least the output dtype to be properly specified, so it's clear what should happen. The analogy would be setting float values into an int array: if it's a more or less sane operation, the value is rounded and set; if it's nan/inf, there is an error. Anyway, this is probably too much detail; we should probably skip this point until there is some implementation of the disambiguation function.
No implicit matching: if the shapes don't fit together for a sane operation, there shouldn't be any fancy magic making it somehow work, but just a clear error.
One more thing. I totally don't care about my name being presented anywhere in the code, readme, whatever, or about someone copying. This is such a small snippet, not worth claiming any authorship on anything imho. So I'd suggest scratching any name mentions in the readme and the code. The git history is descriptive enough, if someone really cares. If you'd prefer to have it in the repo itself, I'd rather suggest something explicit for that, like a contributors.txt.
I don't like silently cropping out-of-range indices. It's likely to hide bugs in users' code. If they really want to do that then they can do a simple masking before they pass the values in to accumarray - we could give an example of this in the docs, as sketched below. Alternatively you could provide an ignore_out_of_range kwarg, so that it's very clear they are potentially losing data. bincount uses the minlength approach, which lets the user specify the minimum size and then returns a larger array if necessary...that's also a bit weird, but at least you don't silently drop data. ..I'm saying we should throw an exception for out of range.
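The masking example for the docs could be as simple as this sketch (using the arg names we've been discussing):

```python
import numpy as np

# explicitly drop out-of-range groups instead of having them cropped silently
valid = group_idx < size
result = accumarray(group_idx[valid], a[valid], size=size)
```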
I think I agree with everything you just said about dtypes, let's see what the code ends up looking like and discuss further if need be.
With the order I'm still not sure we're talking about the same thing - you can't autodetect the order of the output because it doesn't exist yet....you can autodetect the input, and you should, but I leave that up to you. Regarding the output, there are a few cases where the user would be grateful for being able to choose which dimensions are contiguous in memory:
I'm not sure how you were planning on doing the multidimensional indexing, but if you use ravel_multi_index and reshape then you actually don't need to worry about the order in C, because you only have to deal with 1D indexing. If you're keen to write a single-pass loop over the data then I think it's fairly easy to inline the ravel_multi_index - it's really just a dot product of the index with the shape of the output.
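A sketch of what I mean, where aggregate_1d stands in for whatever the 1D core ends up being, and size/order are assumed given:

```python
import numpy as np

# group_idx here is an (ndim, n) array of multi-indices, size the output shape;
# `order` only matters in these two calls, never inside the C loop
flat_idx = np.ravel_multi_index(group_idx, dims=size, order=order)
flat_result = aggregate_1d(flat_idx, a, size=int(np.prod(size)))
result = flat_result.reshape(size, order=order)
```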
Regarding authorship - I'm fine with removing all references in code, but it would be good to leave one line in the readme somewhere...not a big deal though.
Have only one single large value (maybe due to an overflow) in group_idx, and your python interpreter is likely to die with some memory error, as it cannot allocate such a large output array. Fine with me; on the other hand, cropping is exactly the desired action of specifying size at all. It prevents unwanted memory overflows, if you already know your data. If cropping is undesired, use fill_value=None to downscale unreasonably large group indices.
About ordering for ndim output, I would have defaulted to C and left it to the user to wrap it into np.asfortranarray(accumarray(...)) if desired. I don't work with fortran, so the only time I had to deal with fortran ordering was when importing some matlab data. It's quite common in numpy to have it as an output option, so well then, let's include it.
Cropping is easy to use and easy to understand, but I feel very strongly that silently dropping some data is a recipe for disaster...I don't know what kind of data you deal with normally, but in my field you often find weird data that needs to be carefully dealt with, rather than simply having "bad" values "swept under the carpet"...using fill_value=None doesn't solve this, because then you don't know what each of the output bins corresponds to. The behaviour I'm suggesting is in line with Matlab, so I don't think it's that odd a choice..but if you really want to offer the option of dropping these values, then I'm absolutely fine to make it a kwarg - I can add a masking step at the start of the code, or just throw NotImplemented.
For bincount, I'm not getting the point of the minlength argument at all, to be honest. I haven't used matlab in a while and don't have a working installation here. If it simply throws an error for group indices out of range... why not. I just checked our codebase: within roughly 400 different calls to accum, not a single one used the size keyword. I had basically implemented it as it was in matlab, though the details aren't well described in their manual pages. I guess it makes more sense in an ndim context anyway.
In terms of downscaling, getting the corresponding original indices is possible using unique, as described above (see the sketch below). Maybe we should include return_indices= for that case? Or, for the sake of KISS, drop downscaling completely and offer accumcompressed or something like that as a wrapper function instead?
This may well fit together with some other little helpers we're currently using, like uaccum, for accum and unpack in one go, and a bunch of other utils.
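For reference, a sketch of recovering the original labels after downscaling:

```python
import numpy as np

# np.unique yields both the original labels and the compressed index
unique_groups, compressed_idx = np.unique(group_idx, return_inverse=True)
res = accumarray(compressed_idx, a)
# res[i] now belongs to the original group unique_groups[i]
```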
Looks like we're pretty much through then. One last thing: I'd be a big fan of staying with accum as a name. It's just much shorter, and array isn't carrying any new information anyway. What else would you accumulate, if not some sort of array? The only reason to stick with accumarray would be that matlab uses it, and matlab itself is not exactly well known for an elegant and consistent naming scheme. accum seems more like the concise numpy way of naming.
Regarding size - I think that just goes to show that there are many ways and different reasons to be using accumarray..it's such a versatile function! [The minlength in bincount is a bit odd, as I say, though if you cap your indices before passing them to the function it is kind of useful, because you are then basically specifying the size. ..and specifying size is useful if there is a well-defined, finite number of groups, but you don't necessarily have data for all of them.]
I agree with the concept of a helper library. Potentially you could attach them as attributes to the main function, e.g. accumarray.dropping_empty_groups(...) or whatever - it could be nice in terms of readability, and it's similar to numpy's ufunc.at, ufunc.reduce etc. But maybe it's easier to keep it separate from the main function - then we don't have to be so strict with API definition and testing etc. ...I did find your choice of naming a bit unintuitive, though I guess it's difficult to describe such a specific operation in so few words.
Anyway, to address the question of accum versus accumarray...I agree that Matlab chose a stupid name. It should have been group, or aggregate or something, not accum, because that implies summing only. And yes, the array doesn't add anything. And it annoys me that Matlab abbreviates some things and not others [the worst culprit has to be their function prctile]. But I feel like there is a lot to be said for sticking with their name...people looking for it will try looking for accumarray first (e.g. the SO question, the name of the scipy cookbook entry, even the name of this repo!), even if they would potentially try a more general search afterwards....even people who have used it, but can't remember where they found it, would probably prefer not to have to remember another name for it. Also, if you did call it accum, you have the problem that it makes it sound like it only does summing (which, as I say, is a problem with the Matlab name as well, but it would be more confusing when it's just accum by itself)...really, I think it comes down to a question of "branding": I'm happy to piggyback off the "brand"/"concept" created by Matlab. People always seem to refer to the Matlab function when discussing this, so I feel like it's probably worth not confusing the matter by changing the name.
Hmm, I just feel like there will be a time after matlab, and probably it has already started. So it may be a good time to get rid of some old luggage. I don't really like to compromise with matlab's shortcomings. Pandas goes with group. I don't know if this name is already used in the numpy context, and it doesn't clearly describe what happens either. aggregate seems like the best fit to me. Maybe a bit long.
It's fine if the module keeps accumarray as its name; this way someone finds it when looking for this functionality. Inside the module, I suppose, we're free to pick another, more fitting name, more in line with good naming practice and numpy conventions.
After writing the previous post, I thought about it a bit more and decided that really I would also prefer aggregate.
If we go with that though, then I think we should try and drop all mention of "accumarray", except when referring explicitly to Matlab...so maybe rename the repo to py-aggregate or something...I leave that to you.
Agreed with removing all mentions of accumarray. I just wouldn't like to rename the repo right now; google searches and any existing accumarray links point here. Maybe later.
I see there's also this other question that you've answered...looks like matplotlib.mlab.rec_groupby is also something to look into.
When breaking with matlab traditions... something like aggregate(group_idx, a).nansum() could be nice. And maybe also providing ufunc.at syntax... A wrapper for matlab syntax could also be provided. Maybe I need to rethink the layout a bit more before refactoring all the code, so that we can have one clean implementation and easily provide different interfaces for it.
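A rough sketch of how such a wrapper could sit on top of one clean core implementation (here accumarray stands in for that core; whether it takes per-func kwargs like ddof is exactly what's still open):

```python
class aggregate(object):
    """Deferred-call syntax: aggregate(group_idx, a).nansum(),
    aggregate(group_idx, a).std(ddof=1), ..."""
    def __init__(self, group_idx, a):
        self.group_idx = group_idx
        self.a = a

    def __getattr__(self, func_name):
        def run(**kwargs):
            # each functor receives only its own parameters in a clean way
            return accumarray(self.group_idx, self.a, func=func_name, **kwargs)
        return run
```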
Hmm, I'm not sure you gain much syntactically by doing that.
In terms of optimization, I guess there may be something to be said for creating an aggregate object and then executing commands on it - things like sorting, removing nans, calling ravel_multi_index etc. could be done once, and then you can call multiple things like min and max, etc...but that is getting quite complicated and probably not a particularly common usage case. It might be easier to support func=('min', 'max'), i.e. if func is a tuple then return a tuple giving all the requested aggregations. You could then super-optimise your weave stuff to do everything in a single pass if you really wanted to!
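i.e. usage would just be:

```python
# one conceptual pass over the data, two results
mins, maxs = aggregate(group_idx, a, func=('min', 'max'))
```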
I'd rather throw out the weave stuff asap ;) Nevertheless, yes, there may be more room for improvement. E.g. collecting max(group_idx) would only need to be done once. Though no one guarantees that nobody messes around with group_idx in the meantime between different calls... I suppose we should stick with the syntax discussed so far, and see if we want to add something later.
Well, I do quite like the idea of offering tuple/list support for func...it's easy enough to detect and throw NotImplemented if need be, and there are a few optimisations to be had in numpy, though the real gains would be in weave...presumably, computing the min and the max simultaneously would only take a fraction longer than computing just min or just max...though the code might start getting complicated if you wanted to take advantage of that...anyway, I don't think I'm going to touch the weave stuff myself, so it's up to you.
I'm not against it, though I don't plan to implement any extensions to the C code besides the 2D indexing in the near future. It's quite a messy thing already anyway.
Most testing and benchmarking is working again now after the refactoring. I was a bit surprised about the bad pandas performance compared to the C implementation, as they claim to be fast. Initially I thought maybe the construction of the DataFrame would be a big overhead, but doubling the array size nearly doubled the pandas time. So the initial setup time seems to be quite small, close to negligible compared to the iteration time, and pandas is still a magnitude slower than the C version. Maybe they suffer the same problem as ufunc.at in having a generic solution.
I also added the rudimentary start of my numba implementation. I don't know exactly where I had left off; it was some numba versions ago, and I had stopped working on it further, as there were some annoying issues with numba that I didn't want to bother with. Chances are this is fixed by now. It would be worth giving it another try.
My plan for the next days: I want to have the current features and testing working again. Most of the xfails should be gone first. After that, I'd go on with adding 2d group_idx for C etc., ideally in a test-driven way: designing tests first, implementation later.
It's all looking nice so far.
I think I've actually got a working version of numba installed, so I should be able to play with that as and when it's ready.
One thought...if people don't want the whole repo with all the implementations and testing and benching etc., is there going to be an easy way for them to just download a single file, or perhaps two files? I guess we could suggest copy-pasting the contents of utils.py at the top of your implementation of choice, replacing the from utils import ... If that is the best way of doing it, then it would be worth making sure we don't use import something as something_else, because that would stop the copy-paste from working. ..I notice that you did this with aliasing_numpy as aliasing.
Getting the dtypes right is definitely a bit of a messy question. I notice that there are a few lesser-known functions in numpy which might be relevant - see this Note in the docs. We should probably follow their conventions as far as possible.
I don't like copying the content of utils into every single file at all. Too much duplicate code there, way too much. The whole idea of utils was the reduction of duplicate code, and it's by far not finished yet. It may be worth making a standalone gist for single implementations with only the most necessary code, once the code is somewhat stable, yes. Especially the numpy-only solutions may be relevant there.
Importing aliasing_numpy directly or renamed, it will be exactly one line to change in any case after the copying. Actually, I had renamed it exactly for the sake of better interchangeability of the remaining code. I wouldn't worry about that too much right now. There will probably be no automated process to make that gist, but it also won't take more than 3 minutes to copy it together, if required.
You're talking about np.can_cast? Lovely function, indeed. Didn't know about that one. I had my experiments with using dtype.type, as you've probably seen. I'll leave it like that for now, but having can_cast in there may make life quite a bit easier.
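For reference, can_cast answers exactly the kind of question we keep running into:

```python
>>> import numpy as np
>>> np.can_cast(np.int32, np.float64)  # safe: every int32 is exactly representable
True
>>> np.can_cast(np.float64, np.int32)  # unsafe: would truncate/overflow
False
```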
Btw, one of the gains of the other syntax: aggregate(group_idx, a).std(ddof=1) would easily be possible, without passing all kwargs down to every subfunction. The functor could receive its own parameters in a clean way.
I'm not convinced - it looks more complicated that way, and I think the code will end up more complicated too. Isn't ddof the only extra arg for the main set of functions we have? If so, couldn't we have some special code that passes it through where necessary and throws an Exception otherwise? Alternatively you could implement it as a setting: aggregate.ddof = 1; aggregate(group_idx, a, func='std'). If you keep it the "simple" way it's easier to support arbitrary functions, although you can of course invent a syntax for that too: aggregate(group_idx, a).using(custom_func) or something.
Incidentally, I was looking at optimising min and max..I haven't managed to do any better than ufunc.at, although if you wanted to compute both min and max it would only take about 10% longer than computing just one or the other...at least for the benchmarking inputs I've been using.
I'm sure there are a few other similar pairings, like mean, sum, and var/std, which can be co-computed more quickly than if they are done separately....as I said before, you could take advantage of this with either syntax, but I reckon it would be a lot easier with the func=<tuple_of_desired_outputs> syntax, because then you can see all the desired outputs in one go and work out the best way of doing everything there (which might simply mean looping over each func separately, but could involve one of these numpy optimisations, or a more fancy C optimisation)..if you were using the aggregate().sum syntax then you need to cache stuff inside the object and play all kinds of games to second-guess what the user is trying to do.
To summarise, I'm not completely against it, but I would prefer the simple way.
For reference, the min+max optimisation is as follows: get the indices from argsort(a), which takes almost as long as running ufunc.at. You can then apply those sorting indices to both group_idx and a, and use that in conjunction with the first and last tricks...or if you already know the counts, you could maybe use the cumsum(counts) trick [edit: you can't use counts..that's for a different kind of sorting]...but I don't think that's much quicker.
(Note my previous comment had a logical mistake which I've edited out)
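In code, the trick would look roughly like this sketch (size and fill_value assumed given):

```python
import numpy as np

# sort once, then exploit "later writes win" in fancy assignment
order = np.argsort(a)          # the expensive step, ~ as costly as ufunc.at
g = group_idx[order]
s = a[order]

maxs = np.full(size, fill_value, dtype=a.dtype)
maxs[g] = s                    # the last (largest) value per group sticks

mins = np.full(size, fill_value, dtype=a.dtype)
mins[g[::-1]] = s[::-1]        # reversed, so the first (smallest) sticks
```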
With the copy-paste thing, I wasn't suggesting that we do it - I definitely think we should have a separate utils file. I was just thinking that we could put a note near the top of the readme recommending that, if the user wants one simple file, they do the copy-paste themselves rather than downloading everything. The gist seems like a good option too, although a bit of a hassle for us, especially if we were going to maintain copies of more than one implementation.
The func-tuple would also have the benefit that it would be clear how long to keep temporary data. No problems with invalidating a cache, changes in the array while cached data wasn't invalidated, and such hassle. If we really wanted to do that, all the functors should become classes, so that they could have attributes declaring which temporary results they produce/require, and the dependencies could be calculated optimally.
Though, it would be a major hassle to have that elegantly in C at maximum performance in this generality. I won't do that now, nor in the near future, and afterwards the current numpy C-API will likely be completely dropped. For the other implementations, I see much less optimization potential, as most of the speedup would come from doing many things in one go, within one single iteration. That level of detail would only be possible with hand-crafted, auto-generated C code, and maybe, only maybe, with some numba fanciness. In the end, I don't think it is worth it right now.
All in all, I don't think any further syntax change makes too much sense right now - neither the one I talked about, nor the tuple thing. So let's just see all that as a collection of thoughts, get the current stuff back working, fix the not-implemented things as well as possible, and make it the new master version.
Ok, I've put the multiple functions thing in a new issue...https://github.com/ml31415/accumarray/issues/3
Ok, done for today. Benchmarks and tests are working again. Tomorrow: 2d for the C funcs + tests. Current benchmarks:
function py ufunc numpy weave pandas
-----------------------------------------------------------------------------------
sum 447.804 29.212 1.534 1.692 29.341
amin 464.312 31.669 31.343 2.074 29.153
amax 463.859 31.679 31.512 2.043 28.929
prod 458.839 30.250 30.083 1.597 29.277
all 426.466 35.240 1.802 1.564 1413.038
any 421.425 34.422 4.695 1.465 1421.063
mean 456.021 ---- 2.472 1.565 24.063
std 581.687 ---- 4.565 2.091 410.885
nansum 566.902 ---- 2.940 1.568 25.872
nanmin 578.444 ---- 32.391 2.099 25.857
nanmax 579.066 ---- 32.849 2.127 29.285
nanmean 624.522 ---- 3.752 1.700 25.133
nanstd 699.913 ---- 5.918 2.217 409.360
anynan 463.876 ---- 1.066 0.995 873.658
allnan 439.034 ---- 3.479 3.375 874.282
Looking at all versus any for numpy, you can see that there is quite a big difference depending on the actual values in the inputs...(those two funcs are basically doing the same thing).
Ideally, we should have a range of different input values and shapes etc. in order to get a better sense of benching.
Indeed somewhat strange. Right now it's also not such a useful test for any/all, as there are no False values in the set, just some random float noise.
Nice to see what a little path optimization can do. Performance crown for all disciplines back to weave, I'd say. Though admittedly, for sum the results are so similar now, it looks like I managed to produce pretty much the same machine code as bincount. all, any, allnan and anynan are about 15-20% faster now. I had tried the same approach for min and max, but there it turned out to be a performance penalty.
The current implementation profits if the fill_value equals the neutral element of the operation - 0 for sum and any, 1 for prod and all - as setting it can then be skipped.
function py ufunc numpy weave pandas
-----------------------------------------------------------------------------------
sum 459.352 29.488 1.550 1.533 28.357
amin 471.674 33.471 31.318 2.046 28.621
amax 479.402 31.737 31.589 2.122 28.598
prod 464.380 30.633 30.023 1.550 29.708
all 431.441 35.358 1.797 1.166 1434.810
any 426.978 34.184 4.720 1.232 1502.974
mean 454.718 ---- 2.395 1.914 23.165
std 584.959 ---- 4.518 2.109 411.108
nansum 560.355 ---- 2.806 1.443 25.624
nanmin 581.653 ---- 32.371 2.081 25.305
nanmax 584.505 ---- 33.268 2.922 25.701
nanmean 568.045 ---- 3.794 1.637 27.326
nanstd 697.972 ---- 5.791 2.198 412.248
anynan 460.697 ---- 0.966 0.939 865.429
allnan 446.610 ---- 3.345 3.326 866.557
Yes, lookin' good.
Do you know whether you're getting full SIMD vectorisation? I know numpy doesn't always manage it - https://github.com/numpy/numpy/pull/5144 - at least 1.9 doesn't. I'm trying to understand why min and max can't quite compete with sum...maybe there is a real hardware/logical reason for it?
pandas is doing miserably on some of those funcs...even purepy is beating it in many cases; that's ridiculous. Maybe we could ask someone from pandas to take a look and see if they have any thoughts!?
Are you still planning on doing the numba stuff? Even if the results are similar to weave, I guess there's something to be said for having both, given that people (such as myself) may only have one or the other installed and working.
btw...did you catch that comment I left on utils.get_func_str? ...I didn't like the idea of reading the name of arbitrary functions. I thought it would be better not to attempt to guess what a user-supplied function is doing simply based on its name.
Good morning Daniel! Yes, I saw that comment. I'd be fine with only catching well-known numpy functions. I used to use bottleneck for some operations; I'd probably add their functions to the detection as well, if they are importable, so as not to be dependent on them.
I can imagine the pandas guys definitely want to iron that out; maybe we should just open a bug report there, either now or after we've made this branch the new master. I've seen already that you're very good at that... ;)
I'm quite interested in getting the numba version done in some mid-term perspective. The C API for numpy that weave uses is deprecated, so with numpy 2.0, afaik, it will be gone. That means I'll need another speedy replacement. numba is the only candidate which I hope could end up with results similar to weave.
In terms of SIMD, I turned -O3 on now, which is supposed to allow vectorization. I thought that would have been the default for weave, but it wasn't. Looks like this gave another little speedup. Using a smaller int type for group_idx should help even more, I suppose.
function py ufunc numpy weave pandas
-----------------------------------------------------------------------------------
sum 447.070 29.095 1.564 1.425 28.815
amin 457.130 31.428 31.184 2.031 28.935
amax 460.381 31.632 32.000 2.020 28.819
prod 458.022 29.901 30.961 1.437 28.538
all 428.311 35.172 1.800 1.119 1466.945
any 421.640 36.417 4.705 1.190 1417.496
mean 455.721 ---- 2.382 1.575 22.925
std 584.937 ---- 4.471 2.132 409.972
nansum 567.291 ---- 2.826 1.444 25.763
nanmin 577.232 ---- 32.418 2.127 25.739
nanmax 580.206 ---- 33.473 2.105 25.755
nanmean 572.432 ---- 3.652 1.616 25.046
nanstd 701.725 ---- 5.747 2.178 410.060
anynan 461.680 ---- 0.949 0.956 857.830
allnan 440.435 ---- 3.358 3.328 856.977
The previous test wasn't that fair. I unfortunately had used Timer.timeit(10), which runs everything including setup code, and counts it all together. This means e.g. creating 10 DataFrames for pandas. I changed the test to take the best of 3 now, and kept the instruction cycles the same. Pandas is doing a fair bit better now.
function py ufunc numpy weave pandas
-----------------------------------------------------------------------------------
sum 425.832 26.158 1.452 1.235 16.606
amin 414.143 27.439 27.371 1.388 16.993
amax 417.362 27.863 27.842 1.413 16.707
prod 427.957 26.988 26.913 1.247 16.503
all 403.089 31.748 1.614 0.980 437.865
any 400.536 30.722 4.581 1.057 434.577
mean 423.542 ---- 2.179 1.374 9.432
std 534.851 ---- 4.993 1.667 131.912
nansum 538.288 ---- 3.078 1.310 11.024
nanmin 540.087 ---- 29.460 1.506 11.232
nanmax 538.084 ---- 29.443 1.554 11.185
nanmean 544.123 ---- 3.833 1.412 10.999
nanstd 653.580 ---- 6.613 1.722 131.985
anynan 432.374 ---- 0.878 0.865 267.062
allnan 403.610 ---- 3.507 3.514 266.326
With SIMD I guess it shouldn't really be a surprise that there's not much change - neither LLVM nor GCC claims to do any magic for multiple simultaneous reductions, only for single reductions.
http://llvm.org/docs/Vectorizers.html#the-loop-vectorizer
https://gcc.gnu.org/projects/tree-ssa/vectorization.html#slp-reduc-2
I don't think a smaller data type for group_idx will help with SIMD, because I think all indexing operations are converted to 32/64 bit integers, but it will possibly help with memory bandwidth..less data moving around means it gets to the cpu faster..though reading group_idx from memory may not be the bottleneck.
One thing you could try, if you were really serious, is for the case where the size of the output is pretty small. In that case, you could duplicate each bin, giving one slot for each of the elements within a SIMD register, i.e. for sums your output might be: [0_a 0_b 0_c 0_d 1_a 1_b 1_c 1_d .... n_d]. If the whole output vector still fits into L1 cache (probably 32KB), then I think you probably won't lose much in terms of read/write speed, but you can then compute 4x faster (or however many elements fit in your SIMD register). At the end you have to run over the output vector once, doing a final reduction of the 4 contiguous values into 1, but that's definitely SIMDable, and since it's small and in L1 it should run super fast...and in some cases your code already has to do that final pass-over anyway.
I meant to say in the previous comment that one obvious case where the above optimisation is probably used is in image libraries, where they produce histograms for int8 data on each channel...might be worth trying to find an example and see if they do indeed do that.
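Translated into numpy just to show the layout - the real win would only come in C, where the 4 lanes map onto a SIMD register (sketch; size, group_idx and a assumed given):

```python
import numpy as np

lanes = 4
wide = np.zeros(size * lanes)                        # [0_a 0_b 0_c 0_d 1_a ...]
slots = group_idx * lanes + np.arange(len(a)) % lanes  # spread work across lanes
np.add.at(wide, slots, a)
result = wide.reshape(size, lanes).sum(axis=1)       # final small reduction pass
```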
I'm also not really an expert in this bit fiddling, and I'm not sure it's worth digging further here. So far I'm more or less fine with the current speed. If it should get even faster, it would probably be better to throw the work onto the graphics card, or to get it multithreaded first. Current benchmarks:
function py ufunc numpy weave pandas
-----------------------------------------------------------------------------------
sum 461.919 26.215 1.518 1.297 17.298
amin 458.539 28.502 27.387 1.279 16.884
amax 456.517 27.840 27.952 1.428 16.679
prod 457.963 27.116 27.008 1.317 16.549
all 450.560 31.816 2.992 1.899 432.901
any 448.340 33.128 4.941 1.769 433.360
mean 486.738 ---- 2.177 1.393 9.425
std 575.171 ---- 5.081 1.687 130.671
nansum 471.691 ---- 4.765 1.780 11.637
nanmin 471.319 ---- 25.434 1.998 11.628
nanmax 472.273 ---- 25.972 2.077 11.703
nanmean 605.367 ---- 5.280 1.887 11.770
nanstd 770.958 ---- 7.475 2.151 130.542
anynan 406.154 25.931 2.261 1.718 266.696
allnan 405.386 27.845 3.780 1.707 268.729
[I was about to post the following, before I saw your comment]
Ok, well since I've dug myself a bit of a hole, here's a proper paper dealing with this properly: http://pcl.intel-research.net/publications/histogram-pact2014.pdf The relevant sections are pretty short - Section 2.2 basically suggests the same thing I did for when the number of bins is very small, but Section 2.3 suggests something more helpful (and fairly obvious), which is to use multiple threads... With multiple threads there are two main options:
numba apparently lets you do multithreading..dunno about weave: http://numba.pydata.org/numba-doc/0.15.1/examples.html#multi-threading
Btw. did you manage to get weave to run on windows?
Pushed some more changes now. Some test cases for multi-indexing are still missing. Besides that, we should be pretty much good to go for making this master.
I've created a separate branch for working on this.
So far I've spent a few hours trying to write a fairly comprehensive readme plus I've done some very basic work to split up the implementations into separate files. Note though that I've made basically no attempt at actually enforcing the API yet, or even checking if the code runs ok...it doesn't!
Obviously there is still quite a bit of work to be done.
In terms of agreeing on a single API, I think the main questions are:
1. What to do with dtypes? It's always nice to give the user full control over this, but it's also great when the program is able to automatically choose the most sensible option. For example, summing int8s probably needs to be upgraded to int32 or int64, maybe even float..the program could guess what is best based on some heuristic, although this would add extra time - the simplest heuristic would be to assume the worst case, i.e. all values are already at the maximum for the current bitdepth and all will be added to the same group, thus if len(idx)=100 and vals.dtype=int8 then we assume the sum is going to need log2(100*255) bits. This calculation is specific to sum; mean var std have different requirements, and min max first last all any are easier to guess. The other question is what to do about the dtype of the fillvalue - I have chosen to enforce bools for the fillvalue of any all, but the user may legitimately want some form of nan or pseudo-nan (I remember reading a discussion about efficient nans for non-floats somewhere in the pandas docs, or maybe it was to do with numpy's MaskedArray..anyway there's no simple answer to that question). If the fillvalue does require using float in the output, it may still be more efficient to do the main computation with some form of int and then only upgrade to float at the end. Perhaps the easiest thing to do is upgrade all the ambiguous stuff to at least float32 and then encourage the user to request a simpler dtype if they feel that would suffice...but then what about detecting overflows etc..I'm not sure what error handling of that kind is built into numpy?

2. Support for non-flat vals, matched to idx. Although matlab allows this, I would prefer not to. Assuming both idx and vals have the same memory layout, it costs nothing for the user to flatten them, and it isn't too ugly syntactically. Enforcing that then makes it simpler to identify when 2D idx is being used to request multidimensional output...this was always a bit confusing in matlab...see next point...

3. Support for multidimensional output, using 2D idx. This is available in matlab and is a very helpful feature. I use it a lot for 2d binning, but it could be n-dimensional. I feel strongly that we should provide it, at least in the API (i.e. throw NotImplemented if need be..but it's actually very easy to do once numpy is involved!).

4. Support for scalar vals - I wrote a small section on this in the readme. I think it's not really that useful in general, but helpful for sum with vals=1...you might argue that the user could just go and use bincount, but Matlab accepts scalars, and this version of accumarray can help with multidimensional bincounting, so it's probably a feature worth having. If need be, you could put a simple hack at the top of the function which expands the scalar out to a full vector..or throw NotImplemented.

5. Support broadcasting vals to idx. This is complicated to implement and causes ambiguities if you also want to support 2D idx for multidimensional output (discussed above). Matlab does not offer this, and I've never felt the need to try and use/implement it, so I strongly feel we should explicitly choose not to do it. It also means that the ufunc.at optimizations (if they ever happen) can be kept simpler....to be honest I'm not even sure whether broadcasting makes any sense in the accumarray context?!

6. Support for contiguous/sorted mode. I haven't mentioned this in the readme and I haven't implemented it yet, but I agree that something along these lines would be nice. There are a variety of things you might want; for example if the user says that idx is pre-sorted, various optimisations may be possible (i.e. the code still treats the inputs the same way, but is now able to assume that groups are already laid out contiguously in memory). Alternatively the user may want your contiguous concept, where idx is no longer an actual index, but now just a general label for contiguous sections. A variation on this would be to say that sections where idx=0 should be treated as belonging to a common "null" group, but all other contiguous blocks should be treated as being labeled 1, 2, ..., n. ..this can work nicely if idx is a bool array, as each true block is treated as a new label, but all the false blocks are grouped together...but sometimes the user will want a way to have two different groups adjacent to one another (i.e. no false element between them)..which you can't do with a simple bool array....anyway, the point is that there's a lot you could do here to make the user happy, some of which will offer extra optimizations, and some of which would just allow the user to be lazier.

7. User-specified size and bounds checking - I think it's definitely worth allowing the user to specify the size of the output; Matlab does this. I often need it when I'm doing analysis, because I use accumarray twice, the second time with a subset of the original data, and I want the outputs to be the same size. I'm not sure exactly how much control you can give the user over bounds-checking...I guess with your weave stuff you could completely disable all checking if the user really wanted. If that's something you want to offer we should encapsulate it in a proper kwarg, however I think in general people will expect basic checking, as I think all numpy code does.

edit 0: by the way I don't like your choice of "incontiguous" as default...the english is _non_contiguous, but it's a bit of a confusing choice of default anyway. I also haven't looked into the details of what your downscaled mode offers...I'm not sure where it fits into the picture.

edit 1: Some of the above stuff probably warrants common scripts for numpy and weave, though pure python solutions would have to be kept separate (and in many cases may be simple/not applicable at all). I'm not sure of the best way to deal with that, given that we want to try and keep stuff as separate as possible...there may need to be some kind of compromise.

One thing that should be fairly easy to agree on is the order of args; I vote for following Matlab exactly and then adding on all the extra kwargs after that (order, dtype, mode).

Also, we should agree on the aliasing rules and exactly what the nan- versions of functions are supposed to do. I tried to describe that fairly explicitly in the readme...let me know if you disagree and/or want to add stuff.

Regarding the name of the main two arguments, I definitely prefer idx and vals, but if we permit a range of modes then idx is no longer necessarily such a great choice...but I'll stick with it for now!

Regarding the weave stuff, is there a good place to point people to in order to get it installed quickly...so far I haven't said anything about this in the readme...it's probably worth giving people a one-sentence explanation as to exactly why weave is so good.

I think you also mentioned a numba implementation..what did you say the status of that was? And what about a Cython one?..although I'm not sure there's going to be much to be gained from Cython. I have a small amount of experience with Cython and a tiny amount of experience with numba, but I don't feel like I really know enough to compare the pros and cons of each...if you are going to provide all these implementations, we should have a proper explanation as to why you might want each of them....I guess one of the advantages worth mentioning is that (some/all?) of these things work nicely on the GPU...though I think the simplicity of going to the GPU varies greatly?

In terms of making the C versions more useful to people, might it be worth providing some binaries..I guess that's a pretty big job, but it might increase uptake.

edit 2: Ok, I didn't realise that the non-numpy implementation did actually use numpy. ...I've now re-written it to be truly pure-python. It's relatively simple, though obviously not exactly performant.

edit 3: I think it may be worth trying to wrap pandas. It looks like pandas uses cython under the hood for its groupby - see ..pandas/ib.pyx and ..core/groupby.py

edit 4: Right, I've implemented a simple pandas version. I didn't spend any time trying to squeeze the best performance out of pandas, so it may be possible to do better. I've put benchmarking stats in the readme, but the basic story is that it doesn't beat the optimized numpy version, except where numpy has to rely on ufunc.at (i.e. min max prod). I've also implemented a ufunc-at version, for use with benchmarking.