ml31415 / numpy-groupies

Optimised tools for group-indexing operations: aggregated sum and more
BSD 2-Clause "Simplified" License

Create a single API/test/benchmarking but with multiple independent implementations #2

Closed d1manson closed 9 years ago

d1manson commented 9 years ago

I've created a separate branch for working on this.

So far I've spent a few hours trying to write a fairly comprehensive readme, plus I've done some very basic work to split up the implementations into separate files. Note though that I've made basically no attempt at actually enforcing the API yet, or even checking if the code runs ok... it doesn't!

Obviously there is still quite a bit of work to be done.

In terms of agreeing on a single API, I think the main questions are:

1. What to do with dtypes? It's always nice to give the user full control over this, but it's also great when the program is able to automatically choose the most sensible option. For example, summing int8s probably needs to be upgraded to int32 or int64, maybe even float. The program could guess what is best based on some heuristic, although this would add extra time. The simplest heuristic would be to assume the worst case, i.e. all values are already at the maximum for the current bitdepth and all will be added to the same group; thus if len(idx)=100 and vals.dtype=int8 then we assume the sum is going to need log2(100*255) bits (a rough sketch of this heuristic appears after this list). This calculation is specific to sum; mean, var and std have different requirements, and min, max, first, last, all, any are easier to guess. The other question is what to do about the dtype of the fillvalue - I have chosen to enforce bools for the fillvalue of any and all, but the user may legitimately want some form of nan or pseudo-nan (I remember reading a discussion about efficient nans for non-floats somewhere in the pandas docs, or maybe it was to do with numpy's MaskedArray... anyway, there's no simple answer to that question). If the fillvalue does require using float in the output, it may still be more efficient to do the main computation with some form of int and then only upgrade to float at the end. Perhaps the easiest thing to do is upgrade all the ambiguous stuff to at least float32 and then encourage the user to request a simpler dtype if they feel that would suffice... but then what about detecting overflows etc.? I'm not sure what error handling of that kind is built into numpy.

2. Support for non-flat vals, matched to idx. Although matlab allows this, I would prefer not to. Assuming both idx and vals have the same memory layout, it costs the user nothing to flatten them, and it isn't too ugly syntactically. Enforcing that also makes it simpler to identify when 2D idx is being used to request multidimensional output... this was always a bit confusing in matlab... see next point...

3. Support for multidimensional output, using 2D idx. This is available in matlab and is a very helpful feature. I use it a lot for 2d binning, but it could be n-dimensional. I feel strongly that we should provide it, at least in the API (i.e. throw NotImplemented if need be... but it's actually very easy to do once numpy is involved!).

4. Support for scalar vals - I wrote a small section on this in the readme. I think it's not really that useful in general, but it is helpful for sum with vals=1... you might argue that the user could just go and use bincount, but Matlab accepts scalars, and this version of accumarray can help with multidimensional bincounting, so it's probably a feature worth having. If need be, you could put a simple hack at the top of the function which expands the scalar out to be a full vector... or throw NotImplemented.

5. Support broadcasting vals to idx. This is complicated to implement and causes ambiguities if you also want to support 2D idx for multidimensional output (discussed above). Matlab does not offer this, and I've never felt the need to use or implement it, so I strongly feel we should explicitly choose not to do it. It also means that the ufunc.at optimizations (if they ever happen) can be kept simpler... to be honest, I'm not even sure whether broadcasting makes any sense in the accumarray context?!

6. Support for contiguous/sorted mode. I haven't mentioned this in the readme and I haven't implemented it yet, but I agree that something along these lines would be nice. There are a variety of things you might want. For example, if the user says that idx is pre-sorted, various optimisations may be possible (i.e. the code still treats the inputs the same way, but is now able to assume that groups are already laid out contiguously in memory). Alternatively, the user may want your contiguous concept, where idx is no longer an actual index, but just a general label for contiguous sections. A variation on this would be to say that sections where idx=0 should be treated as belonging to a common "null" group, but all other contiguous blocks should be treated as being labeled 1, 2, ..., n. This can work nicely if idx is a bool array, as each True block is treated as a new label, but all the False blocks are grouped together... but sometimes the user will want a way to have two different groups adjacent to one another (i.e. no False element between them), which you can't do with a simple bool array. Anyway, the point is that there's a lot you could do here to make the user happy, some of which would offer extra optimizations, and some of which would just allow the user to be lazier.

7. User-specified size and bounds checking - I think it's definitely worth allowing the user to specify the size of the output; Matlab does this. I often need it when I'm doing analysis because I use accumarray twice, the second time with a subset of the original data, and I want the outputs to be the same size. I'm not sure exactly how much control you can give the user over bounds-checking... I guess with your weave stuff you could completely disable all checking if the user really wanted. If that's something you want to offer, we should encapsulate it in a proper kwarg, but I think in general people will expect basic checking, as all numpy code does. edit 0: by the way, I don't like your choice of "incontiguous" as a default... the English is _non_contiguous, but it's a bit of a confusing choice of default either way. I also haven't looked into the details of what your downscaled mode offers... I'm not sure where it fits into the picture.
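
Returning to point 1, a rough sketch of the worst-case heuristic for sum (the function name and the signed-int ladder here are illustrative only, not a finished implementation):

import numpy as np

def _minimum_sum_dtype(idx_len, vals_dtype):
    # worst case: every value is at the dtype's maximum magnitude and all
    # land in the same group, so the sum needs log2(idx_len * max_abs) bits
    info = np.iinfo(vals_dtype)
    max_abs = max(abs(int(info.min)), int(info.max))
    bits_needed = np.log2(idx_len * max_abs)
    for candidate in (np.int16, np.int32, np.int64):
        if bits_needed < np.iinfo(candidate).bits - 1:  # leave room for the sign bit
            return np.dtype(candidate)
    return np.dtype(np.float64)  # even int64 could overflow: fall back to float

# e.g. 100 int8 values: log2(100 * 128) is ~13.6 bits, so int16 is enough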


edit 1:

Some of the above stuff probably warrants common scripts for numpy and weave, though pure python solutions would have to be kept separate (and in many cases may be simpler, or not applicable at all). I'm not sure of the best way to deal with that given that we want to try and keep stuff as separate as possible... there may need to be some kind of compromise.

One thing that should be fairly easy to agree on is the order of args: I vote for following Matlab exactly and then adding on all the extra kwargs after that (order, dtype, mode).

Also, we should agree on the aliasing rules and exactly what the nan- versions of functions are supposed to do. I tried to describe that fairly explicitly in the readme...let me know if you disagree and/or want to add stuff.

Regarding the name of the main two arguments, I definitely prefer idx and vals, but if we permit a range of modes then idx is no longer necessarily such a great choice...but I'll stick with it for now!

Regarding the weave stuff, is there a good place to point people to in order to get it installed quickly? So far I haven't said anything about this in the readme... it's probably worth giving people a one-sentence explanation as to exactly why weave is so good.

I think you also mentioned a numba implementation... what did you say the status of that was? And what about a Cython one? Although I'm not sure there's going to be much to be gained from Cython. I have a small amount of experience with Cython and a tiny amount of experience with numba, but I don't feel like I know enough to compare the pros and cons of each... if you are going to provide all these implementations, we should have a proper explanation as to why you might want each of them... I guess one of the advantages worth mentioning is that some (all?) of these things work nicely on the GPU... though I think the simplicity of going to the GPU varies greatly?

In terms of making the C versions more useful to people, might it be worth providing some binaries? I guess that's a pretty big job, but it might increase uptake.


edit 2:

Ok, I didn't realise that the non-numpy implementation did actually use numpy... I've now re-written it to be truly pure-python. It's relatively simple, though obviously not exactly performant.

edit 3:

I think it may be worth trying to wrap pandas. It looks like pandas uses cython under the hood for its groupby- see ..pandas/ib.pyx and ..core/groupby.py


edit 4:

Right, I've implemented a simple pandas version. I didn't spend any time trying to squeeze the best performance out of pandas, so it may be possible to do better. I've put benchmarking stats in the readme, but the basic story is that it doesn't beat the optimized numpy version, except where numpy has to rely on ufunc.at (i.e. min, max, prod).
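
For reference, the pandas version amounts to something along these lines (a rough sketch, not necessarily the exact code in the branch):

import numpy as np
import pandas as pd

def aggregate_pandas(idx, vals, func='sum', sz=None, fillvalue=0):
    # let pandas' cython groupby do the heavy lifting, then scatter the
    # per-group results into a dense output array
    grouped = pd.DataFrame({'idx': idx, 'vals': vals}).groupby('idx')['vals']
    agg = getattr(grouped, func)()  # e.g. grouped.sum(), grouped.min()
    res = np.full(sz if sz is not None else idx.max() + 1, fillvalue,
                  dtype=agg.values.dtype)
    res[agg.index] = agg.values
    return res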

I've also implemented a ufunc-at version, for use with benchmarking.

ml31415 commented 9 years ago

Wow, great job so far! Sorry for the delay, I was at a conference over the weekend. I'll go through step by step; hope I don't miss anything.

1. dtypes We should do our best to provide some reasonable defaults, and it has to be the same default for all implementations, so this needs to go into a separate function. On the other hand, I guess it's fair to just take int64/float64 for sum/prod, and let the user choose to reduce this if they know the result will be smaller than that. Any heuristic will require inspecting the data beforehand: the better the guess is to be, the more time it costs, and the cheaper the guess, the more prone to overflows it will be. I wouldn't want to do that.

For bool and int, I wouldn't bother with a separate nan type. If in doubt, the user can run anynan over the same data and use that as a mask afterwards. End of story. In general, the fillvalue must be representable in the selected output dtype, so the fillvalue should also be an input to the shared dtype-selection function.

2. Matching vals to idx Fully agreed. KISS principle.

3. ndim Yes, should be in the API.

4. Scalar vals The simple hack sounds reasonable. Not sure I'd bother spending much time writing optimized code for that rare case.

5. Broadcasting KISS principle, let's forget this.

6. Contiguous mode The downscaled thing mainly compresses the output, skipping all empty fields, by running unique over accmap first. It could be supported just like the scalar vals, by simply modifying accmap before actually running the main action. It's a nicety, and I won't bother throwing it out either.

For the contiguous mode, this was my initial implementation, requiring an already-sorted input. After hacking around a bit longer and improving the implementation for unsorted accmaps, it turned out that the speed advantages were not as great as I had thought, and I could get comparable speeds for arbitrary accmaps. It's basically one pointer indirection that can get skipped when accessing the data, and it looks like prefetching etc. is good enough that this is not a big cost anymore. As it is mainly a subcase of the unsorted map, it would make sense to throw it out completely.

Renaming incontiguous to something better, I'm totally fine with that. I'm not a native speaker, so pardon the naming.

For your other suggestions on this, I'd say KISS principle again. If the user wants some 0-group, they can simply set some 0-values in the accmap before running accumarray over it.

7. Bounds checking As the functions in the inner loop only add simple values, the whole process is mainly memory bound, so a bounds check is not a great overhead. E.g. nansum and sum run at nearly the same speed, even though nansum has an additional val != val comparison inside. So for the sake of simplicity, and as segfaults are really ugly, I wouldn't offer a possibility to turn bounds checking off. It should be safe to use the library, even if there are some spooky values in your accmap.

So, I'm totally fine adding the sz argument.

8. Input variable naming vals, idx, accmap, a ... I like descriptive variable names whenever possible. idx is quite generic, used millions of times in any code; it doesn't describe well what it is meant for. Iterating over it, idx[i] ... what do I get? accmap is more descriptive: it's a map of how to accumulate vals, an accumulation map. I'd clearly prefer that over idx. Matlab calls it subscripts, which may be another good option. vals vs a: neither gives much information, so they are equally bad I'd say. Though I don't have any better suggestion right now, I'm fine with both names. Finding something better couldn't hurt though.

It may be a good idea to change fillvalue to fill_value, as used in numpy e.g. for masked arrays.

ml31415 commented 9 years ago

For getting scipy weave to run, it's actually a single line on linux, no more than "apt-get install python-scipy", and everything works. I also know what kind of pain in the ass it can be to get toolchains or math libraries installed on windows. I haven't used windows for ages and can't help too much with setting up scipy for it; the documentation there should cover it, I hope. What weave generally does is create C code on the fly, depending on the input data type, and store the compilation results for reuse. A working C compiler is necessary for all this to work, so there is no way to provide binaries that would run without a working toolchain.
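
As a toy illustration of what that looks like in practice (assuming scipy.weave and a C compiler are available; weave is Python-2-only and was later removed from scipy):

from scipy import weave

a, b = 1, 2
# the C snippet is compiled on the first call; the cached binary is
# reused on subsequent calls with the same argument types
print(weave.inline("return_val = a + b;", ['a', 'b']))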

I also wouldn't worry at all about the purepy implementation. Any sane person trying to do some math with python will install numpy. If not - not my problem. No need to reinvent the wheel just for a demonstration version of accumarray. If some things don't work there, so be it.

In terms of aliasing, you had the clearer approach. We should have a shared function converting any func keyword input into a unique function name string, which every implementation then has to deal with itself. Your implementation is a good start for that.
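
Something like the following is the shape of it (a sketch only; the real alias table would be more complete):

import numpy as np

# map every accepted spelling of a function onto one canonical string,
# which each implementation then dispatches on internally
_ALIASES = {np.sum: 'sum', sum: 'sum', 'add': 'sum',
            np.amin: 'min', min: 'min',
            np.amax: 'max', max: 'max',
            np.mean: 'mean'}

def get_func_str(func):
    if isinstance(func, str):
        return _ALIASES.get(func.lower(), func.lower())
    return _ALIASES.get(func, 'custom')  # unknown callables fall back to a generic loop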

d1manson commented 9 years ago

Your English is pretty much perfect (I actually spent a year in Moscow as an EFL teacher so I should know..it's certainly a lot lot better than my Russian!).

Ok...

I think we need to flesh out accumarray_utils.py with the common numpy stuff... but bear in mind that I think it is a nice feature to support a fully non-numpy implementation... there are some environments where it might be a pain to get numpy working, but where it's nice to have a simple accumarray for smallish datasets. Anyway, the code is pretty short and simple, so I don't think it's a big deal... it's also quite fun to benchmark it against numpy!

1. dtypes bool - agreed. sum - it's pretty simple to get a safe bound just using len(vals) and the number of bits in vals; for example, 30,000 int16s can be safely summed into an int32, or 60,000 uint16s into a uint32... but yes, I think we should put all the logic in one utility function and work on it there. With regards to dtype+fillvalue, my _get_minimum_dtype function is a basic first draft.

6. modes So, are you suggesting that the only thing we should support is something like this:

if <whatever_the_special_test_is>:
    # mark which output slots are actually used
    idx_is_used = np.zeros(sz, dtype=bool)
    idx_is_used[idx] = True
    # running count of used slots before each position: relabels them 0..n-1
    new_idx = np.cumsum(idx_is_used) - idx_is_used
    idx = new_idx[idx]
    sz = idx.max() + 1  # note: max(idx) alone would be off by one for a size

If that's what you mean, then could we treat it as a special case of fillvalue, e.g. fillvalue=None means drop all the missing values... although for multidimensional indexing you could still end up with some empty spots... maybe we could just say that it's not supported for multidimensional output. The only other reason not to use fillvalue=None would be for the func='array' or func='sorted' cases, because there the user may actually want to use None as the fillvalue... but I don't think the user will be too annoyed if they can't have that.

Just to clarify, if the above is true, then there is no mode needed at all, right?

8. variable naming I really didn't like a, but I guess you are right that vals is equally meaningless!... not sure what else to suggest there. But for the indices, how about dest_idx (for destination index - this is in the spirit of C/asm-like naming where you provide output pointers as input args)? I felt like "accumulation map" was a bit confusing - it sounded to me either like the name of a function/lambda, or else like the output of the accumulation (rather than the input to it).

I'm probably not going to do any more work on this in the next few days, so you can safely commit to the branch I was working on and it shouldn't be too confusing.

d1manson commented 9 years ago

9. benchmarking and testing I'm sure your stuff was good, but I was lazy and ended up basically starting from scratch in a separate file rather than trying to understand everything you were doing. As discussed initially, we should aim to get testing to a pretty high standard.

ml31415 commented 9 years ago

By "start from scratch" you mean the test_and_bench_simple module? I agree that testing and benchmarking should be separated. I'm also fine if the pytest testing part is up to me. I guess it could use some more comments; it's surely not too intuitive what's going on there. For basic understanding, I'd recommend reading up on pytest fixtures and parametrization.

accmap_all and accmap_compare are different fixtures. accmap_all simply returns the different accum implementations for the tests. Those are the first bunch of tests, testing basic functionality that has to be guaranteed by all implementations. This could also be achieved by parametrizing all these tests separately with all implementations; parametrizing the fixture just does the same thing in a more convenient way.

accmap_compare provides a whole namespace with a bunch of testing ingredients: data with and without nan (depending on what is to be tested), a reference function, etc. test_compare then runs all the functions listed in func_list against any combination of implementations provided by the accmap_compare fixture.
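
The accmap_all mechanism boils down to something like this (a self-contained model; the two implementations here are dummy stand-ins, not the real ones):

import numpy as np
import pytest

def accum_py(accmap, a):
    # trivial pure-python stand-in
    res = [0] * (max(accmap) + 1)
    for i, val in zip(accmap, a):
        res[i] += val
    return res

def accum_np(accmap, a):
    # trivial numpy stand-in
    return np.bincount(accmap, weights=a)

@pytest.fixture(params=[accum_py, accum_np], ids=['purepy', 'numpy'])
def accmap_all(request):
    # parametrized fixture: every test using it runs once per implementation
    return request.param

def test_basic_sum(accmap_all):
    accmap = np.array([0, 0, 1, 2, 2])
    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    np.testing.assert_array_equal(accmap_all(accmap, a), [3.0, 3.0, 9.0])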

ml31415 commented 9 years ago

fillvalue=None would be fine by me. This is what is done right now in that case:

accmap = np.unique(accmap, return_inverse=True)[1]  # relabels groups as 0..n-1, dropping empty ones
ml31415 commented 9 years ago

Btw., please let's be more specific with the thrown errors in the future rework. It's really bad practice to only throw Exception. If I want to catch any common problem with values or types that might arise, and handle it gracefully outside of accumarray, that's not possible right now. Catching any "harmless" error currently requires a catch-all statement, which also catches all sorts of serious runtime errors, memory errors, io errors etc. that you're probably not able to handle gracefully.
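
i.e. something along these lines (using the names from this thread):

import numpy as np

# inside accumarray: raise something specific instead of a bare Exception
if np.amax(accmap) >= sz:
    raise ValueError("accmap contains indices >= size")

# caller side: value problems can now be handled gracefully, while
# MemoryError, IOError etc. still propagate as they should
try:
    res = accumarray(accmap, a, sz=sz)
except ValueError:
    res = None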

d1manson commented 9 years ago

that all sounds good to me..

I stand by what I said above - I'm not going to touch anything for a couple of days at least so it should be easy for you to commit straight onto this branch

ml31415 commented 9 years ago

Ok, I'll see where I get with this in the next few days.

I guess the last open thing is a final decision in terms of variable naming. I got the names initially from here: http://wiki.scipy.org/Cookbook/AccumarrayLike . So it's not really my invention. As it's probably the oldest public python accumarray implementation (dating back to 2010) and was promoted on scipy, it likely got some reach. I'd rather stick with that than invent something completely new without good reason. dest_idx seems similarly descriptive to me, but not really an improvement over accmap. Maybe one more suggestion for how to see accmap more intuitively: different values in accmap mark and group regions within a, just like different colors mark countries on an ordinary map.

a is, besides being used in the cookbook, quite commonly used throughout numpy and scipy for any array-like input. cumprod, all, copy, zeros_like and most of linalg, just to name a few, use a for that. Sticking with this naming scheme looks like the best idea to me, if there is no clearly better name. If we want to stress the fact that we also support scalars as the second argument, following this logic we might use x.

For some unclear reason I had written fillvalue without the underscore, even though it's commonly used with it. This should be fixed to fill_value.

About the order= keyword, I guess this should not be added here. The conversion from C to F is cheap and easy, and can be done manually afterwards if in doubt. This would save one more keyword. If I saw it correctly, there wasn't any logic implemented for it so far. For the input, C/F can be detected automatically and acted upon accordingly. Maybe the output should just copy the ordering from the accmap?

impl_dict= shouldn't be part of the public API, as you already wrote in the docstring, so I guess you don't mind prefixing an underscore there.

About the keyword order, I initially would have tended to agree to keep the matlab ordering. Thinking about it now, I guess this should be discussed again. I was once told, when I suggested some accumarray thing for numpy, that numpy is not matlab and does not need to repeat its quirks. func seems to me to be clearly the most used keyword argument, nearly worth being a mandatory, positional argument. size is something useful, but probably used even less frequently than fill_value. So while we could keep the matlab-style order for fill_value and size, we may want func as the first keyword argument, just to allow it to be used without explicitly writing func= in front of it. And if we're not leaning too close to the matlab implementation, it may be clearer to use size= instead of sz=. All in all, the cookbook version syntax seems to be well thought through ...

If we could agree on all this, we would finally end up with:

def accumarray(accmap, a, func='sum', size=None, fill_value=0, dtype=None):
    ...

These are the conventions we talked about already, just to sum it up:

Are you fine with this?

d1manson commented 9 years ago

In no particular order...

I suspect that the cookbook accumarray hasn't been that widely adopted, because otherwise it would have been improved upon already... and there's not that much activity on the SO question. When we get this repo sorted out we should endeavor to get it linked from the cookbook page, so people don't find that page and think it's the best thing out there.

C/F - I'm not sure we're on the same page here... I have used it in the calls to ravel_multi_index and reshape at the start and end of the function respectively. I thought it would be nice for the user to be able to specify the layout of the ndarray that accumarray is creating from scratch. Admittedly the user can control this "manually" by passing in accmap[::-1,:] to flip the order of the output if need be, but that's a bit clumsy. Personally, my feeling is that order is a well-understood, easily-ignored kwarg in numpy, so I don't feel there's anything wrong with providing it here.

With the first two args, I guess you're right that a is used elsewhere in numpy, so maybe I should accept it. But I really don't like accmap! ufunc.at uses indices, which is basically what I am suggesting too. Here are some more options: group_idx, grouping_idx, target_idx, out_idx, at_idx, or maybe a_label, a_group, a_idx, a_index, a_dest_idx... anything else?

I'm not too fussed about whether or not we match up with Matlab beyond the first two args, as long as we provide basically the same functionality. Python's superior argument-passing syntax is already doing a pretty good job - my feeling is that different users will use the other kwargs to different extents, so it's not easy to provide a one-size-fits-all optimal ordering... hence the thinking that sticking to Matlab's order is maybe not a terrible idea... but as I say, I'm not too fussed.

I'm not sure what you mean by "a[accmap >= size] is treated as nan"... where accmap >= size, the boundary check should throw an exception, surely? If you did treat those values as nans, then where would you put them in the accumulated output?

To clarify the nans-in-bools-and-ints question... it's only relevant for the fill_value. And note that the user might want to choose +-Inf as the fill value as well as nan. My feeling on this may have changed a bit... perhaps we should upgrade to at least float32 if the user insists on non-finite values in their output array, both for bool and int. By default the fill_value is 0, so it would be the user's own doing if they force the change in the dtype. We could always implement it as a second step at the end, where we convert from int/bool to float and set the fill_value.

And what did you mean by the last point - "no implicit matching of a to accmap"?

ml31415 commented 9 years ago

C/F ordering is not a big deal to support in pure numpy operations, as numpy takes care of all the iteration issues related to it. Having to take care of this in C is quite an annoyance that I wouldn't like to deal with without good reason. Anyone who really cares about the order can easily change it afterwards, imho. I once had a bug in the C version that didn't correctly recognize F-ordered arrays and produced wrong results, which is why I said autodetection (and acting on it) must be guaranteed. Nothing to do for the numpy version here.

Let's pick group_idx then. Seems like a well descriptive choice.

What I meant by "treat like nan" was: ignore these values. If the size of the output is already given with size= and not determined by the biggest value of group_idx, there is no place in the output to accumulate them. It's like cropping the output. It's the current behaviour, nothing special; I just wanted to mention it here.

If dtype is specified, I guess we agree that specifying an invalid fill_value should simply raise a ValueError. If the user specifies dtype=int and fill_value=-inf, what else could we do with it?

For vaguer situations - fill_value=-inf for a bool operation, no dtype specified - there are points for both sides. I don't like too much implicit behaviour. Who knows where the fill_value came from; some calculation that went wrong? A bool operation simply doesn't produce nan (or any other float) as an output, and neither should accumarray, if not explicitly told to do so. Same with sum/prod on ints and float values as fill_value. If the user insists on specifying +Inf as a fill_value, I'd expect at least the output dtype to be properly specified, so it's clear what should happen. The analogy would be assigning float values to an int array: if it's a more or less sane operation, the value is rounded and stored; if it's nan/inf, there is an error. Anyway, this is probably too much detail; we should probably skip this point until there is some implementation of the disambiguation function.

No implicit matching: if the shapes don't fit together for a sane operation, there shouldn't be any fancy magic making it somehow work, just a clear error.

ml31415 commented 9 years ago

One more thing. I totally don't care about my name being presented anywhere in the code, readme, whatever, or about someone copying. This is such a small snippet, not worth claiming any authorship on, imho. So I'd suggest scratching any name mentions in the readme and the code. The git history is descriptive enough if someone really cares. If you'd prefer to have it in the repo itself, I'd rather suggest something explicit for that, like a contributors.txt.

d1manson commented 9 years ago

I don't like silently cropping out-of-range indices. It's likely to hide bugs in users' code. If they really want to do that, they can do a simple masking step before they pass the values in to accumarray - we could give an example of this in the docs, something like the sketch below. Alternatively you could provide an ignore_out_of_range kwarg, so that it's very clear they are potentially losing data. bincount uses the minlength approach, which lets the user specify the minimum size and then returns a larger array if necessary... that's also a bit weird, but at least you don't silently drop data. ...I'm saying we should throw an exception for out of range.
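
For instance (a hypothetical doc example, using the names we've agreed so far):

import numpy as np

# explicitly drop out-of-range groups instead of having accumarray crop them silently
mask = group_idx < size
result = accumarray(group_idx[mask], a[mask], size=size)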

I think I agree with everything you just said about dtypes, let's see what the code ends up looking like and discuss further if need be.

With the order I'm still not sure we're talking about the same thing - you can't autodetect the order of the output, because it doesn't exist yet... you can autodetect the input, and you should, but I leave that up to you. Regarding the output, there are a few cases where the user would be grateful to be able to choose which dimensions are contiguous in memory.

I'm not sure how you were planning on doing the multidimensional indexing, but if you use ravel_multi_index and reshape, then you actually don't need to worry about the order in C, because you only have to deal with 1D indexing. If you're keen to write a single-pass loop over the data, then I think it's fairly easy to inline the ravel_multi_index - it's really just a dot product of the index with the strides of the output.
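
Concretely, the numpy route I mean is roughly this (a sketch; outsize stands for the requested output shape):

import numpy as np

outsize = (48, 48)  # hypothetical 2d output shape; group_idx is a (2, n) array
# ravel_multi_index is essentially the dot product of the index with the strides
flat_idx = np.ravel_multi_index(group_idx, outsize, order='C')
flat_res = np.bincount(flat_idx, weights=a, minlength=np.prod(outsize))
res = flat_res.reshape(outsize, order='C')  # this is where order='C'/'F' would apply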

Regarding authorship - I'm fine with removing all references in code, but it would be good to leave one line in the readme somewhere...not a big deal though.

ml31415 commented 9 years ago

Have just one single large value in group_idx (maybe due to an overflow), and your python interpreter is likely to die with a memory error, as it cannot allocate such a large output array. Fine with me; on the other hand, cropping is exactly the desired effect of specifying size at all - it prevents unwanted memory blow-ups if you already know your data. If cropping is undesired, use fill_value=None to downscale unreasonably large group indices.

About ordering for ndim output, I would have defaulted to C and left it to the user to wrap it in np.asfortranarray(accumarray(...)) if desired. I don't work with fortran, so the only time I had to deal with fortran ordering was when importing some matlab data. But it's quite common in numpy to have it as an output option, so well then, let's include it.

d1manson commented 9 years ago

Cropping is easy to use and easy to understand, but I feel very strongly that silently dropping some data is a recipe for disaster... I don't know what kind of data you deal with normally, but in my field you often find weird data that needs to be carefully dealt with, rather than simply having "bad" values swept under the carpet... using fill_value=None doesn't solve this, because then you don't know what each of the output bins corresponds to. The behaviour I'm suggesting is in line with Matlab, so I don't think it's that odd a choice... but if you really want to offer the option of dropping these values, then I'm absolutely fine with making it a kwarg - I can add a masking step at the start of the code, or just throw NotImplemented.

ml31415 commented 9 years ago

For bincount, I'm not getting the point of the minlength argument at all, to be honest. I haven't used matlab in a while and don't have a working installation here. If it simply throws an error for group indices out of range ... why not. I just checked our codebase: within roughly 400 different calls to accum, not a single one used the size keyword. I had basically implemented it as it was in matlab, though the details aren't well described in their manual pages. I guess it makes more sense in an ndim context anyway.

In terms of downscaling, getting the corresponding original indices is possible using unique, as described above. Maybe we should include return_indices= for that case? Or, for the sake of KISS, drop downscaling completely and offer accumcompressed or something like that as a wrapper function instead?

This may well fit together with some other little helpers we're currently using like uaccum, for accum and unpack in one go, and a bunch of other utils.

Looks like we're pretty much through then. One last thing: I'd be a big fan of staying with accum as the name. It's just much shorter, and array isn't carrying any new information anyway - what else would you accumulate if not some sort of array? The only reason to stick with accumarray would be that matlab does it, and matlab itself is not exactly well known for an elegant and consistent naming scheme. accum seems more like the concise numpy way of naming.

d1manson commented 9 years ago

Regarding size - I think that just goes to show that there are many ways of and reasons for using accumarray... it's such a versatile function! [The minlength in bincount is a bit odd, as I say, though if you cap your indices before passing them to the function, it is kind of useful, because you are then basically specifying the size. ...and specifying size is useful if there is a well-defined, finite number of groups, but you don't necessarily have data for all of them.]

I agree with the concept of a helper library. Potentially you could attach the helpers as attributes of the main function, e.g. accumarray.dropping_empty_groups(...) or whatever - it could be nice in terms of readability, and it's similar to numpy's ufunc.at, ufunc.reduce etc. But maybe it's easier to keep them separate from the main function - then we don't have to be so strict with API definition and testing etc. ...I did find your choice of naming a bit unintuitive, though I guess it's difficult to describe such a specific operation in so few words.

Anyway, to address the question of accum versus accumarray... I agree that Matlab chose a stupid name. It should have been group, or aggregate, or something, not accum, because that implies summing only. And yes, the array doesn't add anything. And it annoys me that Matlab abbreviates some things and not others [the worst culprit has to be their function prctile]. But I feel there is a lot to be said for sticking with their name... people looking for this will try searching for accumarray first (e.g. the SO question, the name of the scipy cookbook entry, even the name of this repo!), even if they would potentially try a more general search afterwards... even people who have used it but can't remember where they found it would probably prefer not to have to remember another name for it. Also, if you did call it accum, you'd have the problem that it sounds like it only does summing (which, as I say, is a problem with the Matlab name as well, but it would be more confusing as just accum by itself)... really, I think it comes down to a question of "branding": I'm happy to piggyback off the "brand"/"concept" created by Matlab; people always seem to refer to the Matlab function when discussing this, so I feel it's probably worth not confusing the matter by changing the name.

ml31415 commented 9 years ago

Hmm, I just feel like there will be a time after matlab, and probably it has already started. So it may be a good time to get rid of some old luggage. I don't really like compromising with any matlab shortcomings. Pandas goes with group. I don't know if that name is already used in the numpy context, and it doesn't clearly describe what happens either. aggregate seems like the best fit to me. Maybe a bit long.

It's fine if the module keeps accumarray as its name; this way someone finds it when looking for this functionality. Inside the module, I suppose we're free to pick another, more fitting name, more in line with good naming practice and numpy conventions.

d1manson commented 9 years ago

After writing the previous post, I thought about it a bit more and decided that really I would also prefer aggregate.

If we go with that though, then I think we should try and drop all mention of "accumarray", except when referring explicitly to Matlab...so maybe rename the repo to py-aggregate or something...I leave that to you.

ml31415 commented 9 years ago

Agreed with removing all mentions of accumarray. I just wouldn't like to rename the repo right now; Google searches and any accumarray links point here. Maybe later.

d1manson commented 9 years ago

I see there's also this other question that you've answered...looks like matplotlib.mlab.rec_groupby is also something to look into.

ml31415 commented 9 years ago

When breaking with matlab traditions ... something like aggregate(group_idx, a).nansum() could be nice. And maybe also providing ufunc.at syntax ... A wrapper for matlab syntax could also be provided. Maybe I need to rethink the layout a bit more before refactoring all the code, so that we can have one clean implementation and easily provide different interfaces for it.

d1manson commented 9 years ago

Hmm, I'm not sure you gain much syntactically by doing that.

In terms of optimization, I guess there may be something to be said for creating an aggregate object and then executing commands on it - things like sorting, removing nans, calling ravel_multi_index etc. could be done once, and then you can call multiple things like min and max... but that is getting quite complicated and is probably not a particularly common usage case. It might be easier to support func=('min','max'), i.e. if func is a tuple, then return a tuple giving all the requested aggregations. You could then super-optimise your weave stuff to do everything in a single pass if you really wanted to!

ml31415 commented 9 years ago

I'd rather throw out the weave stuff asap ;) Nevertheless, yes, there may be more room for improvement. E.g. collecting max(group_idx) would only need to be done once. Though no one guarantees that nobody messes around with group_idx in the meantime between different calls ... I suppose we should stick with the syntax discussed so far, and see if we want to add something later.

d1manson commented 9 years ago

Well, I do quite like the idea of offering tuple/list support for func... it's easy enough to detect, and to throw NotImplemented if need be, and there are a few optimisations to be had in numpy, though the real gains would be in weave... presumably, computing the min and the max simultaneously would only take a fraction longer than computing just one or the other... though the code might start getting complicated if you wanted to take advantage of that... anyway, I don't think I'm going to touch the weave stuff myself, so it's up to you.

ml31415 commented 9 years ago

I'm not against it, though I don't plan to implement any extensions to the C code besides the 2D indexing in the near future. It's quite a messy thing already anyway.

ml31415 commented 9 years ago

Most testing and benchmarking is working again now after the refactoring. I was a bit surprised by the bad pandas performance compared to the C implementation, as they claim to be fast. Initially I thought maybe the construction of the DataFrame would be a big overhead, but doubling the array size nearly doubled the pandas time. So the initial setup time seems to be quite small, close to negligible compared to the iteration time, and pandas is still a magnitude slower than the C version. Maybe they suffer the same problems as ufunc.at in having a generic solution.

I also added the rudimentary start of my numba implementation. I don't know exactly where I left it anymore. It was some numba versions ago, and I had stopped working on it, as there were some annoying issues with numba that I didn't want to bother with further. Chances are this is fixed by now; it would be worth giving it another try.

My plan for the next days: I want to have the current features and testing working again. Most of the xfails should be gone first. After that, I'll go on with adding 2d group_idx for C etc., ideally in a test-driven way: designing tests first, implementation later.

d1manson commented 9 years ago

It's all looking nice so far.

I think I've actually got a working version of numba installed, so I should be able to play with that as and when it's ready.

One thought... if people don't want the whole repo with all the implementations and testing and benching etc., is there going to be an easy way for them to just download a single file, or perhaps two? I guess we could suggest copy-pasting the contents of utils.py at the top of their implementation of choice, replacing the from utils import .... If that is the best way of doing it, then it would be worth making sure we don't use import something as something_else, because that would stop the copy-paste from working. I notice that you did this with aliasing_numpy as aliasing.

Getting the dtypes right is definitely a bit of a messy question. I notice that there are a few lesser-known functions in numpy which might be relevant, see this Note in the docs. We should probably follow their conventions as far as possible.

ml31415 commented 9 years ago

I don't like the idea of copying the content of utils into every single file at all - way too much duplicate code. The whole idea of utils was the reduction of duplicate code, and it's by far not finished. It may be worth making a standalone gist for single implementations with only the most necessary code, once the code is somewhat stable, yes. Especially the numpy-only solutions may be relevant there.

Whether aliasing_numpy is imported directly or renamed, it will be exactly one line to change after the copying in any case. Actually, I renamed it precisely for the sake of better interchangeability of the remaining code. I wouldn't worry about that too much right now. There will probably be no automated process to make that gist, but it also won't take more than 3 minutes to copy it together, if required.

You're talking about np.can_cast? Lovely function, indeed. I didn't know about it. I had my experiments with using dtype.type, as you've probably seen. I'll leave it like that for now, but having can_cast in there may make life quite a bit easier.
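
For the record, the sort of checks it enables:

import numpy as np

# can_cast could drive both the default output dtype and fill_value validation
np.can_cast(np.int16, np.int32)    # True: int16 values fit safely into int32
np.can_cast(np.float64, np.int32)  # False: e.g. -inf has no int representation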

ml31415 commented 9 years ago

Btw, one of the gains of the other syntax: aggregate(group_idx, a).std(ddof=1) would easily be possible, without passing all kwargs down to every subfunction. The functor could receive its own parameters in a clean way.

d1manson commented 9 years ago

I'm not convinced - it looks more complicated that way, and I think the code will end up more complicated too. Isn't ddof the only extra arg for the main set of functions we have? If so, couldn't we have some special code that passes it through where necessary, or throws an exception otherwise? Alternatively you could implement it as a setting: aggregate.ddof = 1; aggregate(group_idx, a, func='std'). If you keep it the "simple" way, it's easier to support arbitrary functions, although you can of course invent a syntax for that too: aggregate(group_idx, a).using(custom_func) or something.

Incidentally, I was looking at optimising min and max... I haven't managed to do any better than ufunc.at, although if you wanted to compute both min and max, it would only take about 10% longer than computing just one or the other... at least for the benchmarking inputs I've been using. I'm sure there are a few other similar pairings, like mean, sum, and var/std, which can be co-computed more quickly than if they are done separately... as I said before, you could take advantage of this with either syntax, but I reckon it would be a lot easier with the func=<tuple_of_desired_outputs> syntax, because then you can see all the desired outputs in one go and work out the best way of doing everything (which might simply mean looping over each func separately, but could involve one of these numpy optimisations, or a more fancy C optimisation)... if you were using the aggregate().sum syntax, then you'd need to cache stuff inside the object and play all kinds of games to second-guess what the user is trying to do.

To summarise, I'm not completely against it, but I would prefer the simple way.

d1manson commented 9 years ago

For reference, the min+max optimisation is as follows...

Get the indices from argsort(a), which takes almost as long as running ufunc.at. You can then apply those sorting indices to both group_idx and a, and use the result in conjunction with the first and last tricks... or if you already know the counts, you could maybe use the cumsum(counts) trick... but I don't think that's much quicker. [you can't use counts... that's for a different kind of sorting]
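
In code, the trick looks roughly like this (a sketch using the names from this thread; nan handling etc. omitted):

import numpy as np

order = np.argsort(a)          # this argsort costs about as much as one ufunc.at
g, s = group_idx[order], a[order]
mins = np.full(size, fill_value, dtype=a.dtype)
maxs = np.full(size, fill_value, dtype=a.dtype)
maxs[g] = s                    # with repeated fancy indices the last write wins -> max
mins[g[::-1]] = s[::-1]        # reversed, so the smallest value per group wins -> min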

d1manson commented 9 years ago

(Note my previous comment had a logical mistake which I've edited out)

With the copy-paste thing, I wasn't suggesting that we do it - I definitely think we should have a separate utils file. I was just thinking that we could put a note near the top of the readme recommending that, if the user wants one simple file, they do the copy-paste themselves rather than downloading everything. The gist seems like a good option too, although a bit of a hassle for us, especially if we were going to maintain copies of more than one implementation.

ml31415 commented 9 years ago

The func-tuple would also have the benefit that it would be clear how long to keep temporary data: no problems with invalidating a cache, or with changes to the array while cached data wasn't invalidated, and suchlike hassle. If we really wanted to do that, all the functors should become classes, so that they could have attributes declaring which temporary results they produce/require, and the dependencies could be calculated optimally.

Though it would be a major hassle to do that elegantly in C at maximum performance in this generality. I won't do that now, nor in the near future, and afterwards the current numpy C-API will likely be completely dropped anyway. For the other implementations, I see much less optimization potential, as most of the speedup would come from doing many things in one go, within one single iteration. That would only be possible with hand-crafted, auto-generated C code, and maybe, only maybe, with some numba fanciness. In the end, I don't think it is worth it right now.

All in all, I don't think any further syntax change makes much sense right now - neither the one I talked about, nor the tuple thing. So let's just treat that as a collection of thoughts, get the current stuff back working, fix the not-implemented things as well as possible, and make it the new master version.

d1manson commented 9 years ago

Ok, I've put the multiple functions thing in a new issue...https://github.com/ml31415/accumarray/issues/3

ml31415 commented 9 years ago

Ok, done for today. Benchmarks and tests are working again. Tomorrow: 2d for the C funcs + tests. Current benchmarks:

function             py          ufunc          numpy          weave         pandas
-----------------------------------------------------------------------------------
sum             447.804         29.212          1.534          1.692         29.341
amin            464.312         31.669         31.343          2.074         29.153
amax            463.859         31.679         31.512          2.043         28.929
prod            458.839         30.250         30.083          1.597         29.277
all             426.466         35.240          1.802          1.564       1413.038
any             421.425         34.422          4.695          1.465       1421.063
mean            456.021           ----          2.472          1.565         24.063
std             581.687           ----          4.565          2.091        410.885
nansum          566.902           ----          2.940          1.568         25.872
nanmin          578.444           ----         32.391          2.099         25.857
nanmax          579.066           ----         32.849          2.127         29.285
nanmean         624.522           ----          3.752          1.700         25.133
nanstd          699.913           ----          5.918          2.217        409.360
anynan          463.876           ----          1.066          0.995        873.658
allnan          439.034           ----          3.479          3.375        874.282
d1manson commented 9 years ago

Looking at all versus any for numpy, you can see that there is quite a big difference depending on the actual values in the inputs... (those two funcs are basically doing the same thing).

Ideally, we should have a range of different input values and shapes etc. in order to get a better sense of benching.

ml31415 commented 9 years ago

Indeed somewhat strange. Right now it's also not a very useful test for any/all, as there are no False values in the input, just some random float noise.

ml31415 commented 9 years ago

Nice to see what a little path optimization can do. Performance crown for all disciplines back to weave, I'd say. Admittedly, for sum the results are so similar now that it looks like I managed to produce much the same machine code as bincount. all, any, allnan and anynan are about 15-20% faster now. I tried the same approach for min and max, but there it turned out to be a performance penalty.

The current implementation profits if the fill_value equals the neutral element of the operation (0 for sum and any, 1 for prod and all), as setting it can then be skipped.

function             py          ufunc          numpy          weave         pandas
-----------------------------------------------------------------------------------
sum             459.352         29.488          1.550          1.533         28.357
amin            471.674         33.471         31.318          2.046         28.621
amax            479.402         31.737         31.589          2.122         28.598
prod            464.380         30.633         30.023          1.550         29.708
all             431.441         35.358          1.797          1.166       1434.810
any             426.978         34.184          4.720          1.232       1502.974
mean            454.718           ----          2.395          1.914         23.165
std             584.959           ----          4.518          2.109        411.108
nansum          560.355           ----          2.806          1.443         25.624
nanmin          581.653           ----         32.371          2.081         25.305
nanmax          584.505           ----         33.268          2.922         25.701
nanmean         568.045           ----          3.794          1.637         27.326
nanstd          697.972           ----          5.791          2.198        412.248
anynan          460.697           ----          0.966          0.939        865.429
allnan          446.610           ----          3.345          3.326        866.557
d1manson commented 9 years ago

Yes, lookin' good.

Do you know whether you're getting full SIMD vectorisation? I know numpy doesn't always manage it - https://github.com/numpy/numpy/pull/5144 - at least 1.9 doesn't. I'm trying to understand why min and max can't quite compete with sum...maybe there is a real hardware/logical reason for it?

pandas is doing miserably on some of those funcs... even purepy is beating it in many cases, which is ridiculous. Maybe we could ask someone from pandas to take a look and see if they have any thoughts!?

Are you still planning on doing the numba stuff? Even if the results are similar to weave I guess there's something to be said for having both given that people (such as myself) may only have one or the other installed and working.

btw... did you catch that comment I left on utils.get_func_str? I didn't like the idea of reading the name of arbitrary functions - I thought it would be better not to attempt to guess what a user-supplied function is doing simply based on its name.

ml31415 commented 9 years ago

Good morning Daniel! Yes, I saw that comment. I'd be fine with only catching well-known numpy functions. I used to use bottleneck for some operations; I'd probably add their functions to the detection as well, if they are importable, so as not to depend on them.

I can imagine the pandas guys definitely want to iron that out; maybe we should just open a bug report there, either now or after we've made this branch the new master. I saw already that you're very good at that ... ;)

I'm quite interested in getting the numba version done in some mid-term perspective. The C API for numpy that weave is using is deprecated, so with numpy 2.0, afaik, it will be gone. That means I need another speedy replacement, and numba is the only candidate which I hope could end up with similar results as weave.

In terms of SIMD, I turned -O3 on now, which is supposed to allow vectorization. I thought that would have been the default for weave, but it wasn't. Looks like this gave another little speedup. Using a smaller int type for group_idx should help even more, I suppose.

function             py          ufunc          numpy          weave         pandas
-----------------------------------------------------------------------------------
sum             447.070         29.095          1.564          1.425         28.815
amin            457.130         31.428         31.184          2.031         28.935
amax            460.381         31.632         32.000          2.020         28.819
prod            458.022         29.901         30.961          1.437         28.538
all             428.311         35.172          1.800          1.119       1466.945
any             421.640         36.417          4.705          1.190       1417.496
mean            455.721           ----          2.382          1.575         22.925
std             584.937           ----          4.471          2.132        409.972
nansum          567.291           ----          2.826          1.444         25.763
nanmin          577.232           ----         32.418          2.127         25.739
nanmax          580.206           ----         33.473          2.105         25.755
nanmean         572.432           ----          3.652          1.616         25.046
nanstd          701.725           ----          5.747          2.178        410.060
anynan          461.680           ----          0.949          0.956        857.830
allnan          440.435           ----          3.358          3.328        856.977
ml31415 commented 9 years ago

The previous test wasn't that fair. I unfortunately had used Timer.timeit(10), which runs everything, including the setup code, and counts it all together. This means e.g. creating 10 DataFrames for pandas. I changed the test to take the best of 3 now, and kept the instruction cycles the same. Pandas is doing a fair bit better now.
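
Roughly what the fix amounts to (module and array names here are hypothetical placeholders):

import timeit

setup_code = ('from aggregate_numpy import aggregate; import numpy as np; '
              'group_idx = np.random.randint(0, 1000, 10000); '
              'a = np.random.rand(10000)')
timer = timeit.Timer('aggregate(group_idx, a)', setup=setup_code)
# array/DataFrame construction lives in setup, which timeit doesn't bill to
# the measured statement; take the best of 3 runs of 10 calls each
best_per_call = min(timer.repeat(repeat=3, number=10)) / 10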

function             py          ufunc          numpy          weave         pandas
-----------------------------------------------------------------------------------
sum             425.832         26.158          1.452          1.235         16.606
amin            414.143         27.439         27.371          1.388         16.993
amax            417.362         27.863         27.842          1.413         16.707
prod            427.957         26.988         26.913          1.247         16.503
all             403.089         31.748          1.614          0.980        437.865
any             400.536         30.722          4.581          1.057        434.577
mean            423.542           ----          2.179          1.374          9.432
std             534.851           ----          4.993          1.667        131.912
nansum          538.288           ----          3.078          1.310         11.024
nanmin          540.087           ----         29.460          1.506         11.232
nanmax          538.084           ----         29.443          1.554         11.185
nanmean         544.123           ----          3.833          1.412         10.999
nanstd          653.580           ----          6.613          1.722        131.985
anynan          432.374           ----          0.878          0.865        267.062
allnan          403.610           ----          3.507          3.514        266.326
d1manson commented 9 years ago

With SIMD, I guess it shouldn't really be a surprise that there's not much change - neither LLVM nor GCC claims to do any magic for multiple simultaneous reductions, only for single reductions. http://llvm.org/docs/Vectorizers.html#the-loop-vectorizer https://gcc.gnu.org/projects/tree-ssa/vectorization.html#slp-reduc-2 I don't think a smaller data type for group_idx will help with SIMD, because I think all indexing operations are converted to 32/64-bit integers, but it will possibly help with memory bandwidth... less data moving around means it gets to the cpu faster... though reading group_idx from memory may not be the bottleneck.

One thing you could try, if you were really serious, is for the case where the size of the output is pretty small. In that case, you could duplicate each bin, giving one slot for each of the elements within a SIMD register, i.e. for sums your output might be: [0_a 0_b 0_c 0_d 1_a 1_b 1_c 1_d .... n_d]. If the whole output vector still fits into L1 cache (probably 32KB), then I think you probably won't lose much in terms of read/write speed, but you can then compute 4x faster (or however many elements fit in your SIMD register). At the end you have to run over the output vector once, doing a final reduction of the 4 contiguous values into 1, but that's definitely SIMDable, and since it's small and in L1 it should run super fast... plus in some cases your code has to do that final pass-over already.
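
A numpy model of the idea, with np.add.at standing in for the compiled inner loop (the real gain would of course need the SIMD C version):

import numpy as np

n_lanes = 4                                   # slots per SIMD register
lanes = np.arange(len(group_idx)) % n_lanes   # spread input elements across lanes
wide = np.zeros(size * n_lanes)               # [0_a 0_b 0_c 0_d 1_a ...]
np.add.at(wide, group_idx * n_lanes + lanes, a)
result = wide.reshape(size, n_lanes).sum(axis=1)  # final cross-lane reduction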

d1manson commented 9 years ago

I meant to say in the previous comment that one obvious case where the above optimisation is probably used is in image libraries, where they produce histograms of int8 data on each channel... might be worth trying to find an example and seeing whether they do indeed do that.

d1manson commented 9 years ago

Hmm, what I suggested may not be such a great idea: http://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented https://software.intel.com/en-us/articles/understanding-gather-scatter-instructions-and-the-gather-scatter-unroll-compiler-switch

ml31415 commented 9 years ago

I'm also not really an expert at this bit fiddling, and I'm not sure it's worth digging further here. So far I'm more or less fine with the current speed. If it needs to be even faster, it would probably be better to throw the computation onto the graphics card, or to get it multithreaded first. Current benchmarks:

function             py          ufunc          numpy          weave         pandas
-----------------------------------------------------------------------------------
sum             461.919         26.215          1.518          1.297         17.298
amin            458.539         28.502         27.387          1.279         16.884
amax            456.517         27.840         27.952          1.428         16.679
prod            457.963         27.116         27.008          1.317         16.549
all             450.560         31.816          2.992          1.899        432.901
any             448.340         33.128          4.941          1.769        433.360
mean            486.738           ----          2.177          1.393          9.425
std             575.171           ----          5.081          1.687        130.671
nansum          471.691           ----          4.765          1.780         11.637
nanmin          471.319           ----         25.434          1.998         11.628
nanmax          472.273           ----         25.972          2.077         11.703
nanmean         605.367           ----          5.280          1.887         11.770
nanstd          770.958           ----          7.475          2.151        130.542
anynan          406.154         25.931          2.261          1.718        266.696
allnan          405.386         27.845          3.780          1.707        268.729
d1manson commented 9 years ago

[I was about to post the following, before I saw your comment]

Ok, well, since I've dug myself a bit of a hole, here's a paper dealing with this properly: http://pcl.intel-research.net/publications/histogram-pact2014.pdf The relevant sections are pretty short - Section 2.2 basically suggests the same thing I did for when the number of bins is very small, but Section 2.3 suggests something more helpful (and fairly obvious), which is to use multiple threads; with multiple threads there are two main options.

d1manson commented 9 years ago

numba lets you do multithreading, apparently... dunno about weave: http://numba.pydata.org/numba-doc/0.15.1/examples.html#multi-threading

ml31415 commented 9 years ago

Btw., did you manage to get weave to run on windows?

ml31415 commented 9 years ago

Pushed some more changes now. Some test cases for multi-indexing are still missing; besides that, we should be pretty much good to go for making this master.