Can get_reverse_indices be deleted in favour of NumPy Digitize?

DanRyanIrish commented 4 years ago

Richard has pointed out that numpy.digitize does get the reverse indices. Therefore, can we not delete the get_reverse_indices entirely?

hayesla commented 4 years ago

please lets 👍

DanRyanIrish commented 4 years ago

So throughout the code numpy.digitize is used except in a few places. numpy.digitize gives the indices of the bins in which each array element belongs, i.e. the output has the same number of elements as the array. get_reverse_indices gives 3 outputs, the first of which is equivalent to numpy.digitize.

However, there are case where we actually want the array indices to retrieve the elements from the array in a single bin, i.e. the output is a tuple of array indices with the same number of arrays as bins. This is not given by numpy.digitize and is the second output of get_reverse_indices. So get_reverse_indices is only used in the code-base where this output is needed. Therefore I still think we need the get_reverse_indices function. Although the outputs could be reduced and the function could be renamed.

rschwartz70 commented 4 years ago

Sorry, I'm missing the difference but even so wouldn't it be better for numpy.digitize to do the base work. I'm a noob. I suppose I'll have to go back to tuple IDL equivalence school. Isn't that a bit like a named structure (I'm hopeless) arrays vs tuples vs lists arrays vs structures vs structure arrays vs pointers

DanRyanIrish commented 4 years ago

Here's an example:

>>> from sunxspex.utils import utils 
>>> import numpy as np
>>> x = np.array([0.81040131, 0.51947418, 0.00271458, 0.88044985, 0.894958  ,
                  0.00730251, 0.32388895, 0.16620437, 0.61610315, 0.86217243,
                  0.56787043, 0.82788122, 0.38906945, 0.74824418, 0.5476604 ,
                  0.43311532, 0.22649135, 0.07481439, 0.54439312, 0.45224101])
>>> ri = utils.get_reverse_indices(x, 5, 0, 1)
>>> di = np.digitize(x, np.array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])) # The bin edges here are equivalent to the upper and lower limits of 0 and 1 set above in the get_reverse_indices(x, 5, 0, 1)
>>> ri[0] == di-1 # First output of get_reverse_indices equivalent to numpy digitize
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])
>>> r[1] # Second output gives inverse characterization, i.e. the indices of the array elements in each bin
(array([ 2,  5,  7, 17]),
 array([ 6, 12, 16]),
 array([ 1, 10, 14, 15, 18, 19]),
 array([ 8, 13]),
 array([ 0,  3,  4,  9, 11]))
>>> ri[2] # Final output gives bin edges derived from user-supplied upper and lower limits
array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])

DanRyanIrish commented 4 years ago

Sorry, I'm missing the difference but even so wouldn't it be better for numpy.digitize to do the base work.

I think we could reimplement get_reverse_indices to use numpy.digitize, generate from that the array element indices in each bin, and then only output that set of arrays... Although maybe it's better to return everything in case the user wants both characterizations. No need for the user to effectively run numpy.digitize twice for no reason if they do want both.

DanRyanIrish commented 4 years ago

I suppose I'll have to go back to tuple IDL equivalence school. Isn't that a bit like a named structure (I'm hopeless) arrays vs tuples vs lists arrays vs structures vs structure arrays vs pointers

There are many cases where it's a matter of preference rather than correctness as to how you package up your data (arrays, tuples of arrays, dictionaries etc.). In this case though it's a matter of using the most simple Python data packaging type, i.e. the tuple. We could return a list of arrays or something fancier. But there's no need to complicate it.

More generally though, the two fundamental data packaging types in Python are the tuple and the list. These don't have any restrictions as to the variety of data types you store within them (unlike numpy arrays). The main difference between tuples and lists are that tuples are immutable while lists are mutable. As for named structures, the most basic Python equivalent is the dictionary, which is like a list that you index with keys, not ints or slices. Although there are more sophisticated data types with more functionality and restrictions.

rschwartz70 commented 4 years ago

I do know that the simplest is best here. Tuple is better when you know what is coming I think. I thought I had seen something where digitize was all that was needed but it might have been digitize plus something. I'll look. I've built fast idl code on histogram plus reverseindices where others might use multiple loops with where's so if I can make any contribution on the Python side it will be in finding the best Python base code for this. Thank you for your patience

DanRyanIrish commented 4 years ago

I thought I had seen something where digitize was all that was needed but it might have been digitize plus something. I'll look.

In places where numpy.digitize is all that is needed, it is now used. get_reverse_indices is only used in a few places now where the array indices in each bin is needed rather than the bin index for each array element.

I've built fast idl code on histogram plus reverseindices where others might use multiple loops with where's so if I can make any contribution on the Python side it will be in finding the best Python base code for this. Thank you for your patience

Thanks so much for your input and engagement! It's challenging me to think more carefully about efficiency when coding Python.

rschwartz70 commented 4 years ago

I think I'm seeing the difference. You need to do what reverseindices( rev, i) does after histogram( array, bin = .1, rev=rev)In numpy arbitrary edges are easier but you are providing the dave fanning reverseindices function. I did something similar in my own IDL until Fanning's function. But of course I used histogram as the basic engine and by using the IDL base code I obtained the needed speed. The critical component was histogram which uses a powerful sorting alg. That was a definite chokepoint for RHESSI if not optimized. Thanks

sunpy / sunkit-spex

Can get_reverse_indices be deleted in favour of NumPy Digitize? #30