holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Adding median reduction #245

Open jbednar opened 8 years ago

jbednar commented 8 years ago

It would be great to have a reduction operator that computes a median. Here's a first pass:

https://gist.github.com/kidpixo/16e76ac201375e4d3fb4bf89b8a9b6ad

(Notes adapted from @jcrist:) Unfortunately, median is one of the few reductions that can't be parallelized trivially. In the code above, self.column is just a string name, not the column itself, hence the error shown. Implementing median will require custom append, combine, and finalize methods, so it will be fairly complicated. There are various parallel median algorithms findable on google, but it's not clear that they would apply in this case.
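
As a rough, purely illustrative sketch of those three stages (this does not use datashader's actual internal Reduction API; the class and method names below are hypothetical), an exact median reduction would have to store every value per pixel:

    import numpy as np

    class FullStorageMedian:
        """Illustrative only: keep every value that falls into each pixel,
        then take per-pixel medians at the end."""

        def create(self, shape):
            # One Python list per pixel -- this storage is what makes an
            # exact median expensive compared to sum/count/mean.
            agg = np.empty(shape, dtype=object)
            for idx in np.ndindex(shape):
                agg[idx] = []
            return agg

        def append(self, agg, y, x, value):
            # Called once per data point mapping into pixel (y, x).
            agg[y, x].append(value)

        def combine(self, aggs):
            # Merge partial results from parallel chunks; order doesn't matter.
            out = self.create(aggs[0].shape)
            for agg in aggs:
                for idx in np.ndindex(agg.shape):
                    out[idx].extend(agg[idx])
            return out

        def finalize(self, agg):
            # Only now can the median be computed, since it needs all values.
            result = np.full(agg.shape, np.nan)
            for idx in np.ndindex(agg.shape):
                if agg[idx]:
                    result[idx] = np.median(agg[idx])
            return result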

kidpixo commented 8 years ago

Hi, thanks for opening this issue. I'm a bad programmer, so the code is fairly hard for me to follow: I didn't have any clue about the njit decorator, but now I get it.

The heart of the matter for me is that I have highly noisy data, and my naive implementation of multi-dimensional binning (kidpixo/multibinner: a pandas-based python module to generate multidimensional bins of large datasets) takes ages. Since it is not optimized, I can pass whatever function I can imagine, at the expense of computation time. The median is a pretty solid estimator of the center of the distribution, robust to outliers and noise in the data.

Now, datashader literally blew my mind and I would really like to see this thing working.


About my naive code: I simply copied the code from the mean class, but it's hard for me to follow what happens. I thought that self.column is just a placeholder for subsequent calls to the dataframe/xarray columns, but I can't see where and how.

Correct me if I'm wrong:

Could this stackoverflow discussion, "Array of ints in numba", help to add at least an implementation of the mode? Maybe it would be better to implement that instead.

jbednar commented 8 years ago

Unfortunately, our existing reduction examples are all for operations that can be implemented incrementally, without having to store all the points that map into each pixel. E.g. a sum() operator can be implemented by simply accumulating a single value, adding each new one without ever increasing the data storage requirements. That's a big part of what makes datashader so efficient, avoiding multiple passes through the dataset or ever having to store the entire dataset in memory. Finding the median of the distribution in a pixel, on the other hand, requires access to that distribution or some approximation of it, so you'll have to take a very different approach from our current operators. @jcrist wrote our existing reductions, but neither he nor I have the bandwidth to work on writing such a function at the moment, though we agree it would be a great thing to support. So alas, for now this is something you'd have to figure out yourself (or hire someone here to do it, if you've got the money). Or maybe someone who sees this message will jump in with something!
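
To make that contrast concrete, here is a toy illustration in plain numpy (nothing datashader-specific): a sum can be folded in one value at a time with a single running number per pixel, but no fixed-size running state can give back the exact median.

    import numpy as np

    values = np.array([5.0, 1.0, 9.0, 3.0, 7.0])

    # sum(): constant per-pixel state -- one running number, updated per point.
    running_sum = 0.0
    for v in values:
        running_sum += v
    print(running_sum)        # 25.0, same as values.sum()

    # median: needs (an approximation of) the whole distribution, so the
    # values themselves have to be kept around until the end.
    print(np.median(values))  # 5.0, only computable once all values are seen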

kidpixo commented 8 years ago

Ok, give me your opinion on this trick: sort the column to be processed by the reduction beforehand.

It sounds crazy and computationally heavy, but then I get for free, for each bin of the final image:

In your opinion, is extracting a certain element from the (x, y) bin easily doable?

jbednar commented 8 years ago

Maybe I'm misunderstanding the proposal, but I don't see how sorting the column in the original data frame would help. To compute the median in each pixel, you need the full set of values that map into that pixel, regardless of any ordering in the original non-aggregated data structure. None of our current reduction operators ever keep track of the full set of values that map into that pixel, which is why they are so efficient. If you want to write one that does keep track of all datapoints that fall into a certain pixel, it will be very expensive to run but then computing the median or any other statistic at that point will be trivial. The API allows you to do that, but it's just not something we have previously implemented and so it would take some thought to put that together, particularly if we want it to be at all efficient. We would welcome a contribution of code that does that!
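
Outside of datashader, the "keep everything per pixel, then reduce" idea is easy to demonstrate with pandas (a toy sketch; the column names and grid size here are made up):

    import numpy as np
    import pandas as pd

    # Toy data: points with a value column, aggregated onto a 4x4 pixel grid.
    n = 10_000
    df = pd.DataFrame({
        'x': np.random.uniform(0, 1, n),
        'y': np.random.uniform(0, 1, n),
        'val': np.random.normal(size=n),
    })

    # Assign each point to a pixel (integer bin index along each axis).
    width = height = 4
    df['xb'] = np.clip((df['x'] * width).astype(int), 0, width - 1)
    df['yb'] = np.clip((df['y'] * height).astype(int), 0, height - 1)

    # groupby keeps every value per (yb, xb) bin, so the median is trivial --
    # but holding all values per pixel is exactly the cost datashader's
    # incremental reductions are designed to avoid.
    medians = df.groupby(['yb', 'xb'])['val'].median().unstack()
    print(medians)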

kidpixo commented 8 years ago

Correct me if I'm wrong, but:


Now, if I am not wrong, my idea would be to return the count of elements that map into each pixel in _bases and then extract the count/2 element in the _finalize method.

Here is where I need some ideas on how to do that, some trick maybe.

The 'higher order' reductions like 'mean' use simpler reductions underneath. This example is from mean:

    # Excerpt from datashader's mean reduction; it relies on numpy (np),
    # xarray (xr), and the sum/count reductions defined in the same module.
    @property
    def _bases(self):
        # mean is built from two simpler reductions: a per-pixel sum and count
        return (sum(self.column), count(self.column))

    @staticmethod
    def _finalize(bases, **kwargs):
        # divide sums by counts, ignoring divide-by-zero for empty pixels
        sums, counts = bases
        with np.errstate(divide='ignore', invalid='ignore'):
            x = sums/counts
        return xr.DataArray(x, **kwargs)

jbednar commented 8 years ago

the median of a sorted array is the element at count/2 (or (count+1)/2 for odd number of elements).

Right.

if the original data structure to be mapped is sorted, so are the data values that map into each pixel.

Well, not necessarily; if the reduction operation is parallelized, as all our current ones are, then there is no guarantee that any original order in the data frame is preserved. You can certainly implement a sequential aggregation that guarantees to preserve the ordering, but it would have lower performance during the aggregation step. Moreover, it's vastly cheaper to sort a bunch of small lists (per pixel) than one huge list of the same total length, so sorting after aggregation would actually be much faster. Still, if (a) you do enough different aggregations on the same dataframe that you don't mind doing an expensive operation once, (b) you don't have many processing cores to use during aggregation, so you don't mind it being a sequential algorithm, and (c) you provide a way to tell the reduction function that the original data frame is sorted, then such an approach could make sense.

In any case, relying on the original dataframe's ordering is only an implementation detail; it won't make the code shorter, it won't use less memory, and only in the special case outlined above would it make things faster. What's crucial no matter how you do it is to keep track of the full set of datapoints per pixel during aggregation, so that you can later extract the median. That's not hard, just not something we have an example of yet, and it will require having enough memory to store two copies of the entire dataset.
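
The point above about sorting many small per-pixel lists can be sketched in plain numpy (a toy example, not datashader code): sorting each pixel's handful of values touches the same total number of values, but each sort is over a tiny array, so the total sorting work is far less than one pre-sort of the whole column.

    import numpy as np

    rng = np.random.default_rng(0)
    n_pixels, per_pixel = 100_000, 50          # 5 million values in total

    column = rng.normal(size=n_pixels * per_pixel)
    per_pixel_values = column.reshape(n_pixels, per_pixel)

    # One big pre-sort of the whole column: roughly N log N work on 5M values.
    _ = np.sort(column)

    # Per-pixel medians over ~50 values each: the same data, but the sorting
    # (or selection) work per pixel is tiny, so the total is much cheaper.
    medians = np.median(per_pixel_values, axis=1)
    print(medians.shape)                       # (100000,) -- one median per pixel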

jbednar commented 8 years ago

BTW, I fully support us adding such an example, because having the full set of datapoints per pixel allows any conceivable reduction operation to be performed, not just a median. So even though it will be much worse in performance than our current ones, having it as an example to follow is a great idea. We just don't have anyone with free time to do it at present; we're using datashader in a lot of different projects, none of which currently requires such an approach, so it would need to be developed by someone else if it's to happen in the near future.

kidpixo commented 8 years ago

Well, not necessarily; if the reduction operation is parallelized, as all our current ones are, then there is no guarantee that any original order in the data frame is preserved.

Ok, I get it now.

I'm looking forward to an example to follow, if you or someone else has time.

Thanks a lot for your comments!

jbednar commented 7 years ago

Unfortunately, I still don't have an example to provide, but I do have a suggestion from @jcrist. He's got a new library just released on PyPI, with source at https://github.com/jcrist/crick, that provides a few useful approximate algorithms, including one for approximating an unknown distribution in a single pass. This would be very practical to use in datashader, unlike the above completely correct but very impractical approach. So what would really be great is if someone would implement both: the correct but slow approach, and the approximate but very fast and memory-efficient approach from crick, allowing people to use the approximate one where feasible and fall back to the correct one for anything crucial. Seems like a good way forward to me, but again we don't have anyone with the bandwidth to work on such a task in the near future.
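
For illustration, assuming crick's TDigest behaves roughly as in its README (update() to fold in an array of values, merge() to combine digests from parallel chunks, quantile() to query), a single-pass approximate median for the values landing in one pixel might look like this (a sketch, not datashader code):

    import numpy as np
    from crick import TDigest   # https://github.com/jcrist/crick

    # Values mapping into one pixel, arriving in chunks (e.g. parallel workers).
    chunks = [np.random.normal(size=100_000) for _ in range(4)]

    # One digest per chunk: bounded memory, a single pass over the data.
    digests = []
    for chunk in chunks:
        t = TDigest()
        t.update(chunk)          # assumed array-update method, as in the README
        digests.append(t)

    # Combine the partial digests (order-independent), then query the median.
    combined = TDigest()
    combined.merge(*digests)     # assumed merge signature
    print(combined.quantile(0.5))             # approximate median
    print(np.median(np.concatenate(chunks)))  # exact value, for comparison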

kidpixo commented 7 years ago

Hi @jbednar, great, thanks for the heads up. Maybe @jcrist will add some examples to the repo at some point.