vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

2d histogram within min/max limits has border rows/column that are all zero #2337

Open abf7d opened 1 year ago

abf7d commented 1 year ago

Description I'm trying to bin a two-dimensional histogram using the df.count method. I want the histogram to be bounded by the min/max values on each axis, so that it stretches over the whole chart. I expect at least one non-zero bin in every edge row and column. The problem is that I get back histograms with multiple contiguous all-zero rows or columns on the border.

How do I generate a histogram of two columns where each edge contains the bounding min or max value for the row / column?

Here is an example of a histogram I generated that is not bounded by non-zero bins along the edges. The top, bottom, and right edges of this histogram have a lot of empty area:

image

The bin values match what is rendered in the chart.

In my code, I first get the limits:

    limits = df.limits(list(axes_val.values()), delay=True, selection=True)
    await df.execute_async()
    limits = await limits

then I get and return the bins:

    hist = df.count(
        binby=list(axes_val.values()),
        limits=limits,
        shape=num_bins,
        delay=True,
        selection=True,
    )

    await df.execute_async()
    hist = await hist

    # no non-zero bins at all: nothing to normalize
    if not (hist > 0).any():
        counts = [0, 0]

    else:
        counts = [hist[hist > 0].min(), hist.max()]
        counts = [0 if numpy.isinf(c) else c for c in counts]

        # Normalize the histogram counts
        hist = (hist - counts[0]) / (counts[1] - counts[0] + 0.001)
        hist = hist * 254 + 1
        hist[hist < 0] = 0
        hist = hist.astype("uint8")

    output = {"bins": hist.tolist(), "limits": limits, "counts": counts}

    return output

maartenbreddels commented 1 year ago

You probably have some outliers in your data. Also, in Vaex the histogram bins are half open: [min, max). A dirty way to include the last value in the last bin is to use limits=[[xmin, xmax+eps], [ymin, ymax+eps], ...] where eps=1e-10, or ideally 1e-16/(xmin-xmax). Does that make sense?

abf7d commented 1 year ago

I think I understand. Let me clarify: by half open, do you mean that, for the max value, the bins go up to but don't include the last point? So I should add the eps calculation to my max values to include the max point?

Also, should that value be (1e-16/(xmax - xmin)) or (1e-16/(xmin - xmax))?

maartenbreddels commented 1 year ago

Yes, and yes :) and yes!

abf7d commented 1 year ago

Thank you so much. I tried the formula provided and it looks like for one of my axes eps is too small. It gets rounded off. When I tried eps=1e-10 it works. Again, I appreciate you pointing me in the right direction and your quick response!
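
For later readers, here is a minimal sketch of the behavior discussed above, written in plain NumPy so it runs without Vaex. The `half_open_hist` helper is a hypothetical stand-in for half-open [min, max) binning, not Vaex's implementation, and the sample data is made up. It shows the max value dropping out of the last bin, eps=1e-10 bringing it back, and why a much smaller eps can be rounded away in float64. `np.nextafter` is an alternative that was not suggested in the thread; it picks the smallest representable step above xmax.

```python
import numpy as np

def half_open_hist(x, lo, hi, n_bins):
    """Count values into n_bins equal-width bins over the half-open range [lo, hi).

    Hypothetical helper for illustration only; it is not Vaex's code.
    """
    x = np.asarray(x, dtype=np.float64)
    inside = (x >= lo) & (x < hi)  # a value equal to hi is excluded
    idx = np.floor((x[inside] - lo) / (hi - lo) * n_bins).astype(np.int64)
    idx = np.minimum(idx, n_bins - 1)  # guard against floating-point round-up
    return np.bincount(idx, minlength=n_bins)

x = np.array([0.0, 1.0, 4.0, 6.0, 10.0])  # made-up sample data
xmin, xmax = x.min(), x.max()

# Limits taken straight from min/max: the max value (10.0) is not counted,
# so the last bin stays empty.
plain = half_open_hist(x, xmin, xmax, 4)

# Widening the upper limit by eps=1e-10 puts the max value in the last bin.
with_eps = half_open_hist(x, xmin, xmax + 1e-10, 4)

# A much smaller eps is lost to rounding: the sum is still exactly xmax.
tiny_eps_is_lost = (xmax + 1e-16) == xmax

# Alternative (an assumption, not from the thread): step to the next
# representable float above xmax, which cannot be rounded away.
with_next = half_open_hist(x, xmin, np.nextafter(xmax, np.inf), 4)

print(plain, with_eps, tiny_eps_is_lost, with_next)
```

If the empty border rows or columns span more than the single last bin, widening the upper limit will not remove them, and the outlier explanation is the more likely one.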