scikit-hep / hist

Histogramming for analysis powered by boost-histogram
https://hist.readthedocs.io
BSD 3-Clause "New" or "Revised" License
126 stars 25 forks source link

[FEATURE] faster display of 2D histograms #112

Open henryiii opened 3 years ago

henryiii commented 3 years ago

The 2D histogram repr could use an embedded bitmap, as the current display is a bit slow. https://stackoverflow.com/questions/6249664/does-svg-support-embedding-of-bitmap-images

jpivarski commented 2 years ago

I thought I would add these resources here before they get swept away in https://gitter.im/HSF/PyHEP-histogramming.

Yes, SVG can embed raster images, (see specification, and stick to the universally supported version 1.1), and the easiest and most portable way to do that is as PNG (see specification). SVG is one of my favorite languages, though I haven't used it in a long time. That's one of the reasons I'm such a fan of Inkscape.

Here's an example of generating a PNG image, base64-encoding it, and inserting it into an SVG document:

https://github.com/modelop/cassius/blob/60c87efb310d7061ce3f210e7f4ae857dfb92db7/trunk/cassius/svgdraw.py#L915-L952

One thing I remember having to worry about is vertically flipping the Y axis. At least, make sure that your tests are not symmetric.

Different browsers render small-scale pixelmaps differently: if the scale is such that the individual pixels are large, some browsers will show you sharp edges for those pixels while others will try to blur them, which looks odd at a large scale. You can't convince the browser to not try to blur them, even with the image-rendering property (see specification). Either accept the blur or fall back on vectors when the number of pixels is less than 10 or 20.

A problem with the code above is that it uses PIL to construct the image, which is yet another dependency. Another problem is that there's a double-for loop in Python to fill it. (I was young and foolish.) The conversion from array data to PNG can be done in an entirely array-oriented way with zero dependencies beyond NumPy and the Python standard library. See the first cell of:

https://github.com/jpivarski-talks/2019-04-08-picscie-numpy/blob/master/04-cupy-and-dask.ipynb

which is

def png(rgb):   # Numpy is concise enough to create PNG files from raw bytes in one screenful of code
    def chunk(tag, data):
        out = numpy.empty(4 + len(tag) + len(data) + 4, "u1")   # each chunk consists of:
        out[0:4] = numpy.array([len(data)], ">u4").view("u1")   #   4-byte data size
        out[4:8] = numpy.frombuffer(tag, "u1")                  #   4-byte tag (data type)
        out[8:8 + len(data)] = numpy.frombuffer(data, "u1")     #   the data
        crc = zlib.crc32(data, zlib.crc32(tag))                 #   CRC: cyclic redundancy check
        out[-4:] = numpy.array([crc & 0xffffffff], ">u4").view("u1")
        return out

    preamble = numpy.frombuffer(b"\x89PNG\r\n\x1a\n", "u1")     # preamble: "this is a PNG file"
    width_height = numpy.array(rgb.shape[1::-1], ">u4")         # header: size + image type (RGBA)
    header = numpy.concatenate([width_height.view("u1"), numpy.array([8, 6, 0, 0, 0], "u1")])
    headerchunk = chunk(b"IHDR", header.tostring())

    mini, maxi = rgb.min(), rgb.max()                           # normalize the value ranges
    rgb = ((255 / (maxi - mini))*(rgb - mini)).astype("u1")     # and convert to one byte per channel
    data = numpy.empty((rgb.shape[0], 1 + 4*rgb.shape[1]), "u1")
    data[:, 0], data[:, 4::4] = 0, 255                          # beginning of scanline and opacity
    data[:, 1::4], data[:, 2::4], data[:, 3::4] = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]  # R, G, B
    datachunk = chunk(b"IDAT", zlib.compress(data.tostring()))  # scanlines are zlib-compressed

    endchunk = chunk(b"IEND", b"")                              # IEND means no more chunks
    return numpy.concatenate([preamble, headerchunk, datachunk, endchunk])

(carefully crafted to fit on one screen in a talk).

From beginning to end, the workflow is

  1. interpret 2D histogram bin contents as a NumPy array
  2. copy them into four interleaved np.uint8 arrays, representing RGBA (0 is darkest, 255 is brightest)
  3. constructing the PNG involves zlib.compress and zlib.crc32, and some bytestring concatenation (or equivalently, NumPy concatenation, as in the above example)
  4. base64-encode the bytestring
  5. insert it into the SVG text
  6. web browser interprets the SVG, decoding, decompressing, and rendering the image on the screen

Every step in this process is array-oriented, and all of them are in some compiled code (including base64, just checked). If there's not enough RAM or CPU to perform these steps, then there probably isn't enough RAM or CPU to handle the histogram, either. Plotting wouldn't necessarily be the bottleneck.

jpivarski commented 2 years ago

Pardon all the Python 2isms, like StringIO.StringIO and ndarray.tostring. These things could use a little modernization.

jpivarski commented 2 years ago

Or this: https://datashader.org/

But the advantage of the above method is that it doesn't introduce any new dependencies. I've been trying to figure out what DataShader is—clever NumPy? Cython? pybind11? I'm guessing it's a compiled extension, not just clever NumPy.

Aha! It's Numba.

eduardo-rodrigues commented 2 years ago

Reminds me of the keynote we had at PyHEP 2019, see https://indico.cern.ch/event/833895/contributions/3577846/attachments/1928191/3205023/PyHEP2019_slides.pdf :-).

jpivarski commented 2 years ago

I ran into them again in another project (Pangeo): maybe Awkward Array can be used in SpatialPandas. It's looking like a good option because Awkward Arrays are iterable in Numba—it wouldn't be as good of an option if SpatialPandas or Datashader were based on C++, rather than Numba.

However, for this original issue, displaying 2D histograms in hist, I'd vote for making embedable PNGs with NumPy. Given that you already depend on NumPy, this is a zero-dependency solution, and vectorized. I don't see any reason not to do it that way.

henryiii commented 2 years ago

I very much thing we should do the embeddable PNGs, it just hasn't been done yet. :) The idea of the repr is a simple, zero dependency display. Having docs showing how to integrate with something like data shader if you have a massive histogram would be very interesting, though. :)

eduardo-rodrigues commented 2 years ago

Totally agree with your proposed solution here. I thought I would mention the talk because it is nice to see links between different communities and librairies.