holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org
BSD 3-Clause "New" or "Revised" License
2.69k stars 402 forks source link

Raster/Image types should report gridded shape #2895

Open ceball opened 6 years ago

ceball commented 6 years ago

I think that anywhere python's int or np.int is used as a dtype, that means on windows it'll end up as np.int32. numpy's default int type is C long (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html), which on windows is 4 bytes. If using np.int for indexing, at best this will unnecessarily restrict the sizes of stuff on windows (e.g. getting 'array size too large' or whatever), but could also lead to garbage results.

One example of the problem in hv, from https://github.com/ioam/holoviews/blob/master/holoviews/core/data/image.py#L82, on 64-bit windows:

screen shot 2018-07-26 at 10 17 30 am

We should consider if there are other places where this kind of thing could be happening, and fix them all.

jbednar commented 6 years ago

Thanks.

For this particular problem, I think this code reproduces the issue:

import numpy as np, holoviews as hv
shape = (10000, 300000)
img = hv.Image(np.ndarray(shape))
print(shape, img.data.shape, type(img.data.shape[0]), img.shape, type(img.shape[0]))

If this is called on a Mac OS X platform and run using python shape.py, the result is:

(10000, 300000) (10000, 300000) <class 'int'> (3000000000, 3) <class 'numpy.int64'>

I don't have a Windows 10 platform accessible to test, but I'm told it produces numpy.int32 and a negative shape:

(10000, 300000) (10000, 300000) <class 'int'> (-1294967296, 3) <class 'numpy.int32'>

I had thought that making this change might help force the output type to numpy.int64, but apparently not:

shape = (np.int64(10000), np.int64(300000))

But adding , dtype=np.intp to https://github.com/ioam/holoviews/blob/master/holoviews/core/data/image.py#L82 does fix the problem?

jbednar commented 6 years ago

As an aside, does anyone know why the shape is returned as a scalar 300000000 in the above case? It's clear what's happening in the code:

    def shape(cls, dataset, gridded=False):
        if gridded:  return dataset.data.shape[:2]
        else: return cls.length(dataset), len(dataset.dimensions())

    def length(cls, dataset):
        return np.product(dataset.data.shape[:2])

But what's not clear to me is why; why should an image not be returning the width x height x channels (10000, 300000, 3) as the shape?

ceball commented 6 years ago

But adding , dtype=np.intp to https://github.com/ioam/holoviews/blob/master/holoviews/core/data/image.py#L82 does fix the problem?

Yes, it fixes the problem in your reproducer above.

ceball commented 6 years ago

I had thought that making this change might help force the output type to numpy.int64, but apparently not:

shape = (np.int64(10000), np.int64(300000))

Was that maybe based on what I wrote in the original post, e.g.

>>> np.product((int(1),int(1))).dtype
dtype('int32')

>>> np.product((np.int64(1),int(1))).dtype
dtype('int64')

But the specific numpy types you supply for the shape don't make it to the product - array shape always comes back as int [note1]:

>>> shape = np.ndarray( (np.int64(1),np.int32(1)) ).shape
>>> print(type(shape[0]),type(shape[1])) 
<class 'int'> <class 'int'>

And when a python int is supplied as the first argument of np.product(), numpy takes that to be "the platform int" i.e. int32 if it's less than 2**31.


note1: I'm just talking about python 3, where there's only int (I mean python int). On python 2, I vaguely recall shape would always come back as long (python long) on win, in which case this whole problem here might not occur at all...but I don't (want to) remember.

ceball commented 6 years ago

On python 2, I vaguely recall shape would always come back as long (python long) on win, in which case this whole problem here might not occur at all...but I don't (want to) remember.

Ok, couldn't help myself...this particular problem doesn't happen with python 2.7:

(10000, 300000) (10000L, 300000L) <type 'long'> (3000000000, 3) <type 'numpy.int64'>

python 3.6:

(10000, 300000) (10000, 300000) <class 'int'> (-1294967296, 3) <class 'numpy.int32'>

(not exactly same versions of hv and np, but don't think that matters)

jbednar commented 6 years ago

Good sleuthing. That's a good reason why we wouldn't have caught this problem before; sounds like it only affects 64-bit Windows Python 3 users.

ceball commented 6 years ago

I've made #2903 for the set of problems related to grid interfaces, but I think the problem occurs in other places too, so this issue should stay open even if #2903 is merged. It could be assigned to me.

Ideally I wanted to get windows tests running for hv on appveyor, but I started that by trying to update hv's tests for pyctdev, but that's turning into a large PR affecting packaging and testing. I know from previous discussions that I need to break that down into small chunks for hv, so I decided to pause that for now and reconsider how to attack.

Meanwhile, does anyone else think numpy should be warning of the overflow in product()?

import numpy as np

shape = np.int32(2), np.iinfo(np.int32).max
print(shape)

print("no indication of overflow")
print(np.product(shape))
print("indication of overflow")
print(np.int32(shape[0]) * np.int32(shape[1]))
(2, 2147483647)
no indication of overflow
-2
indication of overflow
np2.py:9: RuntimeWarning: overflow encountered in long_scalars
  print(np.int32(shape[0]) * np.int32(shape[1]))
-2
philippjfr commented 6 years ago

But what's not clear to me is why; why should an image not be returning the width x height x channels (10000, 300000, 3) as the shape?

It's simple an API consistency issue, grid interfaces have both a gridded and a columnar representation, e.g. you can view a gridded dataset as a table, which is of course horrendously inefficient, but still valid.

jbednar commented 6 years ago

I figured it was something like that, and it makes sense that len() would provide the product of the width and height. But if someone is asking for the shape, shouldn't they be expected to deal with the fact that the shape is m x n x c? That seems like the point of requesting the shape, i.e. to get the actual shape of the data so that it can be processed appropriately. Requesting the len() implies that you are treating it as a sequence, but requesting the shape() doesn't seem to indicate a columnar or sequential assumption; just the opposite!

philippjfr commented 6 years ago

This is a method on the interface and therefore never called directly by a user. It's there for API consistency and I don't see that changing.

jlstevens commented 6 years ago

This is a method on the interface and therefore never called directly by a user.

Unless I am misunderstanding something, isn't 3000000000 being reported for img.shape?

jbednar commented 6 years ago

Right; img.shape doesn't seem like some implementation detail...

philippjfr commented 6 years ago

I forgot about the Dataset.shape property I have no objection to changing that on Raster/Image/QuadMesh types but the question is what happens if you pass a gridded dataset to something like Points or VectorField, which shape should report the flattened shape or the gridded shape?

jbednar commented 6 years ago

I can't think of a case where I would want .shape to be anything but the actual multidimensional shape; .shape isn't something I would expect to be calling on a flattened view anyway (just len()). But I may not be able to imagine all the corner cases.

philippjfr commented 6 years ago

So you're saying you want Points.shape to return different things depending on the type of the data, that doesn't seem like a good idea to me.

jbednar commented 6 years ago

Hmm. I guess I think the appropriate shape depends on the Element type, not the underlying data storage. What's clear to me is that an hv.RGB object should have a shape that's width x height x channels; it was extremely surprising to see anything else returned from .shape. As long as the hv.RGB object supports indexing over that shape (img[h][w][c]), it's not so important what the underlying storage format is; this is the semantic or logical shape of that data.

So to answer your question, if an xarray DataArray is passed to hv.Points, I guess hv.Points.shape should be the flattened product, whereas if the same DataArray is passed to hv.Image, it would be height x width, in both cases matching how we expect people to iterate over that Element.

jlstevens commented 6 years ago

Hmm. I guess I think the appropriate shape depends on the Element type, not the underlying data storage.

I agree. Shape should work based off the element semantics, not the data backend. Raster types should never have columnar data (or be purely columnar) so I would always report a multi-dimensional shape for rasters.

I have no strong opinion about other element types: in those cases it doesn't bother me much if the shape is essentially the same as the length.

philippjfr commented 6 years ago

Repurposed this issue now that the original issue is fixed.

ceball commented 6 years ago

It seems a bit confusing to me to change what this issue is about when there may be other places in hv that (implicitly) assume "int" will be int64 on all 64-bit platforms when on win64/python3 it'll be int32. I.e. I think this issue should remain open as originally titled until someone checks for other places where int32 might end up being used by mistake (and somewhere I think I suggested this issue could be assigned to me...i.e. I'm offering to do it, although not immediately). #2903 only covered shape/length reporting for grid-based interfaces, which is all I checked to start with.

At the same time, I realize you're more concerned with the more general, non-platform-specific issue that's the subject now, and that there's discussion about that here too. Since the title's already been changed, and an issue with the current title should definitely exist, shall I make a new issue saying that we should check to see if there are more cases of this win64/python3/int32 problem? (Or do you prefer to wait to see if a concrete problem shows up?)