ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)

Lazy loading and "array-like" objects. #17

Open ChrisBarker-NOAA opened 8 years ago

ChrisBarker-NOAA commented 8 years ago

So some degree of lazy loading is desirable, even necessary -- you can get some BIG data sets in a file or files....

So we were moving happily along with an asarraylike() function in pyugrid:

https://github.com/pyugrid/pyugrid/blob/master/pyugrid/util.py

with the intent to expand the tests as needed. So far, only testing with netcdf variables as alternate arrays, but the idea was that one could plug in dask, or biggus, or ??? (xray?)

But we've hit a snag. It turns out that netCDF Variables have different semantics than numpy arrays for "fancy indexing":

""" However, that there are some differences between NumPy and netCDF variable slicing rules. Slices behave as usual, being specified as a start:stop:step triplet. Using a scalar integer index i takes the ith element and reduces the rank of the output array by one. Boolean array and integer sequence indexing behaves differently for netCDF variables than for numpy arrays. Only 1-d boolean arrays and integer sequences are allowed, and these indices work independently along each dimension (similar to the way vector subscripts work in fortran). This means that

temp[0, 0, [0,1,2,3], [0,1,2,3]] returns an array of shape (4,4) when slicing a netCDF variable, but for a numpy array it returns an array of shape (4,). Similarly, a netCDF variable of shape (2,3,4,5) indexed with [0, array([True, False, True]), array([False, True, True, True]), :] would return a (2, 3, 5) array. In NumPy, this would raise an error since it would be equivalent to [0, [0,1], [1,2,3], :]. While this behaviour can cause some confusion for those used to NumPy's 'fancy indexing' rules, it provides a very powerful way to extract data from multidimensional netCDF variables by using logical operations on the dimension arrays to create slices. """
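For concreteness, here is a minimal numpy sketch of the two behaviors the quoted docs describe (the array is illustrative; np.ix_ reproduces the orthogonal netCDF rule):

import numpy as np

temp = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)

# numpy fancy indexing pairs the index sequences element-wise -> shape (4,)
fancy = temp[0, 0, [0, 1, 2, 3], [0, 1, 2, 3]]
print(fancy.shape)  # (4,)

# netCDF-style orthogonal indexing applies each sequence independently
# along its own dimension; np.ix_ emulates that in numpy -> shape (4, 4)
ortho = temp[0, 0][np.ix_([0, 1, 2, 3], [0, 1, 2, 3])]
print(ortho.shape)  # (4, 4)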

In writing interpolation code, etc., we need this kind of fancy indexing. So what to do?

We're moving forward with "bring the slices you need at the moment into memory as numpy arrays" -- so for a 4-d dataset, we probably only need two time slices (for interpolating in time), but may still need all the lat-lon and depth slices for a full 3-d field. I'm thinking that memory is cheap and huge these days, so we're OK. And we probably want it all in memory for performance anyway.
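A rough sketch of that strategy, assuming a netCDF4 variable with dims (time, depth, node); the file name, variable name, and indices are hypothetical:

import numpy as np
import netCDF4

nc = netCDF4.Dataset("model_output.nc")  # hypothetical file
var = nc.variables["temp"]               # hypothetical variable

t0 = 5  # time index just before the target time

# Pull two full time slices into memory as a plain numpy array...
block = np.asarray(var[t0:t0 + 2, :, :])

# ...then use full numpy fancy indexing freely on the in-memory block,
# e.g. the (non-contiguous) nodes of one cell:
nodes = np.array([1196, 1201, 14081])
cell_values = block[:, :, nodes]  # shape (2, n_depth, 3)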

-- but other thoughts?

what do dask and biggus provide for fancy indexing?

pinging for thoughts: @ocefpaf @rsignell-usgs @hetland @rhattersley @cbcunc @jay-hennen

hetland commented 8 years ago

Indeed, netCDF only handles start:stop:stride indexing. That's it, because that's all that's in the library. I'm not sure about dask or biggus. The only way around this is to load in a whole block and fancy index from there. In fact, since netCDF is such a necessary dependency, I would make this sort of behavior either standard or mandatory.


ocefpaf commented 8 years ago

So we were moving happily along with an asarraylike() function in pyugrid:

Nice! If you choose dask or biggus they will fit into that model nicely. I am experimenting with dask in pysgrid with some degree of success. More on that soon...

what do dask and biggus provide for fancy indexing?

Dask seems to support them, but I am not sure if that will trigger the download of more data than necessary to compute the slice. (See http://dask.pydata.org/en/latest/array-overview.html#scope.)

Biggus has the biggus.OrthoArrayAdapter (netCDF4-like orthogonal index, no fancy numpy) and biggus.NumpyArrayAdapter (fancy numpy).
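A hedged sketch of wrapping a netCDF variable with those adapters (file and variable names are made up; check the biggus docs for the exact API):

import biggus
import netCDF4

nc = netCDF4.Dataset("model_output.nc")
var = nc.variables["temp"]

# OrthoArrayAdapter keeps netCDF-style orthogonal indexing rules:
lazy = biggus.OrthoArrayAdapter(var)
print(lazy.shape)            # metadata only -- no data read yet
chunk = lazy[0:2].ndarray()  # data is read only when .ndarray() is called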

In writing interpolation code, etc, we're needing such fancy indexing. So what to do????

Is that interpolation scheme already in the pyugrid library? (Just curious.)

I'm thinking that memory is cheap and huge these days, so we're OK. And probably want it all in memory anyway for performance.

I believe that the main problem is not loading data into memory, but downloading that data. Even if you can afford the download, it is still nice to make things lazy and load the data payload only at interpolation time. In sci-wms, for example, we are having performance issues when plotting pysgrid layers because of excessive (and unnecessary) data downloads. If we can delay all the computations until we have only the right "keys" (netCDF4-python-like slices) to access the variables, we will have better performance everywhere.

(Remember that this is coming from someone on the other side of the Americas with a crap internet connection. So I do care a lot about this :grimacing:)

ChrisBarker-NOAA commented 8 years ago

Biggus has the biggus.OrthoArrayAdapter (netCDF4-like orthogonal index, no fancy numpy) and biggus.NumpyArrayAdapter (fancy numpy).

Nice -- will biggus wrap around a netCDF variable, so we could just use it on top of netCDF and get that functionality?

In writing interpolation code, etc, we're needing such fancy indexing. So what to do????

Is that interpolation scheme already in the pyugrid library? (Just curious.)

@jay-hennen -- is that code up on GitHub anywhere yet?

We are trying to untangle what goes where, what can be shared between py_ugrid and py_sgrid, and where that shared code should go....

I'm thinking that memory is cheap and huge these days, so we're OK. And probably want it all in memory anyway for performance.

I believe that the main problem is not loading data into memory, but the downloading that data. Even if you can afford the download it is still nice to make things lazy and load the data payload at the interpolation time only.

That was the idea, yes. You'd only load the slice(s) of the data you need, and only when it was asked for. But it would be very tricky (particularly with ugrids) to avoid loading an entire timestep at once anyway.

I think we'll probably do it this way, and then if there are unnecessary data accesses, we (you?) can optimize that later ;-) Our primary use case is data local on disk, so we may not notice some issues.
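A minimal sketch of that load-on-request, whole-timestep-at-a-time idea (the class and names are hypothetical, not actual pyugrid API):

import numpy as np

class TimestepCache:
    """Wrap a netCDF-like variable with dims (time, ...) and pull whole
    time steps into memory on first access."""
    def __init__(self, var):
        self._var = var
        self._cache = {}

    def timestep(self, t):
        if t not in self._cache:
            self._cache[t] = np.asarray(self._var[t])  # one full time step
        return self._cache[t]

    def at_nodes(self, t, nodes):
        # fancy index the in-memory slice, not the netCDF variable
        return self.timestep(t)[..., nodes]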

Remember that this is coming from someone on the other side of the Americas with a crap internet connection. So I do care a lot about this :grimacing:

then you can test for us :-)

-CHB


ocefpaf commented 8 years ago

nice -- will biggus wrap around netcdf variable, so we could just use it on top of netcdf and get that functionality?

Yes. Both biggus and dask will do that.

See this example for my old pyugrid test (check the times in cells 6-8) and this one for a more elaborate example where some computations (calculating the vertical coordinate) will be lazily evaluated on the slices. Note that I could've just sliced all the variables before computing z, but by using dask I can keep computing stuff and worry about that later.
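For the record, a small sketch of that kind of lazy pipeline (variable names are ROMS-like but illustrative, not taken from the notebooks above):

import dask.array as da
import netCDF4

nc = netCDF4.Dataset("roms_output.nc")  # hypothetical file

# Wrap netCDF variables as dask arrays; nothing is read yet.
zeta = da.from_array(nc.variables["zeta"], chunks=(1, 100, 100))  # (t, eta, xi)
h = da.from_array(nc.variables["h"], chunks=(100, 100))           # (eta, xi)

# Build the computation lazily (a stand-in for the vertical-coordinate
# calculation); data is only read when .compute() is called on a slice.
total_depth = zeta + h
subset = total_depth[0, 10:20, 10:20].compute()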

then you can test for us :-)

Will do :smile:

jay-hennen commented 8 years ago

Is that interpolation scheme already in the pyugrid library? (Just curious.)

@ocefpaf @ChrisBarker-NOAA It's in the celltree&interpolation branch, along with a celltree based locate_faces.

To me, the indexing is a design intent question for the UVar/SGridVariable. Imagine a simple case...temperature at the rho points on a SGrid. I want to get the temperature values at four specific x,y indices (not physical coordinates) so another class can use them to interpolate for the temperature at a physical coordinate that I know lies between them all. Question is, can I ask the variable for this temperature information using numpy's fancy indexing?

If not, that doesn't make things impossible, just a little messier. If I want to get these values, I'll have to make a temporary numpy array from the UVar/SGridVariable representing temperature, and then fancy index from there, as @hetland said. From a conceptual standpoint, though, it would be quite convenient to be able to think of U/SGVars in the same way as numpy arrays during value lookup, and not have the indexing rules change on you.
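A small sketch of the lookup in question, with made-up index pairs, showing why the two rule sets give different answers:

import numpy as np

temp = np.random.rand(250, 135)       # stand-in for temperature at rho points
eta = np.array([180, 150, 140, 85])   # illustrative row indices
xi = np.array([43, 32, 10, 12])       # illustrative column indices

# numpy fancy indexing: one value per (eta, xi) pair -> shape (4,)
corners = temp[eta, xi]

# Under netCDF-style orthogonal rules the same request returns a (4, 4)
# block; the paired values end up on its diagonal:
block = temp[np.ix_(eta, xi)]
same_corners = np.diag(block)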

rhattersley commented 8 years ago

what do dask and biggus provide for fancy indexing?

Are you confusing integer sequence indexing (which is supported by pretty much everything) with "fancy indexing"? Unless you have a time-varying grid I'd be surprised if you actually wanted fancy indexing. Orthogonal indexing (as for netCDF variables and biggus) is probably a better fit.

For example, if you have a variable on an unstructured grid v[t, z, i] then horizontal interpolation for a single target point over all time and vertical values might need something like v[:, :, (1196, 1201, 14081, 14088)] which can happily be done lazily. If you want to restrict the interpolation to just a few time steps then that would need v[(4, 5, 6), :, (1196, 1201, 14081, 14088)] ... this way of expressing the indexing requires orthogonal indexing and wouldn't be allowed with fancy indexing.
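A sketch of that pattern against a hypothetical netCDF variable v with dims (t, z, i) (file and variable names invented for illustration):

import netCDF4

nc = netCDF4.Dataset("ugrid_output.nc")  # hypothetical file
v = nc.variables["temp"]                  # hypothetical variable

nodes = (1196, 1201, 14081, 14088)  # nodes surrounding the target point

# All times and depths at just those nodes -- one orthogonal read:
vals = v[:, :, nodes]              # shape (nt, nz, 4)

# Restricted to a few time steps, still orthogonal:
vals_sub = v[(4, 5, 6), :, nodes]  # shape (3, nz, 4)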

NB. If you're interpolating to a target grid (not just a single point) then you don't want to go through the lazy loading process for each target point. It'd be better to process a whole horizontal slice at a time.

It would be quite convenient to be able to think of U/SGVars in the same way as numpy arrays during value lookup, and not have the indexing rules change on you.

The numpy indexing rules are shifting away from fancy indexing ... https://mail.scipy.org/pipermail/numpy-discussion/2015-December/074425.html

ocefpaf commented 8 years ago

Question is, can I ask the variable for this temperature information using numpy's fancy indexing?

@jay-hennen are you talking about this part of https://github.com/pyugrid/pyugrid/pull/111? If so, see @rhattersley comments above.

ChrisBarker-NOAA commented 8 years ago


Question is, can I ask the variable for this temperature information using numpy's fancy indexing?

@jay-hennen are you talking about this part (https://github.com/pyugrid/pyugrid/pull/111/files#diff-3ca3650f2458ef1efcc354106a56c5ccR481) of pyugrid/pyugrid#111?

I think that's the CellTree code -- in that case, you really do have to bring the whole thing into memory to build the tree.

If the grid is really huge, yes, that's an issue, and why I've been pushing for years for subsetting on the server -- but that's outside the scope of py[us]grid.


ChrisBarker-NOAA commented 8 years ago

Are you confusing integer sequence indexing (which is supported by pretty much everything) with "fancy indexing"?

I don't think so -- the indexing that netCDF does is a bit confusing to me -- but it's not what we need.

Imagine, if you will, interpolating to a point in a cell in an unstructured grid. You need the values of the nodes of that cell. They are not contiguous in index space -- so we need fancy indexing to get them.

In a structured grid, they generally are contiguous in index space (a small rectangular block). However, we generally need to interpolate to lots of points, so we want to vectorize this, which requires grabbing data from non-contiguous chunks of the arrays.
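A rough sketch of that vectorized corner-gather for a structured grid already in memory (all names and numbers illustrative):

import numpy as np

field = np.random.rand(250, 135)  # one horizontal slice, in memory
i0 = np.array([10, 42, 107])      # lower-left cell index per target point
j0 = np.array([5, 88, 23])

# Fancy indexing pulls all four corners for every point at once:
c00 = field[i0, j0]
c01 = field[i0, j0 + 1]
c10 = field[i0 + 1, j0]
c11 = field[i0 + 1, j0 + 1]

# Bilinear weights (per-point fractional offsets in [0, 1]) combine them;
# scalars here for brevity, arrays of len(i0) in practice:
fx, fy = 0.3, 0.7
interp = (c00 * (1 - fx) * (1 - fy) + c10 * fx * (1 - fy)
          + c01 * (1 - fx) * fy + c11 * fx * fy)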

Even if we didn't vectorize (maybe putting the loop in Cython), you wouldn't want to grab four values at a time with separate calls anyway.

I suppose the optimization here would be to compute the bounding box of the points requested, and then load up only that subset of the grid needed to cover those points.
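Something like this hypothetical helper, which reads one contiguous block covering the requested points and then fancy indexes it locally:

import numpy as np

def load_covering_block(var, i_idx, j_idx):
    """Read the index-space bounding box of all (i, j) pairs in one
    contiguous (netCDF-friendly) slice, then fancy index in memory."""
    i_lo, i_hi = i_idx.min(), i_idx.max() + 1
    j_lo, j_hi = j_idx.min(), j_idx.max() + 1
    block = np.asarray(var[i_lo:i_hi, j_lo:j_hi])  # single slab read
    return block[i_idx - i_lo, j_idx - j_lo]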

But that wouldn't help with unstructured grids....

For example, if you have a variable on an unstructured grid v[t, z, i] then horizontal interpolation for a single target point over all time and vertical values might need something like v[:, :, (1196, 1201, 14081, 14088)]

This is what I call fancy indexing -- and I didn't think it was supported by netCDF. Certainly not in 2-D, which we need for curvilinear grids.

-CHB


jay-hennen commented 8 years ago

@jay-hennen are you talking about this part of pyugrid/pyugrid#111? If so, see @rhattersley comments above.

No, the situation I'm referring to isn't online anywhere yet. What @rhattersley said about integer sequence indexing is true, and it is the norm for unstructured grids, which is why this topic didn't come up a month or two ago.

Now that I'm working on SGrid though, I'm trying to use multidimensional indices, and that's where things get inconsistent.

In [160]: arr
Out[160]:
array([[180, 150, 140],
       [ 85,  43,  32]])

In [161]: df['u']
Out[161]:
<type 'netCDF4._netCDF4.Variable'>
float32 u(ocean_time, s_rho, eta_u, xi_u)
    long_name: u-momentum component
    units: meter second-1
    time: ocean_time
    coordinates: lon_u lat_u s_rho ocean_time
    field: u-velocity, scalar, series
    _FillValue: 1e+37
unlimited dimensions:
current shape = (6, 30, 250, 135)
filling off

In [162]: df['u'][1:3,0,arr[0], arr[1]]
Out[162]:
array([[[ 0.01532938, -0.10250647, -0.11607787],
        [-0.00017436, -0.05592753, -0.02699218],
        [ 0.00474551,  0.00429543, -0.02377137]],

       [[ 0.02342061, -0.10831473, -0.11161784],
        [-0.00208709, -0.08159062, -0.02756767],
        [ 0.00499414, -0.00386864, -0.01764941]]], dtype=float32)

In [163]: df['u'].__array__()[1:3,0,arr[0], arr[1]]
Out[163]:
masked_array(data =
 [[0.015329382382333279 -0.05592752993106842 -0.023771366104483604]
 [0.023420605808496475 -0.08159062266349792 -0.017649410292506218]],
             mask =
 [[False False False]
 [False False False]],
       fill_value = 1e+37)

The second case is what I'm looking for, and what the fancy indexing provides.

The numpy indexing rules are shifting away from fancy indexing ... https://mail.scipy.org/pipermail/numpy-discussion/2015-December/074425.html

@rhattersley Is that anything official or just a proposal? There are already ways to emulate the orthogonal indexing, though it's not as clean.

In [182]: df['u'].__array__()[np.ix_([1,2],[0],arr[0],arr[1])]
Out[182]:
masked_array(data =
 [[[[0.015329382382333279 -0.10250646620988846 -0.11607787013053894]
   [-0.0001743555476423353 -0.05592752993106842 -0.026992175728082657]
   [0.004745508544147015 0.004295430611819029 -0.023771366104483604]]],
 [[[0.023420605808496475 -0.10831473022699356 -0.11161784082651138]
   [-0.0020870890002697706 -0.08159062266349792 -0.02756766602396965]
   [0.004994135349988937 -0.003868643194437027 -0.017649410292506218]]]]

ocefpaf commented 8 years ago

No, the situation I'm referring to isn't online anywhere yet.

OK. I was misled by your https://github.com/ioos/APIRUS/issues/17#issuecomment-173733544:

@ocefpaf @ChrisBarker-NOAA It's in the celltree&interpolation branch, along with a celltree based locate_faces.

Now that I'm working on SGrid though, I'm trying to use multidimensional indices, and that's where things get inconsistent.

Dask seems to have plans to support that, but it raises a NotImplementedError right now. (Given the long-term plan to move away from fancy indexing, I am not sure they will ever implement it.) I don't believe biggus can do that either, but @rhattersley can comment on that.

Note that there are good reasons to move away from fancy indexing. (See http://numpy-discussion.10968.n7.nabble.com/Advanced-indexing-quot-fancy-quot-vs-orthogonal-td40046.html#a40066.)

If the code can be written without them, I would suggest doing so. If not, there is no escape from loading everything into memory.

ChrisBarker-NOAA commented 8 years ago


Note that there are good reasons to move away from fancy indexing. (See http://numpy-discussion.10968.n7.nabble.com/Advanced-indexing-quot-fancy-quot-vs-orthogonal-td40046.html#a40066.)

Well, there are good reasons to have orthogonal indexing -- but element-wise indexing is really useful too. It all depends on the use case.

Just like hdf data chunking, etc, the most efficient way to do something depends on your access patterns.

We need to interpolate a whole bunch of points all at one time -- @rhattersley's example was a time series at a single point -- these are very different access patterns, and they require different approaches.

So we'll write the code in a way that works for us -- then you all can optimize it later. Hopefully with the same API....

-CHB
