ChrisBarker-NOAA opened this issue 8 years ago
Indeed, netCDF only handles start:stop:stride indexing, because that's all the underlying library provides. I'm not sure about dask or biggus. The only way around this is to load in a whole block, and fancy index from there. In fact, since netCDF is such a necessary dependency, I would make this sort of behavior either standard or mandatory.
On Thu, Jan 21, 2016 at 2:22 PM, Chris Barker notifications@github.com wrote:
So some degree of lazy loading is desirable, even necessary -- you can get some BIG data sets in a file or files....
So we were moving happily along with an asarraylike() function in pyugrid:
https://github.com/pyugrid/pyugrid/blob/master/pyugrid/util.py
with the intent to expand the tests as needed. So far, only testing with netcdf variables as alternate arrays, but the idea was that one could plug in dask, or biggus, or ??? (xray?)
But we've hit a snag. It turns out that netCDF Variables have different semantics than numpy arrays for "fancy indexing":
""" However, that there are some differences between NumPy and netCDF variable slicing rules. Slices behave as usual, being specified as a start:stop:step triplet. Using a scalar integer index i takes the ith element and reduces the rank of the output array by one. Boolean array and integer sequence indexing behaves differently for netCDF variables than for numpy arrays. Only 1-d boolean arrays and integer sequences are allowed, and these indices work independently along each dimension (similar to the way vector subscripts work in fortran). This means that
temp[0, 0, [0,1,2,3], [0,1,2,3]] returns an array of shape (4,4) when slicing a netCDF variable, but for a numpy array it returns an array of shape (4,). Similarly, a netCDF variable of shape (2,3,4,5) indexed with [0, array([True, False, True]), array([False, True, True, True]), :] would return a (2, 3, 5) array. In NumPy, this would raise an error since it would be equivalent to [0, [0,1], [1,2,3], :]. While this behaviour can cause some confusion for those used to NumPy's 'fancy indexing' rules, it provides a very powerful way to extract data from multidimensional netCDF variables by using logical operations on the dimension arrays to create slices. """
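To make the quoted difference concrete, here is a small in-memory sketch. The temp array is made up (not a real dataset), and np.ix_ is used to emulate the netCDF-style orthogonal behavior on a plain numpy array:

```python
import numpy as np

# Hypothetical 4-D field, loosely shaped like the temp variable in the quote.
temp = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

# numpy "fancy" indexing pairs the index sequences element-wise:
pointwise = temp[0, 0, [0, 1, 2, 3], [0, 1, 2, 3]]
print(pointwise.shape)  # (4,)

# netCDF-style orthogonal indexing takes the outer product of the index
# sequences instead; np.ix_ emulates that on an in-memory array:
orthogonal = temp[0, 0][np.ix_([0, 1, 2, 3], [0, 1, 2, 3])]
print(orthogonal.shape)  # (4, 4)
```

The diagonal of the orthogonal result is exactly the element-wise result, which is one way to see how the two rules relate.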
In writing interpolation code, etc, we're needing such fancy indexing. So what to do????
we're moving forward with "bring the slices you need at the moment into memory as numpy arrays" -- so for a 4-d dataset, we probably only need two time slices (for interpolating in time), but may still need all the lat-lon and depth slices for a full 3-d field. I'm thinking that memory is cheap and huge these days, so we're OK. And probably want it all in memory anyway for performance.
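A minimal sketch of that "bring the slices you need into memory" strategy, with all names and shapes hypothetical (var stands in for a lazy netCDF-style variable):

```python
import numpy as np

# `var` is a stand-in for a lazy (time, depth, node) variable; in real use
# this would be a netCDF4 Variable, and the slice below would be the read.
var = np.random.rand(10, 5, 100)

t0, t1 = 3, 4                        # the two time steps bracketing the target time
block = np.asarray(var[t0:t1 + 1])   # one orthogonal start:stop read: two full fields

# Plain numpy fancy indexing now works on the in-memory block, e.g. grabbing
# the corner nodes of many cells at once (indices made up for illustration):
nodes = np.array([[2, 7, 11], [5, 9, 13]])
vals = block[:, :, nodes]            # shape (2, 5, 2, 3)
print(vals.shape)
```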
-- but other thoughts?
what do dask and biggus provide for fancy indexing?
pinging for thoughts: @ocefpaf https://github.com/ocefpaf @rsignell-usgs https://github.com/rsignell-usgs @hetland https://github.com/hetland @rhattersley https://github.com/rhattersley @cbcunc https://github.com/cbcunc @jay-hennen https://github.com/jay-hennen
Prof. Rob Hetland Texas A&M Univ. – Dept. of Oceanography http://pong.tamu.edu/~rob
So we were moving happily along with an asarraylike() function in pyugrid:
Nice! If you choose dask or biggus they will fit into that model nicely. I am experimenting with dask in pysgrid with some degree of success. More on that soon...
what do dask and biggus provide for fancy indexing?
Dask seems to support them, but I am not sure if that will trigger the download of more data than the necessary to compute the slice. (See http://dask.pydata.org/en/latest/array-overview.html#scope.)
Biggus has the biggus.OrthoArrayAdapter (netCDF4-like orthogonal index, no fancy numpy) and biggus.NumpyArrayAdapter (fancy numpy).
In writing interpolation code, etc, we're needing such fancy indexing. So what to do????
Is that interpolation scheme already in the pyugrid library? (Just curious.)
I'm thinking that memory is cheap and huge these days, so we're OK. And probably want it all in memory anyway for performance.
I believe that the main problem is not loading data into memory, but downloading that data. Even if you can afford the download, it is still nice to make things lazy and load the data payload only at interpolation time. In sci-wms, for example, we are having performance issues when plotting pysgrid layers because of excessive (and unnecessary) data download. If we can delay all the computations until we have only the right "keys" (netCDF4-python-like slices) to access the variables, we will have better performance everywhere.
(Remember that this is coming from someone on the other side of the Americas with a crap internet connection. So I do care a lot about this :grimacing:)
Biggus has the biggus.OrthoArrayAdapter (netCDF4-like orthogonal index, no fancy numpy) and biggus.NumpyArrayAdapter (fancy numpy).
nice -- will biggus wrap around a netcdf variable, so we could just use it on top of netcdf and get that functionality?
In writing interpolation code, etc, we're needing such fancy indexing. So what to do????
Is that interpolation scheme already in the pyugrid library? (Just curious.)
@jay-hennen -- is that code up on GitHub anywhere yet?
We are trying to untangle what goes where, what can be shared between py_ugrid and py_sgrid, and where that shared code should go....
I'm thinking that memory is cheap and huge these days, so we're OK. And probably want it all in memory anyway for performance.
I believe that the main problem is not loading data into memory, but the downloading that data. Even if you can afford the download it is still nice to make things lazy and load the data payload at the interpolation time only.
that was the idea, yes. you'd only load the slice(s) of the data you'd need, and only when it was asked for. But it would be very tricky (particularly with ugrids) to not load an entire timestep at once, anyway.
I think we'll probably do it this way, and then if there are unnecessary data accesses, then we (you?) can optimize that later ;-) Our primary use-case is the data local on disk, so we may not notice some issues.
Remember that this is coming from someone on the other side of the Americas with a crap internet connection. So I do care a lot about this :grimacing:
then you can test for us :-)
-CHB
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
nice -- will biggus wrap around a netcdf variable, so we could just use it on top of netcdf and get that functionality?
Yes. Both biggus and dask will do that.
See this example for my old pyugrid test (check the times in cells 6-8) and this one for a more elaborate example where some computations (calculating the vertical coordinate) will be lazily evaluated on the slices. Note that I could've just sliced all the variables before computing z, but by using dask I can keep computing stuff and worry about that later.
then you can test for us :-)
Will do :smile:
Is that interpolation scheme already in the pyugrid library? (Just curious.)
@ocefpaf @ChrisBarker-NOAA It's in the celltree&interpolation branch, along with a celltree based locate_faces.
To me, the indexing is a design intent question for the UVar/SGridVariable. Imagine a simple case...temperature at the rho points on a SGrid. I want to get the temperature values at four specific x,y indices (not physical coordinates) so another class can use them to interpolate for the temperature at a physical coordinate that I know lies between them all. Question is, can I ask the variable for this temperature information using numpy's fancy indexing?
If not, that doesn't make things impossible, just a little messier. If I want to get these values, I'll have to make a temporary numpy array from the UVar/SGridVariable representing temperature, and then fancy index from there, as @hetland said. From a conceptual standpoint though, it would be quite convenient to be able to think of U/SGVars in the same way as numpy arrays during value lookup, and not have the indexing rules change on you.
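A sketch of the lookup described above, on a plain numpy stand-in for the UVar/SGridVariable (all names, shapes, and indices are made up for illustration):

```python
import numpy as np

# Stand-in for temperature at the rho points of an SGrid.
temp = np.random.rand(250, 135)

xi = np.array([10, 10, 11, 11])    # x indices of the four surrounding points
eta = np.array([20, 21, 20, 21])   # y indices of the four surrounding points

# numpy fancy indexing pairs xi and eta element-wise: the four corner values.
corners = temp[xi, eta]            # shape (4,)

# The same request through orthogonal (netCDF-style) indexing returns the
# full outer product instead, which is the inconsistency being discussed:
ortho = temp[np.ix_(xi, eta)]      # shape (4, 4)
print(corners.shape, ortho.shape)
```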
what do dask and biggus provide for fancy indexing?
Are you confusing integer sequence indexing (which is supported by pretty much everything) with "fancy indexing"? Unless you have a time-varying grid I'd be surprised if you actually wanted fancy indexing. Orthogonal indexing (as for netCDF variables and biggus) is probably a better fit.
For example, if you have a variable on an unstructured grid v[t, z, i] then horizontal interpolation for a single target point over all time and vertical values might need something like v[:, :, (1196, 1201, 14081, 14088)] which can happily be done lazily. If you want to restrict the interpolation to just a few time steps then that would need v[(4, 5, 6), :, (1196, 1201, 14081, 14088)] ... this way of expressing the indexing requires orthogonal indexing and wouldn't be allowed with fancy indexing.
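A rough numpy sketch of the two requests above (v and its shape are hypothetical; np.ix_ plays the role of the orthogonal index):

```python
import numpy as np

# Hypothetical (t, z, i) variable; only the node indices come from the comment.
v = np.random.rand(10, 4, 20000)
nodes = [1196, 1201, 14081, 14088]

# A single index list along one axis: numpy and orthogonal indexing agree.
a = v[:, :, nodes]                           # shape (10, 4, 4)

# Two index lists at once: orthogonally this is the outer product, which
# numpy expresses via np.ix_ (plain numpy fancy indexing would pair them).
b = v[np.ix_([4, 5, 6], np.arange(4), nodes)]  # shape (3, 4, 4)
print(a.shape, b.shape)
```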
NB. If you're interpolating to a target grid (not just a single point) then you don't want to go through the lazy loading process for each target point. It'd be better to process a whole horizontal slice at a time.
It would be quite convenient to be able to think of U/SGVars in the same way as numpy arrays during value lookup, and not have the indexing rules change on you.
The numpy indexing rules are shifting away from fancy indexing ... https://mail.scipy.org/pipermail/numpy-discussion/2015-December/074425.html
Question is, can I ask the variable for this temperature information using numpy's fancy indexing?
@jay-hennen are you talking about this part of https://github.com/pyugrid/pyugrid/pull/111? If so, see @rhattersley comments above.
On Jan 22, 2016, at 3:30 AM, Filipe notifications@github.com wrote:
Question is, can I ask the variable for this temperature information using numpy's fancy indexing?
@jay-hennen are you talking about this part of pyugrid/pyugrid#111 (https://github.com/pyugrid/pyugrid/pull/111/files#diff-3ca3650f2458ef1efcc354106a56c5ccR481)?
I think that's the CellTree code -- in that case, you really do have to bring the whole thing into memory to build the tree.
If the grid is really huge, yes, that's an issue, and why I've been pushing for years for subsetting on the server, but that's outside of the scope of py[us] grid.
If so, see @rhattersley's comments above.
Are you confusing integer sequence indexing (which is supported by pretty much everything) with "fancy indexing"?
I don't think so -- the indexing that netcdf does is a bit confusing to me -- but not what we need.
Imagine, if you will, interpolating to a point in a cell in an unstructured grid. You need the values of the nodes of that cell. They are not contiguous in index space -- so we need fancy indexing to get them.
In a structured grid, then they generally are contiguous in index space (a small rectangular block). However, we generally need to interpolate to lots of points, so want to vectorize this, which requires grabbing data from non-contiguous chunks of the arrays.
Even if we didn't vectorize (maybe putting the loop in Cython), you wouldn't want to grab four values at a time with separate calls anyway.
I suppose the optimization here would be to compute the bounding box of the points requested, and then load up only that subset of the grid needed to cover those points.
But that wouldn't help with unstructured grids....
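For what it's worth, the bounding-box idea could look something like this (a sketch with made-up names and shapes, not actual pyugrid/pysgrid code):

```python
import numpy as np

# `grid_var` stands in for a lazy 2-D variable on a structured grid.
grid_var = np.random.rand(250, 135)

jj = np.array([40, 42, 41, 90])   # row indices the interpolation needs
ii = np.array([10, 12, 11, 60])   # column indices the interpolation needs

# Compute the bounding box of the requested indices...
j0, j1 = jj.min(), jj.max() + 1
i0, i1 = ii.min(), ii.max() + 1

# ...load just that sub-block with a single start:stop read (which netCDF
# supports), then fancy index the in-memory block with shifted indices.
block = np.asarray(grid_var[j0:j1, i0:i1])
vals = block[jj - j0, ii - i0]    # shape (4,)
print(vals.shape)
```

If the requested points are scattered across the whole domain (as they can be on an unstructured grid), the bounding box degenerates to nearly the full array, which is the caveat noted above.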
For example, if you have a variable on an unstructured grid v[t, z, i] then horizontal interpolation for a single target point over all time and vertical values might need something like v[:, :, (1196, 1201, 14081, 14088)]
This is what I call fancy indexing - and I didn't think it was supported by netcdf. Certainly not in 2-D, which we need for curvilinear grids.
-CHB
which can happily be done lazily. If you want to restrict the interpolation to just a few time steps then that would need v[(4, 5, 6), :, (1196, 1201, 14081, 14088)] ... this way of expressing the indexing requires orthogonal indexing and wouldn't be allowed with fancy indexing.
NB. If you're interpolating to a target grid (not just a single point) then you don't want to go through the lazy loading process for each target point. It'd be better to process a whole horizontal slice at a time.
It would be quite convenient to be able to think of U/SGVars in the same way as numpy arrays during value lookup, and not have the indexing rules change on you.
The numpy indexing rules are shifting away from fancy indexing ... https://mail.scipy.org/pipermail/numpy-discussion/2015-December/074425.html
@jay-hennen are you talking about this part of pyugrid/pyugrid#111? If so, see @rhattersley's comments above.
No, the situation I'm referring to isn't online anywhere yet. What @rhattersley said about integer sequence indexing is true, and for unstructured grids is the norm, which is why this topic didn't come up a month or two ago.
Now that I'm working on SGrid though, I'm trying to use multidimensional indices, and that's where things get inconsistent.
In [160]: arr
Out[160]:
array([[180, 150, 140],
[ 85, 43, 32]])
In [161]: df['u']
Out[161]:
<type 'netCDF4._netCDF4.Variable'>
float32 u(ocean_time, s_rho, eta_u, xi_u)
long_name: u-momentum component
units: meter second-1
time: ocean_time
coordinates: lon_u lat_u s_rho ocean_time
field: u-velocity, scalar, series
_FillValue: 1e+37
unlimited dimensions:
current shape = (6, 30, 250, 135)
filling off
In [162]: df['u'][1:3,0,arr[0], arr[1]]
Out[162]:
array([[[ 0.01532938, -0.10250647, -0.11607787],
[-0.00017436, -0.05592753, -0.02699218],
[ 0.00474551, 0.00429543, -0.02377137]],
[[ 0.02342061, -0.10831473, -0.11161784],
[-0.00208709, -0.08159062, -0.02756767],
[ 0.00499414, -0.00386864, -0.01764941]]], dtype=float32)
In [163]: df['u'].__array__()[1:3,0,arr[0], arr[1]]
Out[163]:
masked_array(data =
[[0.015329382382333279 -0.05592752993106842 -0.023771366104483604]
[0.023420605808496475 -0.08159062266349792 -0.017649410292506218]],
mask =
[[False False False]
[False False False]],
fill_value = 1e+37)
The second case is what I'm looking for, and what the fancy indexing provides.
The numpy indexing rules are shifting away from fancy indexing ... https://mail.scipy.org/pipermail/numpy-discussion/2015-December/074425.html
@rhattersley Is that anything official or just a proposal? There are already ways to emulate orthogonal indexing, though they're not as clean.
In [182]: df['u'].__array__()[np.ix_([1,2],[0],arr[0],arr[1])]
Out[182]:
masked_array(data =
[[[[0.015329382382333279 -0.10250646620988846 -0.11607787013053894]
[-0.0001743555476423353 -0.05592752993106842 -0.026992175728082657]
[0.004745508544147015 0.004295430611819029 -0.023771366104483604]]],
[[[0.023420605808496475 -0.10831473022699356 -0.11161784082651138]
[-0.0020870890002697706 -0.08159062266349792 -0.02756766602396965]
[0.004994135349988937 -0.003868643194437027 -0.017649410292506218]]]]
No, the situation I'm referring to isn't online anywhere yet.
OK. I was misled by your comment https://github.com/ioos/APIRUS/issues/17#issuecomment-173733544:
@ocefpaf @ChrisBarker-NOAA It's in the celltree&interpolation branch, along with a celltree based locate_faces.
Now that I'm working on SGrid though, I'm trying to use multidimensional indices, and that's where things get inconsistent.
Dask seems to have plans to support that but it raises a NotImplementedError right now. (Due to the long-term plan to move away from fancy indexing I am not sure they will ever implement it.) I don't believe biggus can do that either, but @rhattersley can comment on that.
Note that there are good reasons to move away from fancy indexing. (See http://numpy-discussion.10968.n7.nabble.com/Advanced-indexing-quot-fancy-quot-vs-orthogonal-td40046.html#a40066.)
If the code can be written without them I would suggest doing so. If not, there is no escape from loading everything into memory.
On Fri, Jan 22, 2016 at 10:10 AM, Filipe notifications@github.com wrote:
Note that there are good reasons to move away from fancy indexing. (See http://numpy-discussion.10968.n7.nabble.com/Advanced-indexing-quot-fancy-quot-vs-orthogonal-td40046.html#a40066. )
well, there are good reasons to have orthogonal indexing -- but element-wise indexing is really useful too. It all depends on the use case.
Just like hdf data chunking, etc, the most efficient way to do something depends on your access patterns.
We need to interpolate a whole bunch of points all at one time -- @rhattersley's example was a time series at a single point -- these are very different access patterns, and require different approaches.
So we'll write the code in a way that works for us -- then you all can take it from there. Hopefully with the same API....
-CHB