Great feature implementation (3D to 2D)! Ping @sebourban, you might be interested in seeing this feature.
- Fix time selection (was not working with an integer)
What exactly was not working? I had tested it with various integers (even -1, for instance) and didn't have any problems.
I just added a commit to enable `drop_vars` (variable selection).
Concerning your question, `len(ds.time)` was raising a `TypeError: len() of unsized object`.
Here is my code to debug:
```python
import xarray as xr
from xarray_selafin_backend.xarray_backend import SelafinBackendEntrypoint

slf_in = "tests/data/r3d_tidal_flats.slf"
ds_slf = xr.open_dataset(slf_in, engine=SelafinBackendEntrypoint)
ds_slf.selafin.write("test.slf", file_format="SERAFIN")  # keep it single precision

ds_slice1 = ds_slf.isel(time=slice(0, 10))
ds_slice1.selafin.write("test_0-10.slf", file_format="SERAFIN")

ds_slice2 = ds_slf.isel(time=0)
ds_slice2.selafin.write("test_0.slf", file_format="SERAFIN")
```
And below is the error traceback:
```
Traceback (most recent call last):
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\utils.py", line 578, in __len__
    return self.shape[0]
           ~~~~~~~~~~^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\PROJETS\python\xarray-selafin\debug_LDN.py", line 14, in <module>
    ds_slice2.selafin.write("test_0.slf", file_format="SERAFIN")
  File "C:\PROJETS\python\xarray-selafin\xarray_selafin_backend\xarray_backend.py", line 288, in write
    write_serafin(filepath, ds, file_format)
  File "C:\PROJETS\python\xarray-selafin\xarray_selafin_backend\xarray_backend.py", line 51, in write_serafin
    slf_header.nb_frames = len(ds.time)
                           ^^^^^^^^^^^^
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\dataarray.py", line 728, in __len__
    return len(self.variable)
           ^^^^^^^^^^^^^^^^^^
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\utils.py", line 580, in __len__
    raise TypeError("len() of unsized object")
TypeError: len() of unsized object
```
This is the reason why I switched to `ds.time.size` and made some other fixes.
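For context, here is a minimal standalone illustration of the difference (plain xarray, nothing SELAFIN-specific):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"time": np.arange(3)})
sel = ds.isel(time=0)   # "time" becomes a 0-d (scalar) coordinate

print(sel.time.size)    # 1 -> works for both scalar and array coordinates
# len(sel.time)         # raises TypeError: len() of unsized object
```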
A more interesting 3D to 2D converter should be able to compute `BOTTOM`, `FREE SURFACE` and `WATER DEPTH` from `ELEVATION Z`.
This is already implemented in PyTelTools in `slf_3d_to_2d.py` through `pyteltools.slf.misc.VerticalMaxMinMeanCalculator.max_min_mean_in_frame`.
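As a rough sketch of what I mean (variable and dimension names here are only assumptions; the actual logic in PyTelTools' `max_min_mean_in_frame` handles more cases):

```python
import numpy as np
import xarray as xr

# Toy 3D dataset: "Z" stands in for ELEVATION Z, "plan" ordered bottom -> surface.
ds = xr.Dataset(
    {"Z": (("time", "node", "plan"), np.random.rand(2, 4, 3).cumsum(axis=-1))}
)

bottom = ds.Z.isel(plan=0)            # BOTTOM: elevation of the lowest plane
free_surface = ds.Z.isel(plan=-1)     # FREE SURFACE: elevation of the highest plane
water_depth = free_surface - bottom   # WATER DEPTH
```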
Hi @lucduron, I still have tests failing for dask. It seems to be the same error you mentioned in the other PR.
Also, I wanted to implement vector indexing in dask, like so:
```python
if isinstance(node_key, slice):
    node_indices = range(*node_key.indices(self.shape[1]))
elif isinstance(node_key, int):
    node_indices = [node_key]
elif isinstance(node_key, np.ndarray):
    node_indices = node_key
else:
    raise ValueError("node_key must be an integer, np.ndarray of ints or slice")
```
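For what it's worth, xarray can also do this key normalization for us if the backend array routes `__getitem__` through `explicit_indexing_adapter` (the generic pattern from the xarray backend docs; just a sketch, not our actual class):

```python
from xarray.backends import BackendArray
from xarray.core import indexing


class SelafinBackendArraySketch(BackendArray):  # hypothetical name
    def __getitem__(self, key):
        # xarray decomposes the key into a tuple of ints/slices/1-D integer
        # arrays that _raw_indexing_method can handle (OUTER = orthogonal indexing).
        return indexing.explicit_indexing_adapter(
            key, self.shape, indexing.IndexingSupport.OUTER, self._raw_indexing_method
        )

    def _raw_indexing_method(self, key):
        # key is a tuple with one entry per dimension; read the data accordingly.
        ...
```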
This does not seem to work, though. I tried it with this script:
```python
import xarray as xr
import time
import numpy as np
from scipy.spatial import cKDTree


def closest_n_points(nodes, N, meshXY):
    def do_kdtree(meshXY, points, N):
        mytree = cKDTree(meshXY)
        dist, indexes = mytree.query(points, range(1, N + 1))
        return indexes

    return do_kdtree(meshXY, nodes, N)


def main():
    def xrSlf(slf_in):
        # read test
        ds_slf = xr.open_dataset(slf_in, engine="selafin")
        # get coords
        coords = np.array([[0, 100, 200, 300, 500], [0, 100, 200, 300, 500]])
        nodes = closest_n_points(coords.T, 1, np.column_stack((ds_slf.x.values, ds_slf.y.values))).T[0]
        ds_slice3 = ds_slf.isel(node=nodes)
        Z = ds_slice3.S.isel(time=-1, plan=0).compute()
        print(Z)
        print(Z.values)

    file_2d = "tests/data/r2d_tidal_flats.slf"
    xrSlf(file_2d)


main()
```
Obviously, it works well with plain xarray, but with dask I get this error:
```
<xarray.DataArray 'S' (node: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Coordinates:
    x        (node) float32 20B -6.131 -6.131 158.0 158.0 390.6
    y        (node) float32 20B 10.33 10.33 263.2 263.2 500.0
    time     float64 8B 1.6e+05
Dimensions without coordinates: node
Traceback (most recent call last):
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/chunk.py", line 420, in getitem
    result = obj[index]
             ~~~^^^^^^^
IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 46, in <module>
    main()
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 44, in main
    xrSlf(file_2d)
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 24, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 41, in xrSlf
    print(Z.values)
          ^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/dataarray.py", line 770, in values
    return self.variable.values
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/variable.py", line 508, in values
    return _as_array_or_item(self._data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/variable.py", line 306, in _as_array_or_item
    data = np.asarray(data)
           ^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/core.py", line 1700, in __array__
    x = self.compute()
        ^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/chunk.py", line 422, in getitem
    raise ValueError(
ValueError: Array chunk size or shape is unknown. Possible solution with x.compute_chunk_sizes()
```
@pmav99 would you have any idea why?
It seems to me that there is some confusion between lazy loading and Dask. Could we implement lazy loading without Dask? If yes, I can add a commit to support lazy loading in both cases... but Dask support will still have to be added eventually. Besides, the Dask implementation might change the way files are opened:
In order to support Dask Distributed and multiprocessing, BackendArray subclass should be serializable either with Pickle or cloudpickle. That implies that all the reference to open files should be dropped. For opening files, we therefore suggest to use the helper class provided by Xarray CachingFileManager.
Source: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#backendarray-subclassing
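For reference, a rough sketch of that pattern (illustrative only, not the backend's current code; the reader function and offsets are made up):

```python
from xarray.backends import CachingFileManager
from xarray.backends.locks import SerializableLock

# Keep a (picklable) file manager instead of an open file handle.
manager = CachingFileManager(open, "tests/data/r3d_tidal_flats.slf", mode="rb")
lock = SerializableLock()


def read_bytes(offset, nbytes):
    # Acquire the cached file object only when data is actually needed.
    with lock, manager.acquire_context() as f:
        f.seek(offset)
        return f.read(nbytes)
```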
Thanks Luc. Indeed, lazy loading does not necessarily need Dask. I confirm we should have lazy loading in both cases and worry about the Dask implementation down the line (really not a priority now).
If you can provide lazy loading for both cases, that'll be greatly appreciated! Otherwise I can do it.
The current implementation of `SelafinBackendEntrypoint` is definitely not satisfactory: it is very slow (around 45 times slower than pure PyTelTools methods).
Here is an example to benchmark reading (the ratio is of the same order of magnitude, if I am right):
```python
import numpy as np
import xarray as xr
from xarray_selafin_backend.xarray_backend import Serafin, SelafinBackendEntrypoint
import time

slf_in = "tests/data/r3d_tidal_flats.slf"
# slf_in = "r3d_last.slf"

t1 = time.perf_counter()
ds_slf = xr.open_dataset(slf_in, engine=SelafinBackendEntrypoint)
ds_slf.load()
t2 = time.perf_counter()
da = t2 - t1
print(da)  # 0.4430556000006618

t1 = time.perf_counter()
with Serafin.Read(slf_in, 'en') as resin:
    resin.read_header()
    resin.get_time()
    header = resin.header
    values = np.empty(
        (header.nb_var, len(resin.time), header.nb_nodes_2d, header.nb_planes),
        dtype=header.np_float_type,
    )
    for iv, var in enumerate(header.var_IDs):
        for time_index, t in enumerate(resin.time):
            array = resin.read_var_in_frame(time_index, var).reshape(header.nb_nodes_2d, header.nb_planes)
            values[iv, time_index, :, :] = array
t2 = time.perf_counter()
db = t2 - t1
print(db)  # 0.009758399999554968

print("RATIO = %s" % (da / db))  # 45.40248401591115
```
I do not know where it comes from...
Have you also tried on bigger files? Do you get the same ratio?
Also, I can't manage to export to netCDF / Zarr on 2D meshes:
```
FAILED tests/io_test.py::test_to_netcdf[2D] - IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
FAILED tests/io_test.py::test_to_selafin[2D] - IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
FAILED tests/io_test.py::test_slice[2D] - IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
```
EDIT: actually I had this error before, and my "dirty fix" was to bypass the `_raw_indexing_method()` function... not ideal.
Only 2 tests are failing now (slicing + netCDF export).
`SelafinLazyArray` seems really buggy and slow; I tried to rewrite it but have not succeeded yet...
Maybe the best solution for now would be to revert back to the eager (i.e. normal) mode: no Dask, no lazy loading.
That would already be fine for me.
I made a commit; the (uncommented) tests should pass, but it might break some features (such as the 3D to 2D conversion, for example).
You pushed it first :) thanks. I was wondering why my push wasn't working.
I confirm all tests pass here.
I'm experiencing the same performance on big files. Good job! May I implement the time fix?
(The changes are a bit larger only because of the .pre-commit CI.)
I think we can merge. Let me know if you have anything else to add beforehand.
Fixed #10 with the last commit.