oceanmodeling / xarray-selafin

An xarray engine for opening Selafin files (TELEMAC)
https://pypi.org/project/xarray-selafin/
The Unlicense

Fix time selection, plan selection (by integer only), improve reading #9

Closed lucduron closed 8 months ago

lucduron commented 9 months ago
tomsail commented 9 months ago

Great feature implementation (3D to 2D)! Ping @sebourban, you might be interested in seeing this feature.

  • Fix time selection (was not working with an integer)

What exactly was not working? I had tested it using diverse integers (even -1, for instance) and didn't have problems.

lucduron commented 9 months ago

I just added a commit to enable drop_vars (variable selection).

Concerning your question, len(ds.time) was raising a TypeError: len() of unsized object.

Here is my code to debug:

import xarray as xr

from xarray_selafin_backend.xarray_backend import SelafinBackendEntrypoint

slf_in = "tests/data/r3d_tidal_flats.slf"
ds_slf = xr.open_dataset(slf_in, engine=SelafinBackendEntrypoint)
ds_slf.selafin.write("test.slf", file_format="SERAFIN")  # keep it single precision

ds_slice1 = ds_slf.isel(time=slice(0, 10))
ds_slice1.selafin.write("test_0-10.slf", file_format="SERAFIN")

ds_slice2 = ds_slf.isel(time=0)
ds_slice2.selafin.write("test_0.slf", file_format="SERAFIN")

And below is the error traceback:

Traceback (most recent call last):
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\utils.py", line 578, in __len__
    return self.shape[0]
           ~~~~~~~~~~^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\PROJETS\python\xarray-selafin\debug_LDN.py", line 14, in <module>
    ds_slice2.selafin.write("test_0.slf", file_format="SERAFIN")
  File "C:\PROJETS\python\xarray-selafin\xarray_selafin_backend\xarray_backend.py", line 288, in write
    write_serafin(filepath, ds, file_format)
  File "C:\PROJETS\python\xarray-selafin\xarray_selafin_backend\xarray_backend.py", line 51, in write_serafin
    slf_header.nb_frames = len(ds.time)
                           ^^^^^^^^^^^^
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\dataarray.py", line 728, in __len__
    return len(self.variable)
           ^^^^^^^^^^^^^^^^^^
  File "C:\softs\WinPython-64bits-3.11.5.0\python-3.11.5.amd64\Lib\site-packages\xarray\core\utils.py", line 580, in __len__
    raise TypeError("len() of unsized object")
TypeError: len() of unsized object

This is the reason why I switched to ds.time.size and did other fixes.
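The distinction is easy to reproduce with plain NumPy (illustrative sketch, not the backend's code): isel(time=0) leaves a 0-d time coordinate, and len() fails on 0-d arrays while .size works for any dimensionality:

```python
import numpy as np

# isel(time=0) leaves a 0-d (scalar) time coordinate; NumPy shows the
# same contrast between len() and .size on 0-d arrays.
scalar = np.array(1.6e5)   # 0-d, like ds.isel(time=0).time
vector = np.arange(10.0)   # 1-d, like ds.isel(time=slice(0, 10)).time

print(vector.size)  # 10
print(scalar.size)  # 1 -> .size works in both cases

try:
    len(scalar)     # len() requires at least one dimension
except TypeError as e:
    print(e)        # len() of unsized object
```

This is why swapping len(ds.time) for ds.time.size makes the write path robust to both sliced and scalar time selections.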

lucduron commented 9 months ago

A more interesting converter from 3D to 2D should be able to:

It is already implemented in PyTelTools in slf_3d_to_2d.py through pyteltools.slf.misc.VerticalMaxMinMeanCalculator.max_min_mean_in_frame.

tomsail commented 9 months ago

Hi @lucduron, I still have tests failing for dask. It seems to be the same error you mentioned in the other PR.

Also, I wanted to implement vector indexing in dask, like so:

        if isinstance(node_key, slice):
            node_indices = range(*node_key.indices(self.shape[1]))
        elif isinstance(node_key, int):
            node_indices = [node_key]
        elif isinstance(node_key, np.ndarray):
            # note: isinstance(x, np.ndarray[int]) raises TypeError at runtime,
            # so the check has to be against the plain np.ndarray type
            node_indices = node_key
        else:
            raise ValueError("node_key must be an integer, np.ndarray or slice")

Which does not seem to work. I tried using this script:

import numpy as np
import xarray as xr
from scipy.spatial import cKDTree


def closest_n_points(nodes, N, meshXY):
    # nearest-neighbour lookup of the N closest mesh nodes to each point
    tree = cKDTree(meshXY)
    dist, indexes = tree.query(nodes, range(1, N + 1))
    return indexes


def main():

    def xrSlf(slf_in):
        # read test
        ds_slf = xr.open_dataset(slf_in, engine="selafin")
        # get the mesh nodes closest to the requested coordinates
        coords = np.array([[0, 100, 200, 300, 500], [0, 100, 200, 300, 500]])
        nodes = closest_n_points(coords.T, 1, np.column_stack((ds_slf.x.values, ds_slf.y.values))).T[0]
        ds_slice3 = ds_slf.isel(node=nodes)
        Z = ds_slice3.S.isel(time=-1, plan=0).compute()
        print(Z)
        print(Z.values)

    file_2d = "tests/data/r2d_tidal_flats.slf"
    xrSlf(file_2d)


main()

Obviously, using xarray alone works well, but with dask I get this error:

<xarray.DataArray 'S' (node: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Coordinates:
    x        (node) float32 20B -6.131 -6.131 158.0 158.0 390.6
    y        (node) float32 20B 10.33 10.33 263.2 263.2 500.0
    time     float64 8B 1.6e+05
Dimensions without coordinates: node
Traceback (most recent call last):
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/chunk.py", line 420, in getitem
    result = obj[index]
             ~~~^^^^^^^
IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 46, in <module>
    main()
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 44, in main
    xrSlf(file_2d)
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 24, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/work/python/seareport_org/xarray-selafin-ld/test_ld.py", line 41, in xrSlf
    print(Z.values)
          ^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/dataarray.py", line 770, in values
    return self.variable.values
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/variable.py", line 508, in values
    return _as_array_or_item(self._data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/xarray/core/variable.py", line 306, in _as_array_or_item
    data = np.asarray(data)
           ^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/core.py", line 1700, in __array__
    x = self.compute()
        ^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tomsail/miniconda3/envs/xrld/lib/python3.11/site-packages/dask/array/chunk.py", line 422, in getitem
    raise ValueError(
ValueError: Array chunk size or shape is unknown. Possible solution with x.compute_chunk_sizes()

@pmav99 would you have any idea why?

lucduron commented 9 months ago

It seems to me that there is some confusion between lazy loading and dask. Could we implement lazy loading without dask? If so, I can add a commit to support lazy loading in both cases... But the dask support will still have to be implemented. Besides, dask support might change the way files are opened:

In order to support Dask Distributed and multiprocessing, BackendArray subclass should be serializable either with Pickle or cloudpickle. That implies that all the reference to open files should be dropped. For opening files, we therefore suggest to use the helper class provided by Xarray CachingFileManager.

Source: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#backendarray-subclassing
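To make the serializability point concrete, here is a minimal picklable sketch (a hypothetical LazyFlatArray, not the package's actual class, and without CachingFileManager): it stores only the file path instead of an open handle, so instances can be shipped to dask workers:

```python
import numpy as np


class LazyFlatArray:
    """Illustrative sketch only: a lazily-read array that keeps no open
    file handle, so instances survive pickling for dask.distributed."""

    def __init__(self, path, shape, dtype=np.float32):
        self.path = path          # picklable state: a path, not a file object
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __getitem__(self, key):
        # (Re)open on each access; a real backend would go through xarray's
        # CachingFileManager to avoid paying the open() cost every time.
        with open(self.path, "rb") as f:
            data = np.fromfile(f, dtype=self.dtype).reshape(self.shape)
        return data[key]
```

A real implementation would also read only the requested frames instead of the whole file, but the key property is the same: no open-file reference is held on the instance, so pickle round-trips work.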

tomsail commented 9 months ago

Thanks Luc. Indeed, lazy loading does not necessarily need dask. I confirm we should have lazy loading in both cases and worry about the dask implementation down the line (really not a priority now).

If you can provide lazy loading for both cases, that'll be greatly appreciated! Otherwise I can do it.

lucduron commented 9 months ago

The current implementation of SelafinBackendEntrypoint is definitely not satisfactory; it is very slow (around 45 times slower than pure PyTelTools methods). Here is an example to benchmark reading (same order of magnitude if I am right):

import time

import numpy as np
import xarray as xr

from xarray_selafin_backend.xarray_backend import Serafin, SelafinBackendEntrypoint

slf_in = "tests/data/r3d_tidal_flats.slf"
# slf_in = "r3d_last.slf"

# Reading through the xarray backend
t1 = time.perf_counter()
ds_slf = xr.open_dataset(slf_in, engine=SelafinBackendEntrypoint)
ds_slf.load()
t2 = time.perf_counter()
da = t2 - t1
print(da)  # 0.4430556000006618

# Reading directly with PyTelTools
t1 = time.perf_counter()
with Serafin.Read(slf_in, "en") as resin:
    resin.read_header()
    resin.get_time()
    header = resin.header
    values = np.empty(
        (header.nb_var, len(resin.time), header.nb_nodes_2d, header.nb_planes),
        dtype=header.np_float_type,
    )
    for iv, var in enumerate(header.var_IDs):
        for time_index, t in enumerate(resin.time):
            array = resin.read_var_in_frame(time_index, var).reshape(header.nb_nodes_2d, header.nb_planes)
            values[iv, time_index, :, :] = array
t2 = time.perf_counter()
db = t2 - t1
print(db)  # 0.009758399999554968
print("RATIO = %s" % (da / db))  # 45.40248401591115

I do not know where it comes from...

tomsail commented 9 months ago

Have you also tried on bigger files? Do you get the same ratio?

Also, I can't manage to export to netCDF / Zarr on 2D meshes:

FAILED tests/io_test.py::test_to_netcdf[2D] - IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
FAILED tests/io_test.py::test_to_selafin[2D] - IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
FAILED tests/io_test.py::test_slice[2D] - IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

EDIT: actually I had this before, and my "dirty fix" was to bypass the _raw_indexing_method() function... not ideal.
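For what it's worth, these failures look consistent with an indexing path that always assumes a 3-D (time, node, plan) layout even for 2D files; plain NumPy reproduces the same message when a 2-D array is indexed with three indices (a sketch under that assumption, not the backend's actual code):

```python
import numpy as np

# A 2D-mesh variable only has (time, node) axes; indexing it as if a
# plan axis also existed raises the IndexError seen in the test failures.
arr_2d = np.zeros((4, 6))      # (time, node)

try:
    arr_2d[0, :, 0]            # (time, node, plan)-style indexing
except IndexError as e:
    print(e)  # too many indices for array: array is 2-dimensional, but 3 were indexed
```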

tomsail commented 9 months ago

Only 2 tests failing now (slicing + netCDF export).

lucduron commented 9 months ago

SelafinLazyArray seems really buggy and slow; I tried to rewrite it but have not succeeded yet...

tomsail commented 9 months ago

Maybe the best solution for now would be to revert to the eager (i.e. normal) mode: no dask, no lazy loading.

That would be fine for me already

lucduron commented 9 months ago

I did a commit; the (uncommented) tests should pass, but it might break some features (such as 3D to 2D, for example).

tomsail commented 9 months ago

You pushed it first :) thanks. I was wondering why my push wasn't working.

I confirm all tests pass here.

tomsail commented 8 months ago

I'm experiencing the same performance on big files. Good job! May I implement the time fix?

tomsail commented 8 months ago

(changes are a bit more significant only because of the .pre-commit CI)

tomsail commented 8 months ago

I think we can merge. Let me know if you have anything else to add before.

tomsail commented 8 months ago

Fixed #10 with the last commit.