ices-publications / SONAR-netCDF4

The SONAR-netCDF4 convention for sonar data

Gridded data #33

Open nilsolav opened 3 years ago

nilsolav commented 3 years ago

Hi,

We are working on how to grid our data from the .raw files into an ML-friendly format, and we intend to use the grid_group for that purpose. One approach would be to process and grid the data to sv(time, range, frequency). Sometimes there are ping dropouts and differences in range resolution between channels. To fit that into a regular grid, we have a few options:

1) Interpolate/average the data onto the grid of the main frequency. This alters the data, and there are indications that the networks do not respond well to this. At the same time, we would like to keep the resolution as close to the original as possible.

2) We can fill in missing pings and pad the lower-resolution frequencies with NaN. This maintains the original data (except for the NaNs in the missing pings), but the range vectors will be unaligned between frequencies. The lower-frequency data will then contain NaNs at the indices beyond its range vector.

3) Create one grid_group per transducer to allow for different time and range vectors between frequencies. This maintains more of the original data structure, but is less straightforward to feed into an ML framework.

Any suggestions and input on what the best way forward would be?

Nils Olav Handegard

lberger29 commented 3 years ago

I vote definitively for solution 1, with the ability to grid the data vertically as a function of depth rather than range, and horizontally as a function of distance travelled or number of pings. The beam group aims at collecting raw data; the grid group needs to interpolate/average raw data onto a grid. The finest resolution is indeed the resolution of the main frequency (or of the frequency with the smallest sampling interval).

gavinmacaulay commented 3 years ago

This is something I wrote earlier about storage of gridded data in sonar-netcdf4:

The existing format stores each echosounder channel (aka frequency) in separate groups under the /Sonar group. This allows for different sample rates, different ping rates, etc, for each channel. Very flexible and able to cope with all echosounder and sonar systems that I know of. A bit awkward to deal with if you want to look at channels on a common ping interval or range grid. Each group also contains the 3D position and orientation of the transducer, transducer parameters, calibration, etc.

A storage structure for multi-channel data that has potentially been re-gridded into a form with homogeneous ping times and range bins is awkward to fit into the existing format and would make the format quite complex (there would be lots of different structures for variables depending on the form of the multi-channel data). For this reason I think a completely separate group structure for homogeneous multi-channel data is necessary.

Following the structure that Wu-Jung Lee has used in echopype seems sensible. That library includes an explanation of what it finds sensible and convenient to use, based on real-world usage. This format has a 3D matrix for the backscatter with dimensions of ping_time, range_bin, and channel (aka frequency). Transducer and transceiver parameters (e.g., calibration) are then vectors, indexed on channel. A file with both the existing and gridded formats could have a group structure like this:

/Sonar/Beam_group1 (e.g., channel 1, 38 kHz)
/Sonar/Beam_group2 (e.g., channel 2, 70 kHz)
/Sonar/Beam_group3 (e.g., channel 3, 120 kHz)
/Sonar/Beam_group4, etc.
/Sonar/Beams_grid1 (e.g., gridded dataset of 38, 70, and 200 kHz, with 1 n.mile and 5 m range bins)
/Sonar/Beams_grid2 (e.g., gridded dataset of 38 and 70 kHz, with 5 n.mile and 10 m range bins)
/Sonar/Beams_grid3, etc.

The information in Beam_groupX and Beams_gridX will contain quite some duplication – this is necessary as the convention should allow for files with just Beam_groupX or files with just Beams_gridX.
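
To make the layout concrete, here is a minimal netCDF4-python sketch of what a Beams_grid group with a 3D backscatter variable and channel-indexed parameter vectors could look like; the variable names (backscatter_sv, frequency) and all sizes are assumptions for illustration, not part of the convention:

```python
# Minimal, non-normative sketch of a Beams_grid group with a 3D backscatter
# variable and per-channel parameter vectors.
from netCDF4 import Dataset
import numpy as np

with Dataset("gridded_example.nc", "w") as nc:
    grid = nc.createGroup("/Sonar/Beams_grid1")   # intermediate groups are created too
    grid.createDimension("ping_time", None)       # unlimited, grows per ping
    grid.createDimension("range", 1000)
    grid.createDimension("channel", 3)

    grid.createVariable("ping_time", "f8", ("ping_time",))
    rng = grid.createVariable("range", "f4", ("range",))
    freq = grid.createVariable("frequency", "f4", ("channel",))
    grid.createVariable("backscatter_sv", "f4",
                        ("ping_time", "range", "channel"), zlib=True)

    rng[:] = np.arange(1000) * 0.5                # 0.5 m range bins
    freq[:] = [38e3, 70e3, 200e3]                 # one entry per channel
```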

Questions to consider:

  • Should Beams_gridX allow for storage of subbeam data (e.g., the quadrants/sectors from split-beam systems, aka EK80), or just allow for storage of split-beam angles (aka EK60)? Beam_groupX allows for either.
  • Beam_groupX allows for storage of complex numbers for the backscatter. Should Beams_gridX allow this? Or just received power, or just calibrated Sv and TS?

The users of the gridded storage are looking for easy access to data in a convenient format - this should drive how it is stored. With that in mind, option 1 looks best. The same range bin across all channels/frequencies also looks to be what users want (and keep in mind that there could be multiple gridded datasets in the one file - e.g., fine, medium, and coarse). It is also perfectly fine to resample a channel to a finer range bin if one wants to keep the finer resolution of the higher frequencies.

lberger29 commented 3 years ago

Regarding the pending questions in Gavin's comment, I would not recommend storing subbeam data in Beams_gridX; I would store only calibrated Sv or TS data, which are the standard quantities for post-processing.

ketil-malde commented 3 years ago

I am probably missing something crucial, but if you modify ping times, power, or any other parameters, regridding is going to be a lossy process. For this reason, it would seem natural to first do normalization and conversions on individual pings (in the Beam_group) to standard values (Sv) as an intermediate stage. Interpolating to a grid with the desired coordinate system (depth x time or depth x distance?) can then be done easily, with the user's preferred resolution, smoothing method, and handling of missing data.

lberger29 commented 3 years ago

Minimum spatial averaging is needed for comparing backscatter data from non-colocated transducers with variable beamwidths. The BeamGroup data is still available for analysing the statistics of the raw backscatter data for a single frequency, but when dealing with multifrequency analysis, averaging the data onto the same grid at different resolutions is in my opinion the best way to move forward. Initial filtering for noise reduction on each transducer must have been completed in the BeamGroup before gridding the data.

nilsolav commented 3 years ago

I prefer to do the spatial averaging and noise filtering as the first layers in the network. CNNs can be set up to do these steps very efficiently, either through training or by setting fixed weights in the convolutions. The beam group is inconvenient as a starting point for a CNN, and a high-resolution gridded data set would be lovely (either option 1 or 3 above). The gridded group can be used for this purpose.

lberger29 commented 3 years ago

OK. Option 1 is then more flexible since it allows:

gavinmacaulay commented 3 years ago

So a 3D matrix with axes of ping, range, and channel seems agreeable? Any preferences on the order of the axes?

lberger29 commented 3 years ago

ping, beam, range or distance, beam, depth?

tomvanengeland commented 3 years ago

At the moment we use ping, range, beam. Ping is currently the only unlimited dimension, which is a restriction of the netCDF3 standard; more unlimited dimensions are possible only in netCDF4. I am not aware of strong restrictions on the position of this unlimited dimension, but it might play a role in the speed of data access...

cyrilleponcelet commented 3 years ago

As far as I know, the order of the dimensions and their type (unlimited or not) have a big impact on performance. I guess the order of dimensions is passed as-is by netCDF to HDF5, and the way HDF5 stores data is explained here: https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html

As you can see, if ping is unlimited, for example, then someone trying to read a whole ping x beam matrix for one channel will fetch data from all over the file, ping by ping.

It is not really about a preferred order for the axes; we need to think about how people will mostly read the data for their processing, and how they would like to slice the 3D matrix.

nilsolav commented 3 years ago

I do not have any experience with these formats, but Cyrille's point makes intuitive sense to me. Say I would like to make one data structure for an entire survey; then the time dimension will be long. One subset passed on to a CNN will typically use the full frequency dimension and a subset in range (depth). From that point of view the frequency samples should be co-located, then range (since you may want to list a ping), and finally time (since slicing across time for a given frequency and range is probably less useful?).

But, as I pointed out, this only makes intuitive sense to me, and that depends on whether my intuitions make any sense.

tomvanengeland commented 3 years ago

Thanks for the link, Cyrille. Then I think we are all on the same line. NetCDF does not impose restrictions in version 4, but an impact on I/O efficiency is to be expected. Apparently, the netCDF format uses a row-major storage layout, similar to C and NumPy (in the default setting at least; this can apparently be changed for NumPy). This means that the slowest-varying dimension of a 3D array comes first, and this should be the unlimited dimension if one is to be implemented. NetCDF versions <= 3.6 even required that this unlimited dimension come first; from netCDF4 onwards, multiple unlimited dimensions can be implemented.

I think that for our 3D array of Sv values one unlimited time dimension is sufficient. This enables us to append several raw files from the same survey sequentially to the same netCDF file. At the same time, this is also the dimension that will most often be used to subset the data set (a stretch of pings along a relevant transect).

The question then remains what the order should be for the other two dimensions, frequency (channel) and range (bin). This depends on how the data is most often accessed: (1) one complete channel, or (2) all channels for a part of the water-column profile. If I consider my own case, I use all the channels together in my analyses and would cut off the profiles at some depth (due to the use of the high-frequency channels). So for me the order (time, channel, bin) would allow me to obtain the largest data chunks in one operation. But I can imagine that other people prefer to focus on a subset of channels (time, bin, channel). The latter layout somehow also feels more familiar and easier to think about, but that is just my subjective feeling. What is your opinion?

NetCDF4 also has default chunking layouts and chunk-caching mechanisms that help optimize disk I/O. This somewhat alleviates the problem of non-contiguous disk I/O. It adjusts chunk sizes during file creation based on, among other things, the lengths of the fixed-size dimensions. In this way it optimizes disk I/O along all dimensions together, without preference for one dimension. These settings can be changed, but doing so will probably mean that other types of I/O perform much worse than the one being optimized for. Does anyone have some insights on this? Is there a preference within the community on the order of the last two dimensions, considering the chunking mechanisms?
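
To illustrate how the unlimited dimension, dimension order, and chunking interact, here is a small netCDF4-python sketch; the dimension sizes and chunk shape are made-up numbers that would need tuning to the dominant access pattern:

```python
# Illustrative only: ping_time is the unlimited, slowest-varying dimension, and
# the explicit chunk shape favours reading all channels and a block of range
# samples for a stretch of pings.
from netCDF4 import Dataset

with Dataset("chunking_example.nc", "w") as nc:
    nc.createDimension("ping_time", None)   # unlimited -> storage is chunked
    nc.createDimension("channel", 6)
    nc.createDimension("range", 2650)
    nc.createVariable(
        "sv", "f4", ("ping_time", "channel", "range"),
        zlib=True, chunksizes=(512, 6, 256),  # (pings, channels, range samples) per chunk
    )
```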

nilsolav commented 3 years ago

I have looked at different code snippets for gridding data, and linear interpolation seems to be used in several cases. In my view, the main requirement should be that the integrated sv remains invariant. If you use linear interpolation this requirement may not be met, e.g. when regridding from a high sample rate onto a lower one.
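
For illustration, here is a minimal NumPy sketch of an integral-preserving ("conservative") regrid along range; this is only the principle, not the CRIMAC implementation linked below:

```python
# The requirement: sum(sv * bin_width) over range is unchanged by the regrid.
import numpy as np

def conservative_regrid(sv, src_edges, dst_edges):
    """Regrid one ping of sv from source range bins to target bins,
    preserving the integral of sv over range."""
    dst = np.zeros(len(dst_edges) - 1)
    for i in range(len(dst_edges) - 1):
        lo, hi = dst_edges[i], dst_edges[i + 1]
        # overlap (in metres) of every source bin with this target bin
        overlap = np.clip(np.minimum(src_edges[1:], hi)
                          - np.maximum(src_edges[:-1], lo), 0.0, None)
        dst[i] = np.sum(sv * overlap) / (hi - lo)   # overlap-weighted mean sv
    return dst

src_edges = np.linspace(0.0, 100.0, 501)   # 0.2 m samples
dst_edges = np.linspace(0.0, 100.0, 101)   # 1.0 m bins
sv = np.random.rand(len(src_edges) - 1)
sv_coarse = conservative_regrid(sv, src_edges, dst_edges)
assert np.isclose(np.sum(sv * np.diff(src_edges)),
                  np.sum(sv_coarse * np.diff(dst_edges)))
```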

We think this is the correct solution: https://github.com/CRIMAC-WP4-Machine-learning/CRIMAC-preprocessing/blob/NOH_pyech/regrid.py

Any thoughts?

It also makes me wonder if option 3 in the original post is the preferred option, i.e. grid each frequency separately to avoid any lossy operations during regridding. This makes it more complicated for the networks, since you need a separate input layer per frequency. The convention can handle all the options, but it would be good to have some best practices/alignment if we plan on sharing data etc.

gavinmacaulay commented 3 years ago

No need to write your own 'conservation of integral' regridder - there are packages for that - for example see xESMF and https://xesmf.readthedocs.io/en/latest/other_tools.html.

A new complication is also the latest EK80 version where not all channels necessarily ping at the same time, meaning that regridding along the ping axis might be desirable.

leewujung commented 3 years ago

For the format question, as @gavinmacaulay said, in echopype we take the approach of storing the raw data as a 3D array of frequency (channel) x ping_time x range_bin (number of samples along the echo time series) for EK60 data, and for EK80 we add a 4th dimension to store data for all subchannels. The channels or pings with shorter ranges are therefore padded with NaN, as are the "missing" pings when the channels are configured in an interleaving sequence. This doesn't impact the file size much thanks to the option to compress data.
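
As a rough sketch of the padding idea (the names and sizes below are made up, not echopype's actual schema), a shorter-range channel can be reindexed onto the common range_bin axis, and the NaN padding costs little once the variable is compressed on write:

```python
import numpy as np
import xarray as xr

n_ping, n_long, n_short = 200, 4000, 1500
rng = np.arange(n_long)

ch38 = xr.DataArray(np.random.rand(n_ping, n_long),
                    dims=("ping_time", "range_bin"), coords={"range_bin": rng})
ch200 = xr.DataArray(np.random.rand(n_ping, n_short),
                     dims=("ping_time", "range_bin"), coords={"range_bin": rng[:n_short]})

# pad the short-range channel with NaN out to the full range_bin axis
ch200_padded = ch200.reindex(range_bin=rng)

cube = xr.concat([ch38, ch200_padded], dim="channel")
cube = cube.assign_coords(channel=[38_000.0, 200_000.0])

ds = xr.Dataset({"backscatter": cube})
ds.to_netcdf("padded.nc", encoding={"backscatter": {"zlib": True, "complevel": 4}})
```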

Since regridding is a lossy process, as @ketil-malde pointed out, I think it would be nice to have an intermediate group, in a form similar to the current Beam group, to store Sv, and then put uniformly regridded Sv in the Grid group(s), at whatever resolution the user chooses. Right now our approach is to keep the Sv of each channel at the same resolution as the raw data, and to calculate the actual range in metres from range_bin when interpolating/regridding.

Having multiple resolutions in the Grid group seems a great way to address the needs of different applications, and I wonder if a similar approach could be used to deal with the access-pattern question, i.e. to specify the intended use so that the storage layout can be matched to the access pattern.

nilsolav commented 3 years ago

We are making progress. The grid_group is simply an array, so all of the above will fit, and we can follow @leewujung's suggestion of a step-wise approach. I think that is an excellent idea, and it follows the thinking of other communities. From my point of view we should minimize the loss, to preserve as much information as possible between each step. But some loss may be necessary at some steps, and there is a need to standardize to allow data sharing. An implementation can of course choose to skip some of the steps.

Here's my suggestion:

Step 1. Raw data. Whatever that is.

Step 2. Beam group (lossless) with the data from step 1, but in a standard and open format (not really the topic of this issue). I think @gavinmacaulay has pretty much figured this out.

Step 3. Grid_group gridded in time and range, i.e. one group per channel. This allows for different ping rates, ping configurations (e.g. FM, cf. @gavinmacaulay's comment above), ping dropouts, and different range resolutions. Pad with NaN if the range changes (following @leewujung's approach). It is a gridded format, but no regridding is performed unless the pulse length/sample interval changes. In that case we can flag the data as lossy; otherwise it is lossless, addressing @ketil-malde's comments above. This is similar to @leewujung's approach, except that we do not align the frequencies.

Step 4. Grid_group cast/regridded to a 3D array. This is the format that @lberger29 recommends, and is a convenient format for DL. The dimension order should follow the thinking of @tomvanengeland and @cyrilleponcelet above. This is lossless if the range resolution is the same across frequencies, and it allows slicing in the frequency domain. Missing pings should be padded with NaNs. If pinging by groups is used, we use the union of the time points and fill in NaNs for the "missing" pings. This may require gridding in range, but not in time. Any gridding algorithm should be specified and should preserve the echo integral, cf. @nilsolav's post above or the "conservative" method in the library from @gavinmacaulay's post. (A code sketch of this step is given after step 5 below.)

Step 5. Grid_group regridded to any resolution. This is a lossy operation, and can be used for generating a uniform time/range grid, including that for outputs/reports. On a fine scale this is very similar to step 4, but with a uniform time scale. The gridding algorithm must be specified and should preserve echo energy. We could also consider having acoustic category as a fourth dimension, since the masks should be applied at a finer resolution before gridding, or we could set up another step (step 6) in time/depth/category. This would allow us to directly support the ICES reports.
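
Here is the rough xarray sketch of steps 4 and 5 mentioned above; the names are assumptions, the per-channel inputs are taken to be Sv in dB on possibly different time/range grids, and the coarse regrid in step 5 simply averages in the linear domain over uniform bins so that each coarse cell keeps its echo integral:

```python
import numpy as np
import xarray as xr

def step4_cast_to_3d(per_channel):
    """Step 4 sketch: cast per-channel Sv grids (dict of channel id -> DataArray
    with dims (ping_time, range)) onto the union of ping times and ranges, NaN-padded."""
    aligned = xr.align(*per_channel.values(), join="outer")   # union of coords, NaN elsewhere
    cube = xr.concat(aligned, dim="channel")
    return cube.assign_coords(channel=list(per_channel.keys()))

def step5_coarse_grid(sv_db, ping_bin=100, range_bin=10):
    """Step 5 sketch: lossy regrid onto a uniform, coarser grid; average in the
    linear domain so the integral over each coarse cell is preserved."""
    linear = 10.0 ** (sv_db / 10.0)
    coarse = linear.coarsen(ping_time=ping_bin, range=range_bin, boundary="trim").mean()
    return 10.0 * np.log10(coarse)
```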

Do you think this is a viable approach? If yes, let me know what you think should be adjusted, and I will write it up and set up a pull request to the document.

gavinmacaulay commented 3 years ago

For step 2, the existing beam_group definitions will work (we just need to store Sv instead of voltage or power - this needs a new equation type, which is trivial to add).

The rest looks fine also.

lberger29 commented 3 years ago

I also like the step wise approach.

Step 1. Current BeamGroup definition for raw data in voltage or power.

Step 2. Extended BeamGroup definition for CW Sv data (this is already the case for angles, as we can store physical arrival angles rather than electrical phase angles). What about FM data? Shall we store Sv(f) data at a user-specified range of frequencies for each channel?

Step 3. GridGroup Sv data in time and range for each channel, padding missing ranges and resampling when the sample interval changes. What about FM data?

Step 4. GridGroup Sv data for all channels with a specified dimension order, resampling in the range dimension and padding in the ping dimension. What about FM data?

Step 5. GridGroup Sv data for all frequencies of all channels with a specified dimension order. The grid can be time/range/frequency or distance/depth/frequency.

Not all steps are mandatory, and Step 5 can be computed directly from Step 2.

nilsolav commented 3 years ago

I am not sure what the best approach for FM would be. Off the top of my head:

- For step 2, I think the 4 complex sample data from FM need to be preserved.
- For step 3, I think we should also preserve as much as possible of the raw data, but cast the data onto a grid.

I also tried to write up what appears to be a converging consensus: https://github.com/nilsolav/SONAR-netCDF4/blob/NOH_griddefs/docs/crr341.adoc#the-recommended-use-of-the-sonar-netcdf4-convention-in-post-processing-software

I added this to a branch; if you could comment on it, I will revise it before I set up a pull request.

Nils Olav

lberger29 commented 3 years ago

In step 2 the data is still BeamGroup data for which backscatter_r is in Sv, and currently we cannot have both the 4 complex samples and the power/Sv match-filtered data in the same BeamGroup, nor can we have Sv at multiple frequencies from spectral analysis of the time series of a beam. We could then consider:

nilsolav commented 3 years ago

Hi,

We have "near-lossless" working code for casting the beam data into a gridded structure and writing it to Zarr or netCDF. Any regridding is "conservative" and the echo integral over range is invariant. We have also created a Docker image, so if you have Docker it is very simple to run. Just copy-paste and adjust the paths from the example in the readme:

https://github.com/CRIMAC-WP4-Machine-learning/CRIMAC-preprocessing

nilsolav commented 1 year ago

To follow up on this, and on the discussion at WGFAST in Portland:

We have written code to generate gridded data (see the previous post), but we are using a much simpler format and do not follow the grid group convention.

Below is how we currently do this. Could we try to reconcile this with the standard? I also really like @leewujung's steps in the data processing, which we should take another look at.

We are happy to change how we do this, and any comments are welcome.

sv
<xarray.Dataset>
Dimensions:            (frequency: 6, ping_time: 3052680, range: 2650)
Coordinates:
    channel_id         (frequency) <U35 dask.array<chunksize=(1,), meta=np.ndarray>
  * frequency          (frequency) float32 3.8e+04 1.8e+04 ... 2e+05 3.333e+05
    latitude           (ping_time) float64 dask.array<chunksize=(12210,), meta=np.ndarray>
    longitude          (ping_time) float64 dask.array<chunksize=(12210,), meta=np.ndarray>
  * ping_time          (ping_time) datetime64[ns] 2009-05-03T20:07:13.753999 ...
  * range              (range) float64 -0.1888 0.0 0.1888 ... 499.6 499.8 499.9
    raw_file           (ping_time) <U29 dask.array<chunksize=(12210,), meta=np.ndarray>
Data variables:
    angle_alongship    (frequency, ping_time, range) float32 dask.array<chunksize=(1, 12210, 2650), meta=np.ndarray>
    angle_athwartship  (frequency, ping_time, range) float32 dask.array<chunksize=(1, 12210, 2650), meta=np.ndarray>
    distance           (ping_time) float64 dask.array<chunksize=(12210,), meta=np.ndarray>
    heading            (ping_time) float32 dask.array<chunksize=(12210,), meta=np.ndarray>
    heave              (ping_time) float32 dask.array<chunksize=(12210,), meta=np.ndarray>
    pitch              (ping_time) float32 dask.array<chunksize=(12210,), meta=np.ndarray>
    pulse_length       (frequency) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    roll               (ping_time) float32 dask.array<chunksize=(12210,), meta=np.ndarray>
    speed              (ping_time) float64 dask.array<chunksize=(12210,), meta=np.ndarray>
    sv                 (frequency, ping_time, range) float32 dask.array<chunksize=(1, 12210, 2650), meta=np.ndarray>
    transducer_draft   (frequency, ping_time) float64 dask.array<chunksize=(1, 12210), meta=np.ndarray>
Attributes:
    commit_sha:         e618f31f097ccf1522b5e0420f5b4447d52687eb
    description:        Multi-frequency sv values from EK.
    git_revision_hash:  docker
    name:               CRIMAC-preprocessor
    pyecholab:          CRIMAC-WP4-Machine-learning/pyEcholab.git@d4f4504c278...
    time:               2023-01-25T23:32:59Z
    version:

We have also made a pixel-based mask on the same grid as the sv data.

<xarray.Dataset>
Dimensions:     (category: 4, ping_time: 3052680, range: 2650)
Coordinates:
  * category    (category) int64 -1 1 27 12
  * ping_time   (ping_time) datetime64[ns] 2009-05-03T20:07:13.753999 ... 200...
  * range       (range) float64 -0.1888 0.0 0.1888 0.3776 ... 499.6 499.8 499.9
Data variables:
    annotation  (category, ping_time, range) float32 dask.array<chunksize=(1, 40000, 2650), meta=np.ndarray>
    object      (ping_time, range) int64 dask.array<chunksize=(40000, 2650), meta=np.ndarray>
    objecttype  (ping_time, range) int64 dask.array<chunksize=(40000, 2650), meta=np.ndarray>
Attributes:
    commit_sha:   XXXXXXXX
    description:  CRIMAC-labels
    name:         CRIMAC-labels
    time:         2023-02-20T01:03:55Z
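
As a usage sketch (hypothetical file names and an illustrative threshold; the category code is just one of those listed above), the mask can be applied directly to the sv grid because the two datasets share the ping_time and range coordinates:

```python
import xarray as xr

sv_ds = xr.open_zarr("sv.zarr")        # the sv dataset shown above
labels = xr.open_zarr("labels.zarr")   # the pixel-based mask shown above

mask = labels["annotation"].sel(category=27) > 0.5   # one category, for illustration
sv_38 = sv_ds["sv"].sel(frequency=38000.0)
sv_38_masked = sv_38.where(mask)                     # NaN outside the annotated pixels
```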

nilsolav commented 1 year ago

We also have a gridded format where we try to conform with the ICES acoustic database variable names. We can then easily translate our data to the survey estimation software used in ICES.

https://www.ices.dk/data/data-portals/Pages/acoustic.aspx

https://www.ices.dk/data/Documents/Acoustic/ICES_Acoustic_data_format_description.zip

<xarray.Dataset>
Dimensions:            (SaCategory: 2, Time: 1798, ChannelDepthUpper: 99)
Coordinates: (12/14)
    ChannelDepthLower  (ChannelDepthUpper) float64 dask.array<chunksize=(99,), meta=np.ndarray>
  * ChannelDepthUpper  (ChannelDepthUpper) float64 0.0 5.0 10.0 ... 485.0 490.0
    Distance           (Time) float64 dask.array<chunksize=(1798,), meta=np.ndarray>
    Latitude           (Time) float64 dask.array<chunksize=(1798,), meta=np.ndarray>
    Latitude2          (Time) float64 dask.array<chunksize=(1798,), meta=np.ndarray>
    Longitude          (Time) float64 dask.array<chunksize=(1798,), meta=np.ndarray>
    ...                 ...
    Origin2            (Time) <U3 dask.array<chunksize=(1798,), meta=np.ndarray>
  * SaCategory         (SaCategory) int64 27 1
  * Time               (Time) datetime64[ns] 2019-05-11T00:27:41.500000 ... 2...
    Validity           (Time) <U1 dask.array<chunksize=(1798,), meta=np.ndarray>
    channel_id         <U38 ...
    frequency          float64 ...
Data variables:
    BottomDepth        (SaCategory, Time) float64 dask.array<chunksize=(2, 1798), meta=np.ndarray>
    heading            (SaCategory, Time) float32 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    heave              (SaCategory, Time) float32 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    pitch              (SaCategory, Time) float32 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    pulse_length       (SaCategory) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    raw_file           (Time) <U29 dask.array<chunksize=(1798,), meta=np.ndarray>
    roll               (SaCategory, Time) float32 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    speed              (SaCategory, Time) float64 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    transducer_draft   (SaCategory, Time) float64 dask.array<chunksize=(1, 1798), meta=np.ndarray>
    value              (SaCategory, Time, ChannelDepthUpper) float64 dask.array<chunksize=(1, 1798, 99), meta=np.ndarray>
Attributes: (12/14)
    LocalID:                      NaN
    PingAxisInterval:             0.1
    PingAxisIntervalOrigin:       start
    PingAxisIntervalType:         distance
    PingAxisIntervalUnit:         nmi
    Platform:                     NaN
    ...                           ...
    conversion_software_name:     CRIMAC-reportgeneration
    conversion_software_version:  NA
    conversion_time:              2022-10-12T11:00:05+00:00
    source_filenames_bottom:      {'algorithm': 'simple', 'commit_sha': 'ce1a...
    source_filenames_labels:      {'description': 'regriddingPaper_v1_baselin...
    source_filenames_sv:          {'commit_sha': 'XXXXXXXX', 'description': '...
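
For illustration only, and assuming that `value` holds NASC (s_A) per SaCategory, time interval, and 5 m depth channel, the depth-channel values could in principle be computed from a linear-domain sv grid like this (a sketch, not the CRIMAC-reportgeneration code):

```python
import numpy as np
import xarray as xr

def nasc_per_depth_channel(sv_linear, channel_thickness=5.0):
    """sv_linear: DataArray (ping_time, range) of linear s_v (m^-1), range in metres.
    Returns s_A (m^2 nmi^-2) per depth channel: 4*pi*1852**2 * integral of s_v dz."""
    dz = float(sv_linear["range"].diff("range").median())          # sample spacing
    edges = np.arange(0.0, float(sv_linear["range"].max()) + channel_thickness,
                      channel_thickness)
    per_sample = 4.0 * np.pi * 1852.0 ** 2 * sv_linear * dz        # s_A contribution per sample
    return per_sample.groupby_bins("range", edges).sum(dim="range")
```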

leewujung commented 1 year ago

Pinging @emiliom on this thread so that he's in the loop!