ices-publications / SONAR-netCDF4

The SONAR-netCDF4 convention for sonar data

Scalability, cloud support and fast access #2

Closed gavinmacaulay closed 4 years ago

gavinmacaulay commented 4 years ago

From wg_WGFAST created by nilsolav: ices-eg/wg_WGFAST#38

There is a growing need in the community to support fast access to large volumes of sonar data, including interpretation (labels or annotations). Parallel processing, cloud computing and deep learning frameworks such as PyTorch or Keras/TensorFlow all need an efficient data model on the back end.

gavinmacaulay commented 4 years ago

High-throughput processing (job scheduler) with auto-scaling + S3 + Zarr: this approach has been applied before to satellite data (netCDF, HDF5, BUFR) processing. Object storage in a bucket, or blob storage in a container, is the cheapest and most scalable storage option. The same idea is behind Pangeo: https://pangeo.io/data.html

Also, dask/xarray/zarr relies on JSON files to record the attributes and shape of the dataset, which you can also inspect from outside the library itself.

See the attached example of a Zarr datastore from a small test: range_angle_40107_0_260000.zarr.zip
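To illustrate the point about JSON metadata, here is a minimal sketch (standard library only, no zarr import) showing that a Zarr v2 store is just a directory of chunk files plus two JSON documents, `.zarray` and `.zattrs`, that any external tool can parse. The array name, shape and attributes below are made up for the example.

```python
import json
import os
import tempfile

# A Zarr v2 store is just a directory: chunk files plus JSON metadata.
# Here we write the two metadata files by hand and read them back with
# nothing but the standard library.
store = os.path.join(tempfile.mkdtemp(), "backscatter.zarr")
os.makedirs(store)

# .zarray describes shape, chunking, dtype and compressor of the array.
zarray = {
    "zarr_format": 2,
    "shape": [1000, 2600],        # pings x range samples (made-up sizes)
    "chunks": [100, 2600],        # one chunk = 100 pings
    "dtype": "<f4",
    "compressor": {"id": "zlib", "level": 1},
    "fill_value": "NaN",
    "order": "C",
    "filters": None,
}
# .zattrs holds user attributes, much like netCDF attributes.
zattrs = {"long_name": "volume backscattering strength", "units": "dB"}

with open(os.path.join(store, ".zarray"), "w") as f:
    json.dump(zarray, f)
with open(os.path.join(store, ".zattrs"), "w") as f:
    json.dump(zattrs, f)

# Any external tool can now discover the layout with a plain JSON parse.
meta = json.load(open(os.path.join(store, ".zarray")))
print(meta["shape"], meta["chunks"])   # [1000, 2600] [100, 2600]
```

This is why a Zarr store remains inspectable even without dask/xarray installed: the layout is self-describing plain text.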

gavinmacaulay commented 4 years ago

Parallel HTTP 'range requests' (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) against netCDF files stored on cloud object stores can offer performance comparable to pre-chunked formats like Zarr.

This works by requesting individual byte ranges of a netCDF file, each of which plays a similar role to a chunk in Zarr. There is some overhead involved in making that happen, but if you can do it you may still be able to use netCDF.
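The byte-range idea can be sketched locally without any network access: reading chunk k of a flat binary file via seek/read is exactly the operation an HTTP request with a `Range: bytes=start-end` header performs against an object store. File name and chunk sizes here are invented for the sketch.

```python
import os
import struct
import tempfile

# Stand-in for an object store: a flat binary file of float32 samples.
# Fetching chunk k is then just a byte-range read, which is what an HTTP
# "Range: bytes=start-end" request does against S3.
CHUNK_ELEMS = 1024          # elements per "chunk" (made-up chunking)
ELEM_SIZE = 4               # float32

path = os.path.join(tempfile.mkdtemp(), "samples.bin")
with open(path, "wb") as f:
    for i in range(4 * CHUNK_ELEMS):          # 4 chunks of data
        f.write(struct.pack("<f", float(i)))

def read_chunk(path, k):
    """Read only chunk k, not the whole file (cf. HTTP range request)."""
    start = k * CHUNK_ELEMS * ELEM_SIZE
    length = CHUNK_ELEMS * ELEM_SIZE
    with open(path, "rb") as f:
        f.seek(start)                          # byte offset = Range start
        raw = f.read(length)
    return struct.unpack(f"<{CHUNK_ELEMS}f", raw)

chunk2 = read_chunk(path, 2)
print(chunk2[0], chunk2[-1])   # 2048.0 3071.0
```

The overhead mentioned above comes from discovering where the interesting byte ranges live inside the netCDF/HDF5 container, which is what the real HDF5-in-the-cloud tooling has to solve.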

gavinmacaulay commented 4 years ago

Keep in mind that the "raw" outputs from sonars are not tensors. They are usually not aligned in frequency and time, pings may drop out, and the range vector may differ between frequencies. This was the rationale for suggesting the Gridded group, and perhaps that is a candidate for the more efficient N-d array methods? We used memory maps in Python and that worked, but I would really like to see something that is platform- and language-independent.
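The "differing range vectors" problem and the regridding step can be sketched with plain numpy interpolation: two channels sampled at different range resolutions are mapped onto one common grid, after which the data are a proper tensor. All sizes, spacings and values below are invented for the sketch, and real code would also need to handle dropped pings and time alignment.

```python
import numpy as np

# Two channels sampled on different range vectors (a common situation
# for multi-frequency echosounders) -- values here are made up.
range_38k = np.linspace(0.0, 100.0, 201)    # 0.5 m spacing
range_120k = np.linspace(0.0, 100.0, 501)   # 0.2 m spacing
sv_38k = np.random.default_rng(0).normal(-70, 5, range_38k.size)
sv_120k = np.random.default_rng(1).normal(-65, 5, range_120k.size)

# Common grid shared by all channels, as a Gridded group might define it.
common_range = np.arange(0.0, 100.0, 1.0)

def to_common_grid(r, sv):
    """Linearly interpolate one ping onto the common range grid."""
    return np.interp(common_range, r, sv)

grid = np.stack([to_common_grid(range_38k, sv_38k),
                 to_common_grid(range_120k, sv_120k)])
print(grid.shape)   # (2, 100) -- now a tensor: channel x range
```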

gavinmacaulay commented 4 years ago

This NOAA strategy is of relevance to the above: https://www.noaa.gov/media-release/noaa-finalizes-strategies-for-applying-emerging-science-and-technology

gavinmacaulay commented 4 years ago

Data standards - Scalability consideration (Saildrone perspective)

Data standards for labeled acoustic data need to support efficient computation in the cloud as the number of files and the volume of data scale exponentially. A repository of netCDF files would not enable this. Based on Saildrone's cloud-centric experience, the considerations below need to be taken into account to ensure data standards support exponential growth of acoustic data, with attention to access speed, computational efficiency, and storage volume/cost.

The problem set

- Inputs and outputs are tensors.
- Data are larger than memory.
- Computation can be parallelized.
- I/O is a bottleneck.
- Data are compressible.
- Speed matters.
- Cost matters.
- Data mean different things to different users (units and resolution).

Solution space to explore

- Scalable cloud storage (e.g. AWS S3)
- File format (e.g. Parquet + netCDF)
- Compression (e.g. HDF5 + Parquet)
- Metadata (e.g. netCDF + a metadata service)
- Chunked, parallel tensor computing framework (e.g. Dask or Spark)
- Chunked, parallel tensor storage library (e.g. Zarr)
- Upload / download / process / proof-of-concept (e.g. Jupyter notebook)
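Two of the items above, compression and chunked storage, can be sketched together with the standard library plus numpy: split an array into chunks and compress each one independently, so chunks can later be fetched, decompressed and processed in parallel. Data, chunk size and quantization are all made up for the example; real stores (Zarr, HDF5) add metadata and better codecs.

```python
import zlib
import numpy as np

# Made-up echogram: pings x range samples, quantized to whole dB
# (the quantization is also what makes the data compressible here).
rng = np.random.default_rng(0)
data = rng.integers(-90, -40, (800, 1000)).astype("f4")

CHUNK_PINGS = 100   # chunk along the ping axis (an arbitrary choice)

# Compress each chunk independently -- this is what lets Zarr-style
# stores fetch, decompress and process chunks in parallel.
chunks = {
    i: zlib.compress(data[i:i + CHUNK_PINGS].tobytes(), level=1)
    for i in range(0, data.shape[0], CHUNK_PINGS)
}

# Random access: decode only the chunk holding ping 250.
start = (250 // CHUNK_PINGS) * CHUNK_PINGS
block = np.frombuffer(zlib.decompress(chunks[start]),
                      dtype="f4").reshape(CHUNK_PINGS, 1000)
ping_250 = block[250 - start]

raw_bytes = data.nbytes
stored = sum(len(c) for c in chunks.values())
print(ping_250.shape, stored < raw_bytes)   # (1000,) True
```

The same chunk layout is what a Dask or Spark job would map workers onto, one chunk per task.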

gavinmacaulay commented 4 years ago

Here is a process suggested by a colleague from the NOAA ocean modelers team, which may be useful for sonar data:
1. develop sample pipelines for pushing/post-processing model-generated data to/at the cloud in a cloud-optimized format (Zarr);
2. deploy a Pangeo Cloud instance specifically configured for the analysis and visualization of the data;
3. develop reproducible Jupyter notebooks to operate on the data efficiently as part of the cloud workflow;
4. develop stand-alone web applications and services that utilize the same scalable infrastructure on the backend.

gavinmacaulay commented 4 years ago

To follow up on Nils Olav's comment above, multi-channel echosounders can generally only provide data in a tensor format after some processing of their 'raw' data output.

A place could be provided for this processed data in the SONAR-netCDF format (e.g., the Gridded group mentioned in Nils Olav's comment), but will this help to address the problem set?

gavinmacaulay commented 4 years ago

Here is what I need, and perhaps also what others need:
- Code that converts proprietary raw data to the SONAR-netCDF format.
- Code that converts the interpretation masks in LSSS, Echoview and any other software to interpretation masks in SONAR-netCDF. I have Matlab code that can read the LSSS masks.
- Code that reads the SONAR-netCDF (both raw data and interpretation masks), regrids it onto a common grid (set by parameters), and writes it to a cloud-friendly format that can be used efficiently by TensorFlow, PyTorch or any other machine learning framework with a Python API.

After following this discussion, I think that the interpretation masks in the gridded data should follow the grid, i.e. be less flexible than what is suggested in SONAR-netCDF. Also, there seem to be requirements on the gridded data in terms of cloud support that would suggest not using netCDF, but the convention should still apply in terms of content.
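The "masks should follow the grid" idea can be sketched in a few lines: an interpretation region exported as a label plus ping/range extents (roughly how region-based software describes it) is rasterized into a boolean array with the same shape as the gridded data, so mask and data share shape, chunking and indexing. The label, extents and grid sizes below are invented for the sketch.

```python
import numpy as np

# Gridded data: pings x range cells (sizes made up).
n_pings, n_range = 200, 100
sv = np.full((n_pings, n_range), -80.0, dtype="f4")

# An interpretation region as region-based software might export it:
# a label plus ping/range extents (values invented for the sketch).
region = {"label": "herring", "ping": (50, 120), "range_m": (20.0, 45.0)}
cell_size_m = 1.0   # assumed grid resolution

# Rasterize the region onto the grid -> a mask that "follows the grid",
# so the mask and the data share shape, chunking and indexing.
mask = np.zeros((n_pings, n_range), dtype=bool)
p0, p1 = region["ping"]
r0 = int(region["range_m"][0] / cell_size_m)
r1 = int(region["range_m"][1] / cell_size_m)
mask[p0:p1, r0:r1] = True

print(mask.shape == sv.shape, int(mask.sum()))   # True 1750
```

A grid-shaped mask like this can be chunked and stored alongside the data and fed directly to a machine learning framework as a label tensor, at the cost of the geometric flexibility that polygon-style masks retain.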

pyEcholab offers a ping-alignment function.

Any thoughts on this?

gavinmacaulay commented 4 years ago

Wu-Jung's Echopype converts raw EK60, EK80 and AZFP data to SONAR-netCDF. However, she has identified that SONAR-netCDF is not particularly cloud friendly and not immediately scalable to working in a cloud environment. We are able to run pyEcholab on AWS now; testing on large volumes of EK60 data hosted in S3 buckets will start shortly. The exported processed-data format is still being explored and will be dictated by the preference of the community. This is an interesting article on the different formats, where Zarr and N5 (the Java sibling of Zarr) are discussed along with HDF.