
ONC Dataset Generation Pipeline

This repository contains the pipeline to obtain and process ONC data from hydrophones. A compiled version of the dataset is available at https://ieee-dataport.org/9778. The dataset generation pipeline is divided into steps. Each step can be performed separately, but some steps depend on the output of previous ones.

Attention: It is HIGHLY recommended to have large storage available (at least 2 TB) to download the ONC WAV files.

Environment Setup

A Dockerfile is available in this repository to simplify the environment setup. Since the requirements are only Python dependencies, a virtual environment can also be used.

To install the dependencies, run the following commands:

pip install -r requirements.txt
pip install -r requirements-dev.txt

Pipeline Description

A brief pipeline description can be found below, dividing the development into 14 steps (Step 0 to Step 13):

Step 0 - Query ONC deployments

  1. Query the ONC server for the deployments of the chosen hydrophones (see the query sketch after this list);
  2. Read the following information: recording begin, recording end, latitude, longitude, depth, and location;
  3. Save the information into a .csv file.
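
A minimal sketch of this query, assuming the public ONC REST deployments endpoint and a valid API token; the helper name, endpoint parameters, and response field names are assumptions rather than the exact code used in this repository:

```python
import pandas as pd
import requests

# Assumed ONC REST endpoint for device deployments.
ONC_DEPLOYMENTS_URL = "https://data.oceannetworks.ca/api/deployments"

def query_hydrophone_deployments(token, location_code):
    """Query hydrophone deployments at one location (hypothetical helper)."""
    params = {
        "method": "get",
        "token": token,
        "deviceCategoryCode": "HYDROPHONE",
        "locationCode": location_code,
    }
    response = requests.get(ONC_DEPLOYMENTS_URL, params=params, timeout=60)
    response.raise_for_status()
    deployments = response.json()

    # Keep only the fields used by the pipeline (response key names assumed).
    rows = [
        {
            "begin": d.get("begin"),
            "end": d.get("end"),
            "latitude": d.get("lat"),
            "longitude": d.get("lon"),
            "depth": d.get("depth"),
            "location": d.get("locationCode"),
        }
        for d in deployments
    ]
    return pd.DataFrame(rows)

# Example usage (token and location code are placeholders):
# df = query_hydrophone_deployments("YOUR_ONC_TOKEN", "SGBDE")
# df.to_csv("deployments.csv", index=False)
```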

Step 1 - Download AIS files

  1. Search for AIS data from the chosen date;
  2. Download the txt files from ONC.

Step 2 - Download audio files

  1. Search for WAV data from the chosen date;
  2. Download the .wav files from ONC. WARNING: This step requires a lot of available disk space. The smallest deployment has more than 1 TB of audio data.

Step 3 - Parse AIS to JSON

This step parses the AIS messages downloaded from ONC into JSON files, filtering by message type and discarding messages without the required fields (a filtering sketch follows the list below).

  1. Find the downloaded .txt AIS files;
  2. Keep only the relevant message types: Position Report, Static and Voyage Related Data, Standard Class B Equipment Position Report, Extended Class B Equipment Position Report, and Static Data Report;
  3. Extract only a few fields from those messages: position (x and y), SOG, COG, true heading, and type and cargo codes;
  4. Save the corresponding information into a .json file.
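
A minimal sketch of the filtering, assuming the AIS sentences are already decoded into dictionaries with libais-style keys (`id`, `mmsi`, `x`, `y`, `sog`, `cog`, `true_heading`, `type_and_cargo`); the helper name and exact keys are illustrative assumptions:

```python
import json

# AIS message types kept by the pipeline: 1-3 (position reports), 5 (static and
# voyage related data), 18/19 (Class B position reports), 24 (static data report).
RELEVANT_TYPES = {1, 2, 3, 5, 18, 19, 24}
RELEVANT_FIELDS = ["mmsi", "x", "y", "sog", "cog", "true_heading", "type_and_cargo"]

def filter_decoded_messages(decoded_messages):
    """Keep only the relevant message types and fields (hypothetical helper)."""
    kept = []
    for msg in decoded_messages:
        if msg.get("id") not in RELEVANT_TYPES:
            continue
        filtered = {k: msg[k] for k in RELEVANT_FIELDS if k in msg}
        # Discard messages that carry neither position nor static information.
        if "x" not in filtered and "type_and_cargo" not in filtered:
            continue
        kept.append(filtered)
    return kept

# Example usage (file name is a placeholder):
# with open("ais_2021_01_01.json", "w") as fp:
#     json.dump(filter_decoded_messages(decoded), fp)
```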

Step 4 - Clean AIS data

  1. Read the .json files into dataframes;
  2. Propagate the 'type_and_cargo' values across each MMSI;
  3. Drop messages without positional coordinates and/or duplicates;
  4. Calculate the distance from the hydrophone to the vessel (see the haversine sketch after this list);
  5. Filter only the data that fits the chosen scenario;
  6. Save the corresponding information into a .feather file.
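
Item 4 amounts to a great-circle distance between the hydrophone and each AIS position. A minimal haversine sketch with pandas, where the column names and the 15 km scenario radius are assumptions:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def distance_to_hydrophone(df, hydro_lat, hydro_lon):
    """Add a 'distance_m' column with the haversine distance to the hydrophone.

    Assumes AIS longitude/latitude are stored in columns 'x' and 'y' (degrees).
    """
    lat1, lon1 = np.radians(df["y"].values), np.radians(df["x"].values)
    lat2, lon2 = np.radians(hydro_lat), np.radians(hydro_lon)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    df["distance_m"] = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return df

# Example usage: keep only vessels within an assumed 15 km scenario radius.
# df = distance_to_hydrophone(df, hydro_lat=49.08, hydro_lon=-123.34)
# df = df[df["distance_m"] <= 15000].reset_index(drop=True)
# df.to_feather("ais_clean.feather")
```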

Step 5 - Combine deployment AIS data

  1. Read the cleaned AIS files;
  2. Remove the vessel entries that have just one message;
  3. Dump the AIS data to a monolithic .feather file;
  4. Generate new data with linearly interpolated values to obtain finer granularity (an interpolation sketch follows this list);
  5. Combine the raw and interpolated data frames;
  6. Dump the interpolated AIS data to a monolithic .feather file.
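
A minimal sketch of the per-vessel interpolation, assuming a 'timestamp' column, a one-second grid, and the .feather paths shown (all assumptions):

```python
import pandas as pd

def interpolate_ais(df, freq="1s"):
    """Linearly interpolate AIS tracks per MMSI onto a regular time grid.

    Assumes a datetime column 'timestamp' and numeric columns such as
    'x', 'y', 'sog', and 'cog'.
    """
    interpolated = []
    for mmsi, track in df.groupby("mmsi"):
        track = track.set_index("timestamp").sort_index()
        numeric = track.select_dtypes("number")
        # Resample to the target grid and fill the gaps linearly.
        resampled = numeric.resample(freq).mean().interpolate(method="linear")
        resampled["mmsi"] = mmsi
        interpolated.append(resampled.reset_index())
    return pd.concat(interpolated, ignore_index=True)

# Example usage:
# raw = pd.read_feather("ais_clean.feather")
# combined = pd.concat([raw, interpolate_ais(raw)], ignore_index=True)
# combined.reset_index(drop=True).to_feather("ais_interpolated.feather")
```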

Step 6 - Identify scenarios

  1. Find all of the cleaned AIS files for each deployment;
  2. Find the time intervals where only one vessel is within range (see the sketch below).
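
A minimal sketch of the single-vessel check, binning the cleaned AIS data into fixed windows and keeping the windows that contain exactly one MMSI; the one-minute window and column names are assumptions:

```python
import pandas as pd

def single_vessel_windows(df, window="1min"):
    """Return the start times of windows in which exactly one vessel is in range.

    Assumes 'timestamp' (datetime) and 'mmsi' columns on rows already filtered
    to the scenario range.
    """
    counts = (
        df.set_index("timestamp")
          .groupby(pd.Grouper(freq=window))["mmsi"]
          .nunique()
    )
    return counts[counts == 1].index

# Example usage:
# windows = single_vessel_windows(pd.read_feather("ais_clean.feather"))
```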

Step 7 - Classify WAV files from range

  1. Read the .csv file to extract the timestamp periods;
  2. Search the raw WAV files folder for the correct period of time;
  3. Read the WAV files and split them into 1-minute normalized pieces of audio (as sketched after this list);
  4. Group the pieces of audio by the period of the AIS range;
  5. Save them into the correct folder.
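
A minimal sketch of the splitting and normalization, using soundfile; the segment length handling, peak normalization, and file layout are assumptions:

```python
from pathlib import Path

import numpy as np
import soundfile as sf

def split_wav(wav_path, out_dir, segment_seconds=60):
    """Split a WAV file into normalized fixed-length segments (hypothetical helper)."""
    audio, sample_rate = sf.read(wav_path)
    samples_per_segment = segment_seconds * sample_rate
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for i in range(int(len(audio) // samples_per_segment)):
        segment = audio[i * samples_per_segment:(i + 1) * samples_per_segment]
        # Peak-normalize each segment (normalization strategy assumed).
        peak = np.max(np.abs(segment))
        if peak > 0:
            segment = segment / peak
        sf.write(out_dir / f"{Path(wav_path).stem}_{i:03d}.wav", segment, sample_rate)

# Example usage (paths are placeholders):
# split_wav("raw/hydrophone_20210101T000000.wav", "segments/vessel_in_range/")
```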

Step 8 - Download CTD files

  1. Search for CTD data from the chosen date;
  2. Download from ONC.

Step 9 - Clean CTD files

  1. Select only the salinity, conductivity, temperature, pressure, and sound speed information;
  2. Save the corresponding information into a .feather file.

Step 10 - Generate the metadata for the full dataset

  1. Get the following information from each time period: label, duration, file path, sample rate, class code, date, MMSI;
  2. Also compute the average of the CTD data over each time period: salinity, conductivity, temperature, pressure, and sound speed;
  3. Normalize the CTD information (see the normalization sketch below);
  4. Save all the data into a .csv file.
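
A minimal sketch of the CTD averaging and normalization; min-max scaling and the column names are assumptions:

```python
import pandas as pd

CTD_COLUMNS = ["salinity", "conductivity", "temperature", "pressure", "sound_speed"]

def average_ctd(ctd, start, end):
    """Average the CTD measurements over one time period (assumes a 'timestamp' column)."""
    window = ctd[(ctd["timestamp"] >= start) & (ctd["timestamp"] < end)]
    return window[CTD_COLUMNS].mean()

def normalize_ctd(metadata):
    """Min-max normalize the CTD columns of the metadata DataFrame."""
    for col in CTD_COLUMNS:
        col_min, col_max = metadata[col].min(), metadata[col].max()
        metadata[col] = (metadata[col] - col_min) / (col_max - col_min)
    return metadata

# Example usage:
# metadata = normalize_ctd(metadata)
# metadata.to_csv("metadata.csv", index=False)
```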

Step 11 - Generate a balanced version of the full dataset

  1. Count the occurrences of each class;
  2. Apply an undersampling strategy to crop the larger classes to the size of the smallest one (see the undersampling sketch below);
  3. Save all the data into a .csv file.
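
A minimal sketch of the undersampling with pandas; the 'label' column name and the random seed are assumptions:

```python
import pandas as pd

def balance_dataset(metadata, label_column="label", seed=42):
    """Undersample every class to the size of the smallest one."""
    smallest = metadata[label_column].value_counts().min()
    balanced = (
        metadata.groupby(label_column, group_keys=False)
                .sample(n=smallest, random_state=seed)
    )
    return balanced.reset_index(drop=True)

# Example usage:
# balanced = balance_dataset(pd.read_csv("metadata.csv"))
# balanced.to_csv("metadata_balanced.csv", index=False)
```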

Step 12 - Generate metadata for shorter time periods

  1. Read the original metadata generated on Step 10;
  2. Create a new column named sub_init to accommodate the time offset where each new entry starts;
  3. Split each .csv row according to the chosen duration (see the splitting sketch below);
  4. Create a new row for each new entry;
  5. Save all the data into a .csv file.
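
A minimal sketch of the row splitting, producing one entry per sub-window with its sub_init offset; the column handling is an assumption:

```python
import pandas as pd

def split_by_duration(metadata, chunk_seconds):
    """Split each metadata row into fixed-duration entries with a 'sub_init' offset."""
    rows = []
    for _, row in metadata.iterrows():
        n_chunks = int(row["duration"] // chunk_seconds)
        for i in range(n_chunks):
            new_row = row.copy()
            new_row["sub_init"] = i * chunk_seconds  # offset inside the original file
            new_row["duration"] = chunk_seconds
            rows.append(new_row)
    return pd.DataFrame(rows).reset_index(drop=True)

# Example usage with an assumed 5-second chunk size:
# small = split_by_duration(pd.read_csv("metadata.csv"), chunk_seconds=5)
# small.to_csv("metadata_5s.csv", index=False)
```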

Step 13 - Split dataset into Train, Test and Validation

  1. Read all the metadata;
  2. Randomly shuffle the data (see the shuffling sketch below);
  3. Save all the data into three .csv files: Train, Validation, and Test.
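
A minimal sketch of the shuffled split; the 80/10/10 proportions and the seed are assumptions:

```python
import pandas as pd

def split_train_val_test(metadata, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the metadata and split it into train, validation, and test sets."""
    shuffled = metadata.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (
        shuffled.iloc[:n_train],
        shuffled.iloc[n_train:n_train + n_val],
        shuffled.iloc[n_train + n_val:],
    )

# Example usage:
# train, val, test = split_train_val_test(pd.read_csv("metadata.csv"))
# train.to_csv("train.csv", index=False)
# val.to_csv("validation.csv", index=False)
# test.to_csv("test.csv", index=False)
```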

Reference

The results from this work were published in IEEE Access, under the following reference:

An Investigation of Preprocessing Filters and Deep Learning Methods for Vessel Type Classification With Underwater Acoustic Data

@article{domingos2022investigation,
  author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
  journal={IEEE Access}, 
  title={An Investigation of Preprocessing Filters and Deep Learning Methods for Vessel Type Classification With Underwater Acoustic Data}, 
  year={2022},
  volume={10},
  number={},
  pages={117582-117596},
  doi={10.1109/ACCESS.2022.3220265}}

A complete literature review containing the background knowledge of this work is available in the following reference:

A Survey of Underwater Acoustic Data Classification Methods Using Deep Learning for Shoreline Surveillance

@article{domingos2022survey,
  author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
  title={A Survey of Underwater Acoustic Data Classification Methods Using Deep Learning for Shoreline Surveillance},
  volume={22},
  ISSN={1424-8220},
  url={http://dx.doi.org/10.3390/s22062181},
  DOI={10.3390/s22062181},
  number={6},
  publisher={MDPI AG},
  journal={Sensors},
  year={2022},
  month={Mar},
  pages={2181}
}

Acknowledgements

This code, as well as the pipeline formulation and the code used as its basis, was developed in collaboration with Phillip Skelton.

Thanks to Paulo Santos for the guidance and participation in this project.