Create an Orcasound data catalogue and facilitate data access

valentina-s commented 1 year ago

This project aims to facilitate access to Orcasound data. Orcasound data is part of the Registry of Open Data on AWS. Because the data is stored as a stream of small .ts files, it can be hard for a newcomer to query. The goal of this project is to improve the quality of the Orcasound data by following the FAIR (Findability, Accessibility, Interoperability, and Reuse) principles for scientific digital assets. The aim is to build a data catalogue and a user-friendly package that facilitates access and abstracts away the dependence on the data structure, which may change in the future. Useful features would include the ability to quickly identify when data are available and to retrieve audio based on node, time range, time frequency, etc. in a desired output format. The orca-hls-utils package has some of this functionality and would benefit from more abstraction, testing, and documentation. Many other projects will benefit from this package.
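To make the "retrieve audio based on node and time range" idea concrete, here is a minimal sketch of how a package might map a requested time range onto candidate segment keys. It rests on assumptions that this issue does not confirm: that each stream folder is named by the Unix time at which it started, that segments are named like `live000.ts`, and that every segment has a fixed duration (10 s here). `Segment` and `segments_for_range` are hypothetical names, not part of orca-hls-utils.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    node: str
    folder_start: int  # Unix time (s) at which the stream folder began
    index: int         # position of the segment within that folder

    @property
    def key(self) -> str:
        # ASSUMED key layout: <node>/hls/<folder_start>/live<index>.ts
        return f"{self.node}/hls/{self.folder_start}/live{self.index:03d}.ts"


def segments_for_range(node, folder_starts, start, end, segment_seconds=10.0):
    """Return candidate segments overlapping the half-open range [start, end).

    Times are Unix seconds. A folder is ASSUMED to run until the next
    folder starts, so gaps in recording would yield keys that do not exist;
    a catalogue lookup would be needed to confirm them.
    """
    if end <= start:
        return []
    folders = sorted(folder_starts)
    # Folders before the last one that began at or before `start` cannot overlap.
    first = max(bisect_right(folders, start) - 1, 0)
    segments = []
    for i in range(first, len(folders)):
        folder = folders[i]
        if folder >= end:
            break
        folder_end = folders[i + 1] if i + 1 < len(folders) else end
        lo, hi = max(start, folder), min(end, folder_end)
        if hi <= lo:
            continue
        first_idx = int((lo - folder) // segment_seconds)
        last_idx = math.ceil((hi - folder) / segment_seconds) - 1
        for idx in range(first_idx, last_idx + 1):
            segments.append(Segment(node, folder, idx))
    return segments
```

Calling `segments_for_range("rpi_orcasound_lab", folder_starts, start, end)` returns `Segment` objects whose `.key` could then be handed to whatever download layer the package ends up using; keeping the key layout in one place is what would let the storage structure change later without touching callers.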

Expected outcomes: A Python package that eases access to free, open Orcasound audio data.

Required Skills: Object Oriented Python, Project Packaging

Bonus Skills: ffmpeg, Cloud Computing, experience working with large datasets

Mentors: Valentina, Scott

Difficulty level: Hard

Project Size: 175 or 350 h

Resources:
OOIPY: a package for accessing data from the Ocean Observatories Initiative
Amazon S3 Inventory: a service that creates an inventory catalogue for data on Amazon S3, which can be automatically updated and stored in CSV or Parquet format
fsspec: a Python package for interfacing with different filesystems in the same way
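As a rough illustration of how an Amazon S3 Inventory listing could feed a catalogue, the sketch below reads inventory rows in CSV form and counts the `.ts` segments per node and stream folder. The column layout (bucket first, URL-encoded key second, no header row) and the key layout `<node>/hls/<folder_start>/<segment>.ts` are assumptions to check against the real inventory manifest; `build_catalogue` is a hypothetical helper.

```python
import csv
from collections import defaultdict
from urllib.parse import unquote


def build_catalogue(inventory_csv):
    """Map (node, folder_start) -> number of .ts segments listed.

    `inventory_csv` is any iterable of CSV lines (an open file, for example).
    ASSUMED columns: bucket, key, then optional fields that are ignored here.
    """
    counts = defaultdict(int)
    for row in csv.reader(inventory_csv):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        key = unquote(row[1])  # keys are assumed to be URL-encoded
        parts = key.split("/")
        # ASSUMED key layout: <node>/hls/<folder_start>/<segment>.ts
        is_segment = (
            len(parts) == 4
            and parts[1] == "hls"
            and parts[2].isdigit()
            and parts[3].endswith(".ts")
        )
        if is_segment:
            counts[(parts[0], int(parts[2]))] += 1
    return dict(counts)
```

With a fixed segment duration, the count per folder gives an estimate of how much audio each folder holds, which is already enough to answer "when are data available?" without listing the bucket.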

Points to consider in your proposal:

How would you optimize for accessing many small files?
Can you parallelize some operations?
Can you isolate the dependence on the cloud provider?
Can access to a catalogue abstract and speed up the data access?
Can some data be cached?
What would be the API?
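One possible way to approach several of these questions at once is sketched below: downloads run in a thread pool (parallelism for many small files), results are cached on local disk, and the cloud provider is hidden behind a plain `fetch(key) -> bytes` callable that the caller supplies. This is an illustration under those assumptions, not an existing orca-hls-utils API; `fetch_all` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_all(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],
    cache_dir,
    max_workers: int = 8,
) -> List[Path]:
    """Download `keys` in parallel, reusing files already present in `cache_dir`.

    `fetch` is the only provider-specific piece; it could wrap S3, another
    object store, or a local directory. Keys are assumed to be unique.
    Returns local paths in the same order as `keys`.
    """
    cache_dir = Path(cache_dir)

    def fetch_one(key: str) -> Path:
        target = cache_dir / key
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            # Write to a temporary name first so that an interrupted
            # download is never mistaken for a complete cached file.
            tmp = target.with_name(target.name + ".part")
            tmp.write_bytes(fetch(key))
            tmp.replace(target)
        return target

    # Threads suit this workload because each task mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

Passing the fetcher in as a function also makes the code easy to test with a fake that returns canned bytes, so no network access is needed in unit tests.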

Getting Started:
Acquaint yourself with the Orcasound data on AWS: access.md
Look through these notebooks that experiment with accessing data.
Compare the performance of reading data directly with orca-hls-utils against reading through the Parquet catalogues. Can you make some speed improvements?
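For the performance comparison, a small timing harness along these lines may be enough to start with. It only measures wall-clock time for any zero-argument callable; the two loaders being compared (direct reads versus catalogue-backed reads) are left to the reader, and `time_loader` is a hypothetical helper name.

```python
import statistics
import time


def time_loader(loader, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of `loader()`."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        loader()
        durations.append(time.perf_counter() - t0)
    return statistics.median(durations)
```

Using the median rather than the mean keeps a single slow network request from dominating the result. Note that any local cache should be cleared between runs if the goal is to measure cold reads.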

scottveirs commented 7 months ago

@vaibhavmehrotraml @ttan06 @zprice12

@paulcretu As we consider this issue further and also revise orcanode code this year, it may be worth re-visiting the file naming convention and size/duration for the FLAC data in the archive-orcasound-net S3 bucket.

Are there ways we can align with the BCHN file naming conventions at the same time as we reorganize Orcasound data access to optimize ambient-sound-analysis efficiency (e.g. parallelization, cost)?

scottveirs commented 7 months ago

Here are a few related discussions, issues, and places where orca-hls-utils is used:

2023 discussion of a new audio data naming scheme for Orcasound (including potential alignment with the BCHN formats used by @ben-hendricks)
2018 issue in orcanode seeking human-readable file names (which guided initial decisions about the FLAC filenames that we've been generating for the last 12 months at Port Townsend as an experiment in lossless streaming and associated costs)
The OrcaHello live inference system accesses the HLS streams via the PrepareDataForPredictionExplorer.py script.

orcasound / orca-hls-utils

Create an Orcasound data catalogue and facilitate data access #12