I wanted to summarise some of the Orcasound ML objectives that could be pursued in the hackathon.
ML Objectives
Improve the detection model: better overall performance, or longer context windows to reduce false detections caused by boat noise
Use the contributed detections as weak labels for retraining the model (continual learning or weakly supervised learning)
Species differentiation (some species labels are available from DCLDE, ONC, Orcasound, and OOI)
Click detection vs. call detection (I have labels for clicks-only data)
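To make the weak-label objective a bit more concrete, here is a minimal sketch of how contributed detections could be turned into window-level weak labels. Everything in it is an assumption for illustration: the function name, the 60 s window, the 30 s tolerance, and the idea that each detection is a single timestamp in seconds from the start of a recording. It is not how the current Orcasound pipeline works.

```python
from dataclasses import dataclass


@dataclass
class Window:
    start_s: float   # window start, seconds from the beginning of the recording
    end_s: float     # window end, seconds
    weak_label: int  # 1 if a contributed detection falls near the window, else 0


def weak_labels_from_detections(detection_times_s, duration_s,
                                window_s=60.0, tolerance_s=30.0):
    """Tile a recording into fixed windows and flag those near a detection.

    Hypothetical helper: window_s and tolerance_s are placeholder values,
    chosen only to show the idea that a report is a noisy pointer in time.
    """
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        near = any(start - tolerance_s <= t <= end + tolerance_s
                   for t in detection_times_s)
        windows.append(Window(start, end, int(near)))
        start += window_s
    return windows


# Two contributed detections in a 10-minute recording.
windows = weak_labels_from_detections([95.0, 400.0], duration_s=600.0)
labels = [w.weak_label for w in windows]
```

The tolerance is what makes these labels "weak": a window flagged 1 only says a call was reported nearby, not that the window itself contains one.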
Available Data
I can upload the OOI (custom agreement) and ONC (CC-BY) data to Azure ahead of time to speed up training during the hackathon. We should also place it on Hugging Face or a Dataverse that allows licensing control, so that others can reproduce the work.
I can provide marine mammal presence/absence labels for 768 instances from OOI (4 TB of negative files), 1469 from Orcasound (68000 negative files), and 17290 from ONC (40 TB of negative files). These ~20k positive files also have species and ecotype annotations for granular classification. About 2500 of the calls have specific start and end timestamps if someone wants to take a shot at call catalogue classification. I can also provide a pre-trained wav2vecU-2 backbone for anyone who is only interested in fine-tuning models.
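For the calls that come with start and end timestamps, one common way to use them is to convert each span into frame-level targets. The sketch below assumes the annotations arrive as (start, end) pairs in seconds relative to the clip; the function name and the 0.25 s hop are hypothetical and would need to match whatever frame rate the chosen model uses.

```python
import math


def frame_targets(call_spans_s, clip_duration_s, hop_s=0.25):
    """Return one 0/1 target per frame; a frame is 1 if it overlaps a call.

    Hypothetical helper: hop_s is a placeholder, not the backbone's real
    frame rate.
    """
    n_frames = math.ceil(clip_duration_s / hop_s)
    targets = [0] * n_frames
    for start_s, end_s in call_spans_s:
        first = max(0, math.floor(start_s / hop_s))   # first overlapping frame
        last = min(n_frames, math.ceil(end_s / hop_s))  # one past the last
        for i in range(first, last):
            targets[i] = 1
    return targets


# A 3-second clip with two annotated calls.
targets = frame_targets([(0.5, 1.0), (2.1, 2.6)], clip_duration_s=3.0, hop_s=0.25)
```

Files that only have a presence/absence label would skip this step and keep a single clip-level target.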
Aspirational Hackathon Formats
It would be nice to formalise a task and leaderboard, similar to how these hackathons/benchmarks do it.
DCASE Challenge Task 5 is a good hackathon motivation: https://dcase.community/challenge2022/task-few-shot-bioacoustic-event-detection
I also like the WILDS challenge, which lacks audio: https://wilds.stanford.edu/
I can provide data, some data loading scripts in PyTorch or Hugging Face, and a test set environment if we want to follow these hackathons' approaches.
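As a starting point for the leaderboard, the test set environment could score a submission roughly like this. This is only a sketch under my own assumptions: submissions are a mapping from file ID to a 0/1 presence prediction, and precision, recall, and F1 are the reported metrics. The function name, file IDs, and metric choice are all placeholders we would need to agree on.

```python
def score_submission(truth, predictions):
    """Precision, recall and F1 for binary presence/absence labels by file ID.

    Hypothetical scorer: the submission format and metrics are assumptions.
    """
    missing = set(truth) - set(predictions)
    if missing:
        raise ValueError(f"submission is missing {len(missing)} file(s)")
    tp = sum(1 for k, y in truth.items() if y == 1 and predictions[k] == 1)
    fp = sum(1 for k, y in truth.items() if y == 0 and predictions[k] == 1)
    fn = sum(1 for k, y in truth.items() if y == 1 and predictions[k] == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy held-out labels and a toy submission (made-up file IDs).
truth = {"a.wav": 1, "b.wav": 1, "c.wav": 0, "d.wav": 0}
submission = {"a.wav": 1, "b.wav": 0, "c.wav": 1, "d.wav": 0}
scores = score_submission(truth, submission)
```

Given how many more negative files than positive ones there are, accuracy alone would be misleading, which is why the sketch reports precision and recall instead.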