orcasound / orcagsoc

Google Summer of Code projects and products related to Orcasound & orca sounds
http://orcasound.net
MIT License

Github Action Workflows for Scheduled Algorithm Deployment #25

Open valentina-s opened 3 years ago

valentina-s commented 3 years ago

Develop a workflow using GitHub Actions to apply processing functions to a stream of data (for example, the OOI or Orcasound hydrophone streams).

GitHub Actions scheduled workflows can only run at a fairly coarse rate (roughly every 30 minutes in practice), but they are free, so when the data is also publicly available (which is the case for the OOI and Orcasound streams), this approach lets anybody test their models on the long streams and then compare the results.
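As a rough illustration of what such a scheduled workflow could look like, here is a minimal sketch; the cron interval, dependency file, and processing script are placeholders, not files from this repository:

```yaml
name: scheduled-processing

on:
  schedule:
    - cron: "*/30 * * * *"   # run roughly every 30 minutes; scheduled runs are best-effort

jobs:
  process:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt   # placeholder dependency file
      - run: python process_stream.py          # placeholder processing script
```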

Required skills: Python, GitHub, basic signal processing
Bonus skills: Docker, cloud computing, deep learning
Possible mentor(s): Valentina @valentina-s, Scott @scottveirs, Val @veirs
References:

An example of a github action applied to the Ocean Observatories Initiative Hydrophone Stream: https://github.com/orcasound/orca-action-workflow

A package to read data from the Ocean Observatories Archive: https://ooipy.readthedocs.io/en/latest/

Getting Started:

Points to consider in the proposal:

- What are the inputs of the system, what are the outputs? Where will the inputs and outputs live?
- What are the simple operations that can be done within the Github actions?
- How can they be streamlined using Docker containers?
- How can they be extended using Cloud computing resources?
- How can the results be organized for easy access?
- Can this be extended for multiple processes?

veirs commented 3 years ago

@valentina-s, I see your CronPy inside your main.yml script, which will call your script.py every 5 minutes after it is started. Where are you setting the conditions that will start CronPy running? Is it always running? GitHub Actions are completely new to me. And where is it running? On your local clone of the GitHub repo?

valentina-s commented 3 years ago

CronPy is always running (based on the schedule). You can look at the progress here:

https://github.com/valentina-s/cron_action/actions

There is another workflow called Manual workflow, which you can trigger manually or make run based on some event.
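For reference, the trigger conditions live in the workflow file's `on:` block, and the jobs run on GitHub-hosted virtual machines rather than on a local clone. A minimal sketch (not the actual contents of main.yml) of combining a scheduled trigger with a manual one:

```yaml
on:
  schedule:
    - cron: "*/5 * * * *"   # the scheduled ("CronPy"-style) trigger; runs are best-effort
  workflow_dispatch:        # adds a "Run workflow" button under the Actions tab
  # other events (push, issue_comment, ...) could be listed here as well
```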

Molkree commented 3 years ago

Hi, @valentina-s! I have a few questions about this project, would appreciate more info!

  1. What is the end goal? Is the intention to analyze the sound files to detect killer whales using tools from orcaml, orcaal or orcadata?

Yes, it is an opportunity to test currently existing and future algorithms (could be orca call detection, click detection, ship detection, etc.) directly on the streams of data. That would reveal more than testing on a fixed dataset would.

  2. How "live" should this be? Using OOI like in your example workflow won't allow for much liveness, because it seems like they upload files only once or twice a day. Using the Orcasound streams should be better if we want to be closer to real time, assuming you make the data available sooner (I haven't tested yet).

It does not need to be that live; it can be per day, per few hours, or even more rarely. There are syncing issues with OOI, and we are investigating whether they can be sped up, but processing the previous day can be a perfectly valid approach. For Orcasound it can be closer to real time.

  3. Instead of spawning a new workflow run every 5 minutes, it might be better to let one run for a few hours (one job can run for 6 hours max, per the docs). In the case of OOI data, if we figure out when they upload and whether they do it consistently (or, even better, if they have documented when they do it), we can just run the processing pipeline when needed and work on all new files.

There is a gap between 12 PM UTC and ~11 PM UTC, so yes, working on the previous day of data can work too, unless it is too much data and needs to be split.

  4. If OOI uploads data inconsistently, we can cache the last known file(name) for the day and process files during the next workflow run only if they are new.

Yes, that would be good, though they do a big dump in the evening, so suddenly there will be a lot of new files! As I said, this may be improved on the OOI side, so just think of times when there is data for now.

  5. Before we reach the end goal from (1), what kind of simple preprocessing do you have in mind? "Calculate total power spectrum", is that just to determine whether there are loud noises in the recording? I'll be honest, my DSP skills might be the weakest point for this project :) "Calculate some properties based on it and decide how to organize the results" I guess this is in the same vein; I would love to hear what properties you want to know.

Yes, the total power spectrum would just be summing up all the spectrogram pixel values within a time window to detect loud noises; this can of course be refined to look for noise in specific frequency bands, or to directly apply some of the preprocessing and model scripts @kunakl07 created last summer: https://github.com/orcasound/orcaal-research/tree/master/src

  6. "Collect results and upload somewhere". For starters we can upload results as artifacts (here I've added simple uploads to your example workflow so we can see the spectrogram after the run). But this is only helpful for debugging/development of course; I guess the idea is to send the results of (some) processing to other Orcasound tools (potentially alerts of detected orcas if it happened recently?). +1

    For example, the total power spectrum (or other statistics) could go in a csv file. Maybe then another github action can act on the output.

  7. "Log process". What kind of logging? :) Simple prints inside the workflow, output log files, sending it somewhere else? "Log properly when the pinging does not return anything". Do you mean when there is no data returned from either OOI/Orcasound (like how the script fails if there are no files in the directory), or the fact that our processing didn't detect anything of interest?

    Yes: failures, summary statistics, outputs of the functions.

  8. "Experiment with longer periods of time, what are the limitations". Not quite sure what this one means; like I said above, individual jobs have a 6-hour limit (and workflows 72 hours).

    I was not sure whether it can run continuously for 6 hours or whether it times out sooner. There might be some RAM limitations. The documentation says it can run every 5 minutes, but in practice it does not, so there could be some limitations of the free service.

Most of these might be just my misunderstanding of the idea/project; I'll send a few more follow-up questions later :D

Molkree commented 3 years ago

More follow-up questions/thoughts:

  9. "What are the inputs of the system, what are the outputs? Where will the inputs and outputs live?" a) "What are the inputs of the system?" As I don't fully understand the intent of this project, I can only guess that inputs could be the argument to choose which data stream to process (OOI/Orcasound/etc.), and possibly which processing algorithm to apply to the data. Maybe the input would depend on the chosen algorithm, but I feel it would then be best to create separate workflows for different algorithms. For example, if we have one (oversimplified) workflow that simply pulls new data, applies a bandpass filter [2000, 6000] and posts results as artifacts/sends them somewhere, someone might want to manually trigger a workflow for a given day N with a bandpass filter [1000, 5000] instead. Then we would add appropriate inputs to the workflow, but it all depends on the type of algorithm used. Though one input common to them all would be choosing the data source/timeframe to analyze.

    b) "What are the outputs?" Like I mentioned above, my understanding is that an alert about a detected killer whale/some other interesting activity is the expected end output. It would probably take the form of a POST request to one of the Orcasound services. Another possibility is just processed audio files/spectrograms/etc. If this tool is the endpoint then just upload them as artifacts, otherwise send them further down the pipeline.

    c) "Where will the inputs live?" As the inputs are just run arguments, they are not stored anywhere explicitly; they could be workflow inputs entered during a manual start, comment contents if the workflow starts from comments on issues/PRs, arguments set by another GH Action, and possibly a few other things.

    d) "Where will the outputs live?" We can upload them as artifacts to the workflow run, commit them through a PR to the repository if needed, or send them somewhere else. It should be noted that artifacts and run logs are retained for 90 days (400 for Enterprise), so they should be saved somewhere else if longer retention is desired.

  10. "What are the simple operations that can be done within the github actions?" I'm not sure what's considered simple, would love to know :) GitHub Actions are basically free on-demand virtual machines; think Google Colab but on different OSes and with tight GitHub integration.

  11. "How can they be streamlined using Docker containers?" Using Docker images is really simple: you only need to specify the container, it will be pulled for you, and the workflow step will run inside that container (see the sketch after this list).

  12. "How can they be extended using Cloud computing resources?" Not sure about this one. GitHub doesn't provide GPUs or other heavy machinery; running inference is fine, but training ML models or doing something compute-intensive would probably be a bad idea. This only applies to GitHub-hosted runners, though: GitHub allows self-hosted runners, meaning it's possible to configure a runner on your own hardware/in the cloud.

  13. "How can the results be organized for easy access?" See 9b if I understood the question correctly.

  14. "Can this be extended for multiple processes?" If multiple processes means concurrent execution, then yes, separate jobs in a workflow can run in parallel (see the sketch after this list). If multiple processes means different use cases, then also yes, just create different workflows. There are limitations to both, mainly on the number of concurrent jobs (20 for free accounts; other accounts get more) and API requests per hour.
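To illustrate points 11 and 14, here is a minimal sketch; the container image and script names are placeholders, not images or files from this project. A job can run all of its steps inside a Docker container, and separate jobs run in parallel by default:

```yaml
jobs:
  orca-detection:
    runs-on: ubuntu-latest
    container: ghcr.io/example/orca-processing:latest   # placeholder image; pulled automatically
    steps:
      - uses: actions/checkout@v3
      - run: python detect_calls.py    # placeholder script; runs inside the container

  ship-noise:
    runs-on: ubuntu-latest             # separate job; runs in parallel with the one above
    steps:
      - uses: actions/checkout@v3
      - run: python detect_ships.py    # placeholder script
```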

valentina-s commented 3 years ago

In the second set of comments, you are on the right path: there can be different inputs (the streams, or the way the streams are accessed: per file? per unit of time?), and the outputs will differ based on the functions. The point of that question is for students to describe the details in the proposal (like a few scenarios).

There are already some Docker images specifically for the Orcasound data. You can look at the orcaal-research repo, or ask @kunakl07 to direct you to them.

We have some extra cloud resources, so if the GitHub Actions resources are not sufficient, having a way to utilize the cloud ones would be great!
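One possible way to combine the two while keeping the GitHub Actions workflow model is a self-hosted runner registered on a cloud machine; the job then just targets it by label. A sketch with placeholder labels and script name:

```yaml
jobs:
  heavy-processing:
    runs-on: [self-hosted, linux]      # targets a runner registered on the project's own cloud machine
    steps:
      - uses: actions/checkout@v3
      - run: python run_inference.py   # placeholder for work too heavy for GitHub-hosted runners
```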

valentina-s commented 3 years ago

@Molkree you are welcome to submit a PR with the spectrogram artifact. The images will most probably quickly start accumulating and taking up a lot of space. You can change the retention policy to delete them after a few days: retention-days: 1
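For reference, the retention override goes on the upload step itself; a minimal sketch with placeholder artifact name and path:

```yaml
- name: Upload spectrograms
  uses: actions/upload-artifact@v3
  with:
    name: spectrograms
    path: output/*.png    # placeholder path to the generated images
    retention-days: 1     # delete after one day instead of the default 90
```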

Molkree commented 3 years ago

The images will most probably quickly start accumulating and taking up a lot of space.

@valentina-s Everything should be free for public repos so it's of no concern.

Storing artifacts uses storage space on GitHub. GitHub Actions usage is free for both public repositories and self-hosted runners. For private repositories, each GitHub account receives a certain amount of free minutes and storage, depending on the product used with the account.

From docs.

Alright, I'll send a PR shortly; I might also change it to process more than one file, and to process the previous day by default, since there were no new files today...

Molkree commented 3 years ago

@valentina-s, quick question, have you ever seen error 404?

if r == 'Response [404]':

from script

Edit: ahhh, alright, 404 is what I'm getting now.

Molkree commented 3 years ago

@Molkree you are welcome to submit a PR with the spectrogram artifact.

PR opened, it has a bit more than artifact uploading though 😅