lsetiawan opened this issue 3 years ago
Hi @lsetiawan, just wanted to check in to see how you're doing with this. The architecture of Pangeo Forge has evolved quite a bit since you first opened this Issue, so I wouldn't be surprised if your early experiments have needed / will need to be updated. Since this is on some level a design question related to pangeo-forge-recipes, perhaps we should move the discussion to https://github.com/pangeo-forge/pangeo-forge-recipes/issues.
Hi all!
I am currently trying to take the concepts in pangeo-forge and apply them to OOI data in https://github.com/ooi-data. Firstly, I think that this is a really great project evolution for pangeo and I am looking forward to seeing where it's headed. Currently I am working on converting a large amount of OOI data into zarr files. I have tried using prefect and dask for this, running everything as a K8s cronjob, but stumbled into a lot of roadblocks around getting the status and history of the data pipeline and seeing whether anything broke in the process.
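For reference, the conversion itself is conceptually simple once the files are in hand. A minimal sketch, assuming the fetched OOI files are netCDF (the paths and chunk sizes here are placeholders, not my actual config):

```python
# Minimal conversion sketch: open all downloaded netCDF files as one
# dask-backed dataset, then write a consolidated zarr store.
import xarray as xr

ds = xr.open_mfdataset("downloads/*.nc", combine="by_coords", chunks={"time": 10000})
ds.to_zarr("ooi-stream.zarr", mode="w", consolidated=True)
```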
After running across pangeo-forge a few months back, I really loved the idea of being able to create a data pipeline that combined github actions and prefect! However, I couldn't fully adapt the current pangeo-forge to my needs, since it assumes the source dataset is already staged on a server somewhere, ready to be pulled. The OOI system works differently: the user has to request the data, wait, and then fetch it. I couldn't see a way to express that step in pangeo-forge. One solution I thought might work is to have request-and-wait tasks within the prefect flow, but that means a lot of idle sitting around for the Kubernetes pod.
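To illustrate what that would look like, here is a rough sketch of the request/wait/fetch pattern as Prefect (1.x-style) tasks. The endpoint, auth, and response field names are placeholders rather than the real OOI M2M API; the point is that the polling task keeps the pod busy doing nothing:

```python
import time
import requests
from prefect import task, Flow

@task
def request_data(stream: str) -> str:
    # Submit an asynchronous data request; the response points at a status URL
    resp = requests.get(f"https://example-ooi-api/request/{stream}",
                        auth=("USER", "TOKEN"))
    resp.raise_for_status()
    return resp.json()["status_url"]  # placeholder field name

@task
def wait_for_data(status_url: str, poll_seconds: int = 60) -> str:
    # Poll until the request is fulfilled; the pod sits idle this whole time
    while True:
        status = requests.get(status_url).json()
        if status.get("complete"):
            return status["data_url"]  # placeholder field name
        time.sleep(poll_seconds)

@task
def fetch_data(data_url: str) -> None:
    # Download the staged files so the conversion step can run
    ...

with Flow("ooi-request-wait-fetch") as flow:
    status_url = request_data("<stream-id>")
    data_url = wait_for_data(status_url)
    fetch_data(data_url)
```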
Because of those roadblocks, I decided to take the concepts and use https://github.com/pangeo-forge/terraclimate-feedstock as an example to create a pangeo-forge-esque POC system, where GitHub Actions performs the request and wait steps, followed by a step for the actual processing.
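To give a flavor of the idea (file names, URLs, and helpers here are hypothetical, not my actual repo code): the request and the readiness check become two short-lived scripts that scheduled GitHub Actions runs can call, so nothing has to block inside a long-running pod:

```python
import json
import sys
import requests
from pathlib import Path

STATE = Path("history/request.json")  # git-tracked request record

def request_step(stream: str) -> None:
    # Submit the request and record where to check on it; this run ends here
    resp = requests.get(f"https://example-ooi-api/request/{stream}")
    resp.raise_for_status()
    STATE.parent.mkdir(exist_ok=True)
    STATE.write_text(json.dumps({"stream": stream,
                                 "status_url": resp.json()["status_url"]}))

def check_step() -> int:
    # Exit 0 when the data is ready (the workflow proceeds to processing);
    # exit 1 otherwise (the workflow stops and the next scheduled run re-checks)
    record = json.loads(STATE.read_text())
    status = requests.get(record["status_url"]).json()
    return 0 if status.get("complete") else 1

if __name__ == "__main__":
    if sys.argv[1] == "request":
        request_step(sys.argv[2])
    else:
        sys.exit(check_step())
```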
I also added the idea of a git-tracked history of requests and processing runs to provide full provenance for the dataset. This is not fully baked yet, but you can see an example of the history for both the request and process steps.
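The idea is roughly the following (paths and field names are my own guesses, not the actual ooi-data layout): each request or processing event appends a JSON record to a history file and commits it, so the repo's git history doubles as a pipeline audit trail:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_event(repo: Path, stage: str, detail: dict) -> None:
    # One JSON-lines file per stage, e.g. history/request.jsonl
    history = repo / "history" / f"{stage}.jsonl"
    history.parent.mkdir(exist_ok=True)
    entry = {"time": datetime.now(timezone.utc).isoformat(), **detail}
    with history.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    # Commit the record so provenance lives in the git log
    subprocess.run(["git", "-C", str(repo), "add", str(history)], check=True)
    subprocess.run(["git", "-C", str(repo), "commit",
                    "-m", f"{stage}: {entry['time']}"], check=True)
```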
Then there's the separate problem of replicating this for all the datasets that OOI has. So I decided to use GitHub's repository templates to have a nice template to copy from: https://github.com/ooi-data/stream_template. I also found a way to keep all the dataset repos in sync with the template by using https://github.com/koj-co/update-template.
The processing step doesn't actually run anything yet, as I am still working on the backend logistics within my K8s cluster, but you can see a screenshot of the running pipeline below.
I just thought I should share my experience combining the power of GitHub Actions + Prefect to create a data pipeline based on the ideas of pangeo-forge. Thank you for creating this great project and laying out the roadmap. I hope my ideas and prototype spark other ideas. At some point I would love to port everything I have over to pangeo-forge and contribute to the project in any capacity that I can :smile: