ioos / ioos_metrics

Working on creating metrics for the IOOS by the numbers
https://ioos.github.io/ioos_metrics/
MIT License

refactor this repo #35

Open MathewBiddle opened 11 months ago

MathewBiddle commented 11 months ago

This process is very confusing at the moment. I've tried to update the README to document how to update the webpages. However, there are lots of interwoven dependencies and step-wise processes that need to be executed in a specific order to make everything work.

I'm starting this issue to do two things.

  1. Document what I'm doing now.
  2. Make a plan for how to simplify the steps and, hopefully, resolve #29 in the end.

What is happening now:

  1. ATN GTS metrics https://ioos.github.io/ioos_metrics/gts_atn.html
    1. get calculated by https://github.com/ioos/ioos_metrics/blob/main/gts_atn_metrics.py (by ripping through an HTML directory; see the sketch after this list) and saved to https://github.com/ioos/ioos_metrics/blob/main/gts/GTS_ATN_monthly_totals.csv.
    2. The result is then used to create the website via https://github.com/ioos/ioos_metrics/blob/main/website/create_gts_atn_landing_page.py.
  2. Regional GTS metrics https://ioos.github.io/ioos_metrics/gts_regional.html
    1. Calculated in https://github.com/ioos/ioos_metrics/blob/main/gts_regional_metrics.py from the data hosted at https://www.ndbc.noaa.gov/ioosstats/ and saved to https://github.com/ioos/ioos_metrics/tree/main/gts/.
      1. We are also serving those source data via ERDDAP at https://erddap.ioos.us/erddap/search/index.html?page=1&itemsPerPage=1000&searchFor=GTS (I have a script, below, to pull new data over from NDBC for ERDDAP, which I run manually):
        #!/bin/bash
        # Download the monthly GTS report CSVs from NDBC, one month at a time.
        start=2018-01-01  # when NDBC started collecting data
        end=$(date +%Y-%m-%d)
        while ! [[ $start > $end ]]; do
            date_fmt=$(date -d "$start" +%Y_%m)
            start=$(date -d "$start + 1 month" +%Y-%m-%d)
            echo "Downloading $date_fmt..."
            # IOOS Regional
            wget -N https://www.ndbc.noaa.gov/ioosstats/rpts/"$date_fmt"_ioos_regional.csv -nH -P ioos_regional -a logfile_regional.txt
            # NDBC
            wget -N https://www.ndbc.noaa.gov/ioosstats/rpts/"$date_fmt"_ndbc.csv -nH -P ndbc -a logfile_ndbc.txt
            # non-NDBC
            wget -N https://www.ndbc.noaa.gov/ioosstats/rpts/"$date_fmt"_non_ndbc.csv -nH -P non_ndbc -a logfile_non_ndbc.txt
        done
    2. The calculated quarterly files are then read by https://github.com/ioos/ioos_metrics/blob/main/website/create_gts_regional_landing_page.py to create the webpage.
  3. Asset inventory https://ioos.github.io/ioos_metrics/asset_inventory.html
    1. Generated from https://github.com/ioos/ioos-asset-inventory/blob/main/inventory_creation.ipynb, with the data saved to a yearly directory in https://github.com/ioos/ioos-asset-inventory/tree/main
    2. Those data are read by ERDDAP via a manual git pull of the ioos-asset-inventory repo on the ERDDAP server.
    3. The webpage is then generated with https://github.com/ioos/ioos_metrics/blob/main/website/create_asset_inventory_page.py by reading the data back from ERDDAP.
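
For reference, the "ripping through an HTML directory" step in 1.1 looks roughly like this (a sketch only; the listing URL and the "count" column are placeholders, not what gts_atn_metrics.py actually uses):

# Sketch of the HTML-directory crawl in step 1.1. The listing URL and the
# "count" column are hypothetical; see gts_atn_metrics.py for the real logic.
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = "https://example.com/atn_reports/"  # hypothetical directory listing

html = requests.get(BASE, timeout=60).text
links = [
    a["href"]
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)
    if a["href"].endswith(".csv")
]

# Total each monthly report, then persist the running monthly totals.
records = []
for link in links:
    monthly = pd.read_csv(BASE + link)
    records.append({"month": link.rsplit(".", 1)[0], "total": monthly["count"].sum()})

pd.DataFrame(records).to_csv("gts/GTS_ATN_monthly_totals.csv", index=False)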

How can this process be simplified to:

  1. Catch bugs (see the sketch after this list)
  2. Update with new data
  3. Make it relatively hands-off
  4. Ensure data are available for other uses (load into ERDDAP)
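
For goal 1, even a tiny automated check over the generated CSVs would help, e.g. run under pytest in CI (a sketch; the column names are guesses about GTS_ATN_monthly_totals.csv):

# Sketch of a CI sanity check; "date" and the numeric columns are assumptions.
import pandas as pd

def test_monthly_totals_are_sane():
    df = pd.read_csv("gts/GTS_ATN_monthly_totals.csv")
    assert not df.empty
    assert df["date"].is_unique  # one row per month (assumed column name)
    assert (df.select_dtypes("number") >= 0).all().all()  # counts are non-negative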

MathewBiddle commented 11 months ago

I think what I would like to see is all of the data used in the metrics website made available through the IOOS ERDDAP. Then, lightweight scripts would bring the data in and build the webpages. The trouble I have is deciding where to put the scripts that generate the datasets that then get served on the IOOS ERDDAP.
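
Something like this per page (a sketch; gts_ndbc_statistics is the dataset ID mentioned later in this thread, and the output filename is illustrative):

# Sketch of the proposed lightweight pattern: pull a dataset from the IOOS
# ERDDAP as CSV, then render a simple HTML page from it.
import pandas as pd

ERDDAP = "https://erddap.ioos.us/erddap/tabledap"
dataset_id = "gts_ndbc_statistics"  # example dataset ID

# tabledap serves any table as CSV; skiprows=[1] drops ERDDAP's units row.
df = pd.read_csv(f"{ERDDAP}/{dataset_id}.csv", skiprows=[1])

page = f"""<html><body>
<h1>GTS regional metrics</h1>
{df.to_html(index=False)}
</body></html>"""

with open("gts_regional.html", "w") as f:
    f.write(page)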

MathewBiddle commented 11 months ago

Current flow:


%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#007396',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#003087',
      'lineColor': '#003087',
      'secondaryColor': '#007396',
      'tertiaryColor': '#CCD1D1'
    },
   'flowchart': { 'curve': 'basis' }
  }
}%%

flowchart LR

pA["html"]
A["gts_atn_metrics.py"]
B["GTS_ATN_monthly_totals.csv"]
C["create_gts_atn_landing_page.py"]

subgraph ATN
A --> pA
A --> B
C --> B
end

D["ioosstats/"]
E["gts_regional_metrics.py"]
F["ioos_metrics/tree/main/gts/"]
G["create_gts_regional_landing_page.py"]

subgraph GTS
E --> D
E --> F
G --> F
end

H["inventory_creation.ipynb"]
I["ioos-asset-inventory/tree/main"]
J["IOOS ERDDAP"]
K["create_asset_inventory_page.py"]

subgraph inventory
H --> I
I --> J
K --> J
end

MathewBiddle commented 11 months ago

I think it should look like:


%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#007396',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#003087',
      'lineColor': '#003087',
      'secondaryColor': '#007396',
      'tertiaryColor': '#CCD1D1'
    },
   'flowchart': { 'curve': 'basis' }
  }
}%%

flowchart LR

L["ERDDAP"]
A["gts_atn_metrics.py"]
C["create_gts_atn_landing_page.py"]
E["gts_regional_metrics.py"]
G["create_gts_regional_landing_page.py"]
H["inventory_creation.ipynb"]
K["create_asset_inventory_page.py"]

E --> L
L --> G
A --> L
L --> C
H --> L
L --> K

MathewBiddle commented 5 months ago

I am also calculating IOOS by the Numbers in this notebook, https://github.com/ioos/ioos_metrics/blob/main/IOOS_BTN.ipynb, which writes to a CSV file: https://github.com/ioos/ioos_metrics/blob/main/ioos_btn_metrics.csv. I would like to define a process for running that notebook (or the code inside it) and writing the data so it can be hosted on the IOOS ERDDAP: https://erddap.ioos.us/erddap/index.html
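
A possible hands-off runner for the notebook (a sketch, assuming IOOS_BTN.ipynb executes cleanly top to bottom; the notebook itself writes ioos_btn_metrics.csv, which ERDDAP could then pick up):

# Execute IOOS_BTN.ipynb headlessly so it can run on a schedule.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("IOOS_BTN.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=1800, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})  # run all cells in the repo root

# Optionally save the executed notebook for inspection.
nbformat.write(nb, "IOOS_BTN_executed.ipynb")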

related to #8

MathewBiddle commented 5 months ago

💡 For 2.a (https://github.com/ioos/ioos_metrics/issues/35#issue-1845474591), I could run that shell script as a cron job on AWS. Then we wouldn't have to worry about the ERDDAP endpoint getting out of sync because someone forgot to pull new data. Probably best to run on the 5th of each month...

MathewBiddle commented 5 months ago

Set up the cron job:

$ crontab -l
0 12 5 * * get_data.sh
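# i.e., at 12:00 on the 5th of every month (minute hour day-of-month month day-of-week).
# Note: cron's default PATH is minimal, so the full path to get_data.sh is safer.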

Will need to check on the 5th of the month whether it ran.

ocefpaf commented 5 months ago

> I could run that shell script as a cron job on AWS

Could it work as a GHA cron job? Or are there reasons not to go that route?

MathewBiddle commented 5 months ago

I need the data where ERDDAP can access it. And https://erddap.ioos.us/erddap/index.html is currently on AWS.

MathewBiddle commented 5 months ago

The cron job functioned as expected: https://erddap.ioos.us/erddap/tabledap/gts_ndbc_statistics.htmlTable?Year%2CMonth%2Ctime%2ClocationID%2Csponsor%2Cmet%2Cwave&Year%3E=%222024%22
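
That check could also be scripted against the same endpoint, e.g. (a sketch; orderByMax is a standard tabledap server-side filter, and time is one of the dataset's own columns):

# Ask ERDDAP for the newest record in gts_ndbc_statistics to confirm
# last month's files were ingested.
import pandas as pd

url = (
    "https://erddap.ioos.us/erddap/tabledap/gts_ndbc_statistics.csv"
    "?time&orderByMax(%22time%22)"
)
latest = pd.read_csv(url, skiprows=[1])  # skiprows drops ERDDAP's units row
print("Most recent record:", latest["time"].iloc[0])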