oceanmodeling / searvey

Sea state observational data retrieval
https://searvey.readthedocs.io/en/stable/
GNU General Public License v3.0

Add USGS storm based time series source #14

Closed saeed-moghimi-noaa closed 1 year ago

saeed-moghimi-noaa commented 2 years ago

@zacharyburnettNOAA @SorooshMani-NOAA

Adding USGS time series from https://stn.wim.usgs.gov/FEV/#FlorenceSep2018

This is code I got from @flackdl. See also this repo developed by Danny: https://github.com/flackdl/cwwed.

import os
import re
import sys
import errno
import requests
from io import open

EVENT_ID_MATTHEW = 135  # default

# capture an event_id from the command line, defaulting to Matthew
EVENT_ID = sys.argv[1] if len(sys.argv) > 1 else EVENT_ID_MATTHEW

# file type "data"
# https://stn.wim.usgs.gov/STNServices/FileTypes.json
FILE_TYPE_DATA = 2

# deployment types
# https://stn.wim.usgs.gov/STNServices/DeploymentTypes.json
DEPLOYMENT_TYPE_WATER_LEVEL = 1
DEPLOYMENT_TYPE_WAVE_HEIGHT = 2
DEPLOYMENT_TYPE_BAROMETRIC = 3
DEPLOYMENT_TYPE_TEMPERATURE = 4
DEPLOYMENT_TYPE_WIND_SPEED = 5
DEPLOYMENT_TYPE_HUMIDITY = 6
DEPLOYMENT_TYPE_AIR_TEMPERATURE = 7
DEPLOYMENT_TYPE_WATER_TEMPERATURE = 8
DEPLOYMENT_TYPE_RAPID_DEPLOYMENT = 9

# create output directory
output_directory = 'output'
try:
    os.makedirs(output_directory)
except OSError as exception:
    if exception.errno != errno.EEXIST:
        raise

# fetch event data files
files_req = requests.get('https://stn.wim.usgs.gov/STNServices/Events/{}/Files.json'.format(EVENT_ID))
files_req.raise_for_status()
files_json = files_req.json()

# fetch event sensors
sensors_req = requests.get('https://stn.wim.usgs.gov/STNServices/Events/{}/Instruments.json'.format(EVENT_ID))
sensors_req.raise_for_status()
sensors_json = sensors_req.json()

# filter sensors down to barometric ones
barometric_sensors = [sensor for sensor in sensors_json if sensor.get('deployment_type_id') == DEPLOYMENT_TYPE_BAROMETRIC]

# print file urls for barometric sensors for this event
for file in files_json:
    if file['filetype_id'] == FILE_TYPE_DATA and file['instrument_id'] in [s['instrument_id'] for s in barometric_sensors]:

        file_url = 'https://stn.wim.usgs.gov/STNServices/Files/{}/item'.format(file['file_id'])

        # fetch the actual file
        file_req = requests.get(file_url, stream=True)
        file_req.raise_for_status()

        # capture the filename from the headers so we can save it appropriately
        match = re.match('.*filename="(?P<filename>.*)"', file_req.headers['Content-Disposition'])
        if match:
            filename = match.group('filename')
        else:
            filename = '{}.unknown'.format(file['file_id'])
            print('COULD NOT FIND "filename" in header, saving as {}'.format(filename))

        print('{}\t\t({})'.format(filename, file_url))

        with open('{}/{}'.format(output_directory, filename), 'wb') as f:
            for chunk in file_req.iter_content(chunk_size=1024):
                f.write(chunk)
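
For reference, the script takes the STN event id as an optional first command-line argument and falls back to Matthew (135) when none is given, so it can be pointed at other events (e.g. Florence) by passing the corresponding event id.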
ghost commented 2 years ago

thanks!

saeed-moghimi-noaa commented 2 years ago

Thanks to @flackdl, who just shared the location of the latest file: https://github.com/flackdl/cwwed/blob/ad39f0e9bea6a0a3bdbc937fea41994f4ed359ba/scripts/usgs.py

ghost commented 2 years ago

Great, I've made a first draft of the implementation here: https://github.com/oceanmodeling/StormEvents/blob/7054095b4cb54ac733ea40091a5a2ffa1210c50b/stormevents/usgs/events.py#L313-L375

saeed-moghimi-noaa commented 2 years ago

Thanks @zacharyburnettNOAA. See the email I just sent to Danny.

brey commented 2 years ago

@SorooshMani-NOAA provided some input via email. I repost here for completeness:

Today I noticed this package on GitHub: https://github.com/USGS-python/dataretrieval

I was wondering if this retrieves the same data that you were interested in or if there's another USGS database that you'd like to query?

This one seems to have the following data available for retrieval: instantaneous values (iv), daily values (dv), statistics (stat), site info (site), discharge peaks (peaks), discharge measurements (measurements), and water quality samples (qwdata),

which seems to be what the Water Services REST API provides: https://waterservices.usgs.gov/rest/

George, if this is the same database that Jack is interested in, does it make sense to add a "normalization" wrapper on top of the dataretrieval package, or should searvey use the REST API directly?
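
For reference, a minimal sketch of pulling instantaneous values through dataretrieval, based on my reading of the package's nwis.get_iv interface (the site number, parameter code, and date range are placeholders, and the exact call signature should be treated as an assumption):

import dataretrieval.nwis as nwis

# hypothetical example: USGS site number and parameter code (00065 = gage height)
site = '01646500'
parameter = '00065'

# get_iv is expected to return a (DataFrame, metadata) tuple
df, metadata = nwis.get_iv(
    sites=site,
    parameterCd=parameter,
    start='2022-12-01',
    end='2022-12-07',
)

print(df.head())
# the metadata object is assumed to carry at least the request URL
print(metadata.url)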

brey commented 2 years ago

I looked a bit into dataretrieval and it looks good. It already has users, they are considering creating a conda package (see issue 44 therein), and the lead developer works for USGS, which is beneficial for updates and access.

If it exposes all the data, then we can make a wrapper and use it as an upstream dependency.

We can also invite Timothy Hodson to a meeting and discuss it.

SorooshMani-NOAA commented 2 years ago

Documenting a relevant email between me and @Rjialceky (slightly modified):

[...] CSDL [...] is interested in the following observations in support of the coastal application teams modeling work for NOAA products and services:

  • Surface water level
  • Water level datums, relative and geodetic observations
  • Water temperature
  • Water salinity
  • Water currents

I am primarily interested in datum points in support of navigation products and services; and, where unavailable, interested in the surface water levels to formulate new datums. The challenge of course is to have searvey assemble available observations sourced from NOAA, IOC, USGS, etc. into the normalized categories above. In the case of USGS, the number of [potentially] available parameters to sort out from their observation sites looks especially large—so any software API / wrapper that makes that easier, maintainable, etc. should be leveraged:

https://waterdata.usgs.gov/nwis/uv?referred_module=sw&search_criteria=multiple_site_no&submitted_form=introduction

@brey @pmav99 I can't find the other ticket where we discussed normalization and/or standardization of the outputs. Given the quoted email above, how would you approach adding getter functions? Do we have a template to follow?

pmav99 commented 2 years ago

We don't have a "template". I added some thoughts on how the API could/should look in the wiki: https://github.com/oceanmodeling/searvey/wiki/API-design, but feel free to open a new ticket to further discuss this.

SorooshMani-NOAA commented 2 years ago

So does that mean that, if we want to add USGS data, we (for now) just need to return the raw output we get from their API? In that case, is it really meaningful to have a wrapper around the USGS dataretrieval package, given that it already returns a dataframe?

SorooshMani-NOAA commented 1 year ago

Today I was exploring using the dataretrieval package for obtaining USGS datasets. It seems that dataretrieval removes a lot of metadata from the NWIS response when creating its data tables. For example, when getting the "instantaneous value" record for a station, we might get something like the following response from the web API:

{
    "name": "USGS:0148472405:00035:00000",
    "sourceInfo": {
        "geoLocation": {
            "geogLocation": {
                "latitude": 38.1389722,
                "longitude": -75.18363889,
                "srs": "EPSG:4326"
            },
            "localSiteXY": []
        },
        "note": [],
        "siteCode": [
            {
                "agencyCode": "USGS",
                "network": "NWIS",
                "value": "0148472405"
            }
        ],
        "siteName": "BUNTINGS GUT NEAR CEDARTOWN, MD",
        "siteProperty": [
            {
                "name": "siteTypeCd",
                "value": "ST-TS"
            },
            {
                "name": "hucCd",
                "value": "02040303"
            },
            {
                "name": "stateCd",
                "value": "24"
            },
            {
                "name": "countyCd",
                "value": "24047"
            }
        ],
        "siteType": [],
        "timeZoneInfo": {
            "daylightSavingsTimeZone": {
                "zoneAbbreviation": "EDT",
                "zoneOffset": "-04:00"
            },
            "defaultTimeZone": {
                "zoneAbbreviation": "EST",
                "zoneOffset": "-05:00"
            },
            "siteUsesDaylightSavingsTime": true
        }
    },
    "values": [
        {
            "censorCode": [],
            "method": [
                {
                    "methodDescription": "",
                    "methodID": 234506
                }
            ],
            "offset": [],
            "qualifier": [
                {
                    "network": "NWIS",
                    "qualifierCode": "P",
                    "qualifierDescription": "Provisional data subject to revision.",
                    "qualifierID": 0,
                    "vocabulary": "uv_rmk_cd"
                }
            ],
            "qualityControlLevel": [],
            "sample": [],
            "source": [],
            "value": [
                {
                    "dateTime": "2022-12-06T12:00:00.000-05:00",
                    "qualifiers": [
                        "P"
                    ],
                    "value": "1.2"
                }
            ]
        }
    ],
    "variable": {
        "noDataValue": -999999.0,
        "note": [],
        "oid": "45807109",
        "options": {
            "option": [
                {
                    "name": "Statistic",
                    "optionCode": "00000"
                }
            ]
        },
        "unit": {
            "unitCode": "mph"
        },
        "valueType": "Derived Value",
        "variableCode": [
            {
                "default": true,
                "network": "NWIS",
                "value": "00035",
                "variableID": 45807109,
                "vocabulary": "NWIS:UnitValues"
            }
        ],
        "variableDescription": "Wind speed, miles per hour",
        "variableName": "Wind speed, mph",
        "variableProperty": []
    }
}

But the resulting dataset only contains the following (example not from the same station!):

                           00060 00060_cd     site_no  00065 00065_cd
datetime
2022-12-06 08:45:00-05:00   4.48        P  0148471320   3.72        P

Does it make sense, then, to instead use the web API directly (going back to the original question!)? Since in any case we need to create our own tables of constants, such as parameter codes, quality codes, etc., it may be that dataretrieval doesn't really take much heavy lifting off of searvey development in the end.

There's also the delay in getting issues fixed in dataretrieval and waiting for a release to reach conda for searvey to depend on. Right now, for example, retrieving data from stations in different time zones results in an exception.
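
For illustration, here is a rough sketch of what calling the waterservices instantaneous-values endpoint directly could look like while keeping the site and variable metadata that the JSON above carries; the site number, parameter code, date range, and output column names are placeholders, not part of any existing searvey code:

import pandas as pd
import requests

# query the NWIS instantaneous-values endpoint directly (placeholder site/parameter)
response = requests.get(
    'https://waterservices.usgs.gov/nwis/iv/',
    params={
        'format': 'json',
        'sites': '01646500',      # placeholder site number
        'parameterCd': '00065',   # placeholder parameter code (gage height)
        'startDT': '2022-12-01',
        'endDT': '2022-12-07',
    },
)
response.raise_for_status()

# flatten each time series into rows, keeping site and variable metadata
records = []
for series in response.json()['value']['timeSeries']:
    location = series['sourceInfo']['geoLocation']['geogLocation']
    for point in series['values'][0]['value']:
        records.append({
            'site_no': series['sourceInfo']['siteCode'][0]['value'],
            'site_name': series['sourceInfo']['siteName'],
            'latitude': location['latitude'],
            'longitude': location['longitude'],
            'variable': series['variable']['variableDescription'],
            'unit': series['variable']['unit']['unitCode'],
            'datetime': point['dateTime'],
            'value': float(point['value']),
            'qualifiers': ','.join(point['qualifiers']),
        })

df = pd.DataFrame.from_records(records)
print(df.head())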

SorooshMani-NOAA commented 1 year ago

After discussing the comment above with @pmav99 during the data retrieval meeting, we decided it makes more sense to call the NWIS API directly to start with, and just use our own mapping of responses to data frames.

brey commented 1 year ago

I understand the point, but I wonder if we should first bring this to Timothy's attention (with an issue on dataretrieval) and see what he has to say. Having said that, I leave it up to you guys.

SorooshMani-NOAA commented 1 year ago

I think it would be better to do what you suggest. I already created an issue here https://github.com/USGS-python/dataretrieval/issues/59. In the last meeting only two of us were present, so I just wanted to relay what was discussed. I haven't yet implemented anything for USGS.

mroberge commented 1 year ago

There are a variety of Python packages that use the USGS API. I set up a discussion among the authors here: https://github.com/mroberge/hydrofunctions/issues/79

SorooshMani-NOAA commented 1 year ago

Thank you @mroberge this information is very helpful.

SorooshMani-NOAA commented 1 year ago

I just realized that the get_iv metadata item in the returned tuple can include information about the parameter code or site. I thought that the metadata only includes header or URL information, but if the right arguments are passed, more information is extracted and included. I think the main question now is: how much do we want to keep the data from the REST API untouched?

For IOC and COOPS stations we pretty much return whatever is provided by the web services, but for USGS NWIS we have to do some transformation either way. Can we then just take the output of dataretrieval (or even one of the other packages from https://github.com/oceanmodeling/searvey/issues/14#issuecomment-1346882195) as the main source of data and return it with minimal changes to fit searvey API conventions?
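
If we go that route, the wrapper could stay very thin, as in the hypothetical sketch below: pass the (DataFrame, metadata) tuple from dataretrieval through mostly unchanged and only rename parameter-code columns to descriptive names to fit searvey conventions. The function name, the column mapping, and the get_iv call details are all illustrative assumptions, not existing searvey code.

import dataretrieval.nwis as nwis

# hypothetical mapping from NWIS parameter codes to searvey-style column names
PARAMETER_NAMES = {
    '00035': 'wind_speed',
    '00065': 'gage_height',
}


def get_usgs_station_data(site: str, start: str, end: str):
    """Hypothetical searvey-style getter that leaves dataretrieval output mostly untouched."""
    df, metadata = nwis.get_iv(sites=site, start=start, end=end)
    # rename parameter-code columns to descriptive names; keep everything else as-is
    df = df.rename(columns=lambda col: PARAMETER_NAMES.get(col, col))
    return df, metadata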

cheginit commented 1 year ago

@mroberge, thanks for mentioning HyRiver. As Martin said, PyGeoHydro includes a class called NWIS that provides access to several NWIS endpoints (you can check out this example notebook). Also, I developed robust and performant engines for working with web services (AsyncRetriever and PyGeoOGC), so feel free to explore them and let me know if you need any help.

SorooshMani-NOAA commented 1 year ago

@cheginit I learned about your toolset a couple of weeks ago when working on a different project. Your software stack is very impressive and useful; however, since searvey is focused on giving access to the original data from the source at the lowest level, it makes more sense to use minimal packages like dataretrieval. With that being said, I'm looking forward to using your software stack in other projects.

SorooshMani-NOAA commented 1 year ago

@brey, @pmav99, @saeed-moghimi-noaa, if you haven't already, I highly recommend reading this summary by @mroberge: https://github.com/mroberge/hydrofunctions/issues/79. (mentioned in https://github.com/oceanmodeling/searvey/issues/14#issuecomment-1346882195)

After that I'd like us to re-evaluate why we want to add USGS support within searvey. My take is:

I'm just thinking out loud, but given the above (as opposed to what I said to @pmav99 the other day), maybe it makes more sense to follow the original plan of using the dataretrieval package, and just treat its return values as the original data from the source.

What do you think?

saeed-moghimi-noaa commented 1 year ago

@SorooshMani-NOAA

What you suggested makes sense. I am fine with that. However, I will let @brey and @pmav99, as the lead developers of searvey, have the final say.

Thanks,

brey commented 1 year ago

After the discussion with @SorooshMani-NOAA a few days back, and seeing his progress (!) using dataretrieval, let's go with that. Thanks Soroosh.

I will close this issue and we can open more specific ones if needed during the implementation.