supermanzer / NDBC

Repository for housing Python code for fetching, parsing, and loading NDBC data into a local Python object for analysis.
MIT License

Integrate Realtime Data features #21

Open supermanzer opened 3 years ago

supermanzer commented 3 years ago

Realtime data files have not been QC'd but may prove useful to researchers wanting to examine more up-to-date data and perform their own quality control methods.

URL pattern appears to follow https://www.ndbc.noaa.gov/data/5day2/{station_id}_5day.txt format.

Due to the difference in sampling frequency and the lack of quality control standards, the realtime data should be returned as a separate attribute in the DataBuoy.data property.
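A rough sketch of that idea, assuming the URL pattern above holds. The helper `build_5day_url` and the dictionary keys (`qc`, `realtime`) are hypothetical names for illustration, not part of the current package:

```python
# Assumed pattern, taken from the URL noted above (not verified for all stations).
FIVE_DAY_URL = "https://www.ndbc.noaa.gov/data/5day2/{station_id}_5day.txt"


def build_5day_url(station_id):
    """Return the presumed 5-day realtime URL for a station id."""
    return FIVE_DAY_URL.format(station_id=station_id)


# Hypothetical layout: realtime records sit under their own key so they are
# never mixed with the quality-controlled data.
data = {
    "qc": None,  # placeholder for the existing QC'd data
    "realtime": {"url": build_5day_url("41002"), "records": None},
}

print(data["realtime"]["url"])
```

Keeping the realtime records under a dedicated key would let researchers apply their own quality control without touching the QC'd data.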

kylejb commented 3 years ago

Would you like some help with this feature?

supermanzer commented 3 years ago

@kylejb Sure! I've had a few requests for this but haven't managed to get to it just yet.

kylejb commented 3 years ago

TL;DR

Two questions to make sure I'm on the same page with you before digging deeper:

  1. Question to confirm endpoint: Realtime2 and/or 5day2.
  2. Question to confirm output objective: are we conforming data to fit the attributes as defined in DataBuoy.data property?
    • Should I treat DataBuoy.data as immutable and conform to that?

Clarifying Questions

  1. The Realtime2* data that comes from the following endpoint: https://www.ndbc.noaa.gov/data/realtime2/ undergoes automatic QC. Would you like to incorporate that instead?

URL pattern appears to follow https://www.ndbc.noaa.gov/data/5day2/{station_id}_5day.txt format.

This seems to be different from Realtime, so I just wanted to make sure which endpoint is the feature of interest.

  2. Are we conforming data to fit the attributes as defined in DataBuoy.data property? In other words, should I treat DataBuoy.data as immutable and conform output to your predefined structure?

*Footnote:

According to their documentation, here are the details for Realtime2:

The Realtime directory... contains the current (last 45 days) data. The term Realtime refers to the version of the data. In general, Realtime data are the data that have undergone automated quality control checks as they were received in real time and released on the Global Telecommunications System (GTS). The files are named station_id.datatype.

As an example, the following files would apply to station 41002,

  • the standard meteorological data is in 41002.txt
  • the continuous winds data is in 41002.cwind
  • the spectral wave summary data is in 41002.spec
  • the raw spectral wave data is in 41002.data_spec
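To make the naming scheme concrete, here is a small sketch that builds Realtime2 URLs from the file extensions listed above. It assumes the files sit directly under the realtime2 directory; the function name and the dictionary keys are made up for illustration:

```python
REALTIME2_BASE = "https://www.ndbc.noaa.gov/data/realtime2/"

# Extensions copied from the station 41002 example; the keys are hypothetical labels.
REALTIME2_EXTENSIONS = {
    "standard_meteorological": "txt",
    "continuous_winds": "cwind",
    "spectral_wave_summary": "spec",
    "raw_spectral_wave": "data_spec",
}


def realtime2_url(station_id, datatype):
    """Build a URL following the station_id.datatype naming convention."""
    extension = REALTIME2_EXTENSIONS[datatype]
    return f"{REALTIME2_BASE}{station_id}.{extension}"


for datatype in REALTIME2_EXTENSIONS:
    print(realtime2_url("41002", datatype))
```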

You will also find a ton of other useful information in the linked PDF, if you haven't seen it already. For example, there are endpoints for weather data types (e.g., wave metrics), which offer another way of segmenting data – an alternative to doing it by Buoy!

supermanzer commented 3 years ago

In response to clarifying questions:

  1. Realtime2 seems like what we are after, closer to the request made in #20
  2. I would appreciate all data being returned to be an attribute of the .data property. One of the reasons I had not got to this yet is I had not settled on exactly how I thought Realtime vs Summary data packages should be stored to ensure it is clear to researchers which sets of data are which.
    • I was torn between splitting all data into Realtime or Summary groups or having both classes of measurements associated with each data package.
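For comparison, the two layouts could look roughly like this. The package labels (`stdmet`, `cwind`) and the `regroup` helper are placeholders for illustration, not the package's actual structure:

```python
# Option A: split by version first (Realtime vs Summary), then by data package.
by_version = {
    "realtime": {"stdmet": [], "cwind": []},
    "summary": {"stdmet": [], "cwind": []},
}

# Option B: each data package carries both classes of measurements.
by_package = {
    "stdmet": {"realtime": [], "summary": []},
    "cwind": {"realtime": [], "summary": []},
}


def regroup(by_version):
    """Convert layout A (version -> package) into layout B (package -> version)."""
    out = {}
    for version, packages in by_version.items():
        for package, records in packages.items():
            out.setdefault(package, {})[version] = records
    return out
```

Since one layout can be derived from the other, the choice mostly affects which grouping researchers see first.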

Response to Footnote:

Thanks for this! I had not had a chance to review it but I'm looking forward to digging into it.

This project grew out of a set of classes I built in MATLAB for my own thesis research. That may explain the Buoy segmentation. I started out with a set of data for one location and then needed to use NDBC stations to help characterize nearshore and regional oceanographic conditions. Hence the focus on buoy stations and the bias to standard meteorological conditions.

kylejb commented 3 years ago

Realtime2 seems like what we are after, closer to the request made in #20

You got it!

I would appreciate all data being returned to be an attribute of the .data property. One of the reasons I had not got to this yet is I had not settled on exactly how I thought Realtime vs Summary data packages should be stored to ensure it is clear to researchers which sets of data are which.

Sounds good. I'll strive to do just that and I'll document whatever may not fit. I'll be as conservative as possible since I'm not a scientist. :)

I was torn between splitting all data into Realtime or Summary groups or having both classes of measurements associated with each data package.

Yeah, I relate to that struggle when working on my surfing application. Intuitively the latter sounds like the right approach... in part due to the inconsistency with the data that I've noticed between data sources. I couldn't reconcile why the same data-type had different results.

Response to Footnote:

Thanks for this! I had not had a chance to review it but I'm looking forward to digging into it.

This project grew out of a set of classes I built in MATLAB for my own thesis research. That may explain the Buoy segmentation. I started out with a set of data for one location and then needed to use NDBC stations to help characterize nearshore and regional oceanographic conditions. Hence the focus on buoy stations and the bias to standard meteorological conditions.

Yeah, I understand. My project*, which is small potatoes compared to what you had to do, fit the Buoy segmentation model too! I think it makes sense for most contexts.

Down the road, perhaps, if we write the code in a particular way, we could make this library extensible to different interfaces? I was actually starting work on refactoring Buoy code* from my project into a Buoy interface to NDBC** when I stumbled on your repo! Would you be opposed to making your package the parent package with the buoy interface being an abstracted dependency? My primary incentive with this idea is to learn/practice building a package; I'll get that done eventually for the experience, so no pressure... just food for thought.


*Project: https://github.com/kylejb/capstone-project_backend (written in Ruby)

**Buoy-py: https://github.com/clairBuoyant/pybuoy

supermanzer commented 3 years ago

The scaffolding you've got built out in buoy-py is interesting. Honestly I started this project by writing a class and some methods in a Python file and this project could likely benefit from a more rigorous scaffolding. I've got a feature branch running for splitting the DataBuoy class up into separate classes that

  1. Handle the HTTP communication to/from the NDBC web server
  2. Handle processing the data returned

I am planning on inheriting both into the DataBuoy class so as to simplify the concerns and maintenance while maintaining backward compatibility.
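A minimal sketch of that split, not the code on the feature branch. The class names `NDBCFetcher` and `NDBCParser`, the method names, and the assumption that header lines start with `#` are all hypothetical:

```python
from urllib.request import urlopen


class NDBCFetcher:
    """Handles HTTP communication with the NDBC web server only."""

    def fetch_text(self, url, timeout=30):
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")


class NDBCParser:
    """Handles processing of the returned text only."""

    def parse_rows(self, text):
        # Assumption: header/comment lines start with '#'; blank lines are skipped.
        return [
            line.split()
            for line in text.splitlines()
            if line.strip() and not line.startswith("#")
        ]


class DataBuoy(NDBCFetcher, NDBCParser):
    """Inherits both concerns so existing callers keep a single entry point."""

    def __init__(self, station_id=None):
        self.station_id = station_id
        self.data = {}

    def load(self, url, key):
        # Fetch, parse, and store the rows under the given key of .data.
        self.data[key] = self.parse_rows(self.fetch_text(url))
        return self.data[key]
```

With this shape the parser can be exercised on plain strings without any network access, which should make it easier to test.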

I would be interested in at least providing the means to extend the DataBuoy class functionality to different interfaces. I have some long-range plans of including some features to allow for wiring up an ORM and writing out to a database, but that is a ways down the line. I'm not familiar with any parent-child relationships between packages, but if my package can be of use to you, go for it.

kylejb commented 2 years ago

Any changes to requirements/objectives since we last spoke? I'm ready to help out again.

kylejb commented 1 year ago

FYI, I released pybuoy with this functionality.

Would this approach be useful for your project? Happy to help integrate it, if you'd like.