ropensci / rrricanes

Web scraper for Atlantic and east Pacific hurricanes and tropical storms
https://docs.ropensci.org/rrricanes

Text Product Refactoring #113

Open timtrice opened 6 years ago

timtrice commented 6 years ago

Currently, rrricanes scrapes the National Hurricane Center's front-end website for tropical cyclone advisory data. Because of this setup, users are not able to download a specific advisory or a set of advisories within a given time period, among other limitations.

For example, if I wanted to download only the advisories for Hurricane Harvey in a given 72-hour period, I would not be able to. I would need to access a list of all tropical cyclones for that period, pass the storm's name to another function that would scrape that storm's archive page for the product, and then wait for all text products to be pulled, parsed, and reformatted into a tidy format.
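Roughly, the current workflow looks like this (a sketch only; the column names and the after-the-fact date filter are assumptions based on the package's current output):

```r
library(rrricanes)
library(dplyr)

# List all Atlantic storms for 2017, then locate Harvey's archive link.
al_2017 <- get_storms(years = 2017, basins = "AL")
harvey_link <- al_2017 %>%
  filter(Name == "Hurricane Harvey") %>%
  pull(Link)

# Every fstadv product must be scraped and tidied before filtering;
# a 72-hour window can only be applied after the fact.
harvey_fstadv <- get_fstadv(harvey_link) %>%
  filter(Date >= as.POSIXct("2017-08-24", tz = "UTC"),
         Date <= as.POSIXct("2017-08-27", tz = "UTC"))
```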

This can be a time-consuming task. It is particularly noticeable when building the monthly releases for rrricanesdata.

The individual text files do exist on the NHC's FTP server. It is assumed these are posted in real time, but this cannot be guaranteed (modified dates appear to match the issue date, but the times are the same for all products: 1900 UTC).

There are two locations for these text products, depending on the storm being accessed. As of this writing (2018-03-24), all storms from 2016 and prior are in the archive directory (ftp://ftp.nhc.noaa.gov/atcf/archive/; see the MESSAGES subdirectory). This directory does not contain storms for the 2017 season; those are located elsewhere:

A list of the "current year's" storms can also be found in the index subdirectory (ftp://ftp.nhc.noaa.gov/atcf/index/).

The most recent position of each storm can be found in the adv subdirectory (ftp://ftp.nhc.noaa.gov/atcf/adv/).
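These directories can be inspected directly from R; a minimal sketch using the curl package (assuming the server still accepts anonymous FTP):

```r
library(curl)

# Read the raw directory listing for the ATCF archive.
con <- curl("ftp://ftp.nhc.noaa.gov/atcf/archive/", open = "r")
listing <- readLines(con)
close(con)
head(listing)
```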

I want to make accessing the FTP server the default, with a fallback to the NHC's front-end website. I do not want to create new functions to handle this. So, perhaps add a parameter users can pass if they explicitly want the front-end. Or, hit the FTP site and, if the product does not exist, revert to the HTML website.
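A rough sketch of that dispatch logic (the helpers `ftp_product_exists()`, `read_ftp_product()`, and `scrape_front_end()` are hypothetical placeholders, not existing rrricanes functions):

```r
get_product <- function(stormid, product, use_ftp = TRUE) {
  # Prefer the FTP server when requested and when the product exists there.
  if (use_ftp && ftp_product_exists(stormid, product)) {
    return(read_ftp_product(stormid, product))
  }
  # Otherwise revert to scraping the NHC front-end website.
  scrape_front_end(stormid, product)
}
```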

Note: FTP links apparently do not work on GitHub under standard markdown, nor as anchor elements.

timtrice commented 6 years ago

FTP ATCF: ftp://ftp.nhc.noaa.gov/atcf/

ATCF Notice: ftp://ftp.nhc.noaa.gov/atcf/NOTICE

ATCF README: ftp://ftp.nhc.noaa.gov/atcf/README

ATCF TROPICAL CYCLONE DATABASE Manual: https://www.nrlmry.navy.mil/atcf_web/docs/database/new/database.html

timtrice commented 6 years ago

The FTP server seems very disorganized, and there is a chance the structure may change, which would break any functionality dependent on it.

At this time I'm going to leave the current handling of text products as-is (though I will still clean up the code, add comments, etc.).

I will add FTP handling as a new set of functions, perhaps adding "ftp" into the function names. For example, `get_fstadv` would have an FTP counterpart, `get_ftp_fstadv`.

For now, this seems to be the best method to ensure previous code works as expected while also giving a second option to obtain even more data.
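For illustration, such a counterpart could be a thin wrapper around the FTP retrieval routine (a sketch only, using the `get_ftp_storm_data` function described in the follow-up below):

```r
# Hypothetical FTP counterpart to get_fstadv, following the proposed
# "ftp" naming convention.
get_ftp_fstadv <- function(stormid) {
  get_ftp_storm_data(stormid, products = "fstadv")
}
```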

timtrice commented 5 years ago

Added some alternate handling of storm data.

`get_storm_list` - Retrieves a listing of all cyclones in a master "database" on the NHC's FTP server. This master database lists all known storms and includes some INVEST and GENESIS systems, though no advisories are issued for these.

This function should help users quickly find a storm by year, name, strength, and so on. It is much faster than the current usage of `get_storms` and should be the preferred method.

NOTE: this data is incomplete; some variables (such as the ending datetime) are NA, and others reflect only the status of the system at the end of its lifespan, not the maximum status achieved. It should only be used to list known cyclones.
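A hedged usage sketch (column names taken from the glimpse output under Examples below; storm names are assumed to be stored uppercase):

```r
library(dplyr)

# Find Harvey (2017) in the master storm list and extract its key.
storm_list <- get_storm_list()
harvey_id <- storm_list %>%
  filter(YYYY == 2017, STORM_NAME == "HARVEY") %>%
  pull(STORMID)
```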

`get_ftp_storm_data` is comparable to `get_storm_data`, except that it takes not a vector of links but, rather, a key (stormid). These are the unique identifiers for every tropical cyclone.

The function will take the stormid and products, access the FTP server and scrape the requested data. It then returns a dataframe.

NOTE: one product request should be passed at a time, and it is encouraged that one key be passed at a time. Currently, there are no time restrictions (as exist with `get_storm_data`). This is because most cyclones will not have more than 80 text statements per product (the NHC's requested limit is 80 requests per 10 seconds). This should become a TODO, but I'm not sure yet how I want to handle it.
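One possible shape for that TODO (not implemented; a crude throttle between keys so total requests stay under the stated limit):

```r
keys <- c("AL092017", "AL142018")

results <- lapply(keys, function(key) {
  df <- get_ftp_storm_data(key, products = "fstadv")
  Sys.sleep(10)  # wait out one rate-limit window between storms
  df
})
```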

Function `get_ftp_dirs` is a helper that retrieves a list of contents from an FTP directory.
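For example (assuming the function takes the directory URL as its argument):

```r
# List the contents of the ATCF index directory.
index_contents <- get_ftp_dirs("ftp://ftp.nhc.noaa.gov/atcf/index/")
```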

Documentation has also been added.

Adding a vignette should be another TODO, but I will wait until I have more time to test the functionality and timing aspects.

Examples

```r
# Load a list of all storms from the FTP's `storm_list` file
storm_list <- get_storm_list()
```
```
Observations: 2,578
Variables: 21
$ STORM_NAME  <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNN...
$ RE          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "...
$ X           <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L"...
$ R2          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R3          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R4          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R5          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ CY          <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6...
$ YYYY        <int> 1851, 1851, 1851, 1851, 1851, 1851, 1852, 1852, 1852, 1852, 1852, 1853, 1853, 1853, 1853, 1853, 1853, 1...
$ TY          <chr> "HU", "HU", "TS", "HU", "TS", "TS", "HU", "HU", "HU", "HU", "HU", "TS", "TS", "HU", "HU", "TS", "HU", "...
$ I           <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"...
$ YYY1MMDDHH  <dttm> 1851-06-25 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-16 00:00:00, 1851-09-13 00:00:0...
$ YYY2MMDDHH  <dttm> 1851-06-28 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-27 18:00:00, 1851-09-16 18:00:0...
$ SIZE        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ GENESIS_NUM <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR1        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PRIORITY    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORM_STATE <chr> "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARC...
$ WT_NUMBER   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORMID     <chr> "AL011851", "AL021851", "AL031851", "AL041851", "AL051851", "AL061851", "AL011852", "AL021852", "AL0318...
```
```r
# Return a dataframe of all fstadv products issued for the respective cyclones
AL092017 <- get_ftp_storm_data("AL092017", products = "fstadv")
AL142018 <- get_ftp_storm_data("AL142018", products = "fstadv")
```