ropensci / rerddap

R client for working with ERDDAP servers
https://docs.ropensci.org/rerddap

feature request: Pacific Decadal Oscillation dataset #101

Closed reblake closed 3 years ago

reblake commented 3 years ago

I just installed this package because I was hoping to programmatically download a table dataset, Pacific Decadal Oscillation (PDO) data, from ERDDAP via this package. Sadly it's not listed as available when I searched with this code:

library(rerddap)
ed_search(query = 'size', which = "table")

An example of the PDO data I'm looking for is here: https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_OC_PDO.htmlTable?&time%3E=1900-01-01&time%3C=2021-09-01

While it's possible to scrape this HTML (what I'm currently doing), it's not a very reproducible method because 1) it relies on the URL not breaking, and 2) it requires updating the dates in the URL itself each time I want to scrape more data.
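
The second problem, hand-editing dates in the URL, can be reduced by building the URL from arguments. The following is a minimal base R sketch, not part of rerddap: `pdo_url()` and its parameter names are hypothetical, and the URL layout is simply copied from the link above.

```r
# Hypothetical helper (not part of rerddap): build the tabledap request URL
# for the PDO dataset from start/end dates instead of editing the string by hand.
pdo_url <- function(start,
                    end = format(Sys.Date(), "%Y-%m-%d"),
                    file_type = "htmlTable",
                    base = "https://oceanview.pfeg.noaa.gov/erddap/tabledap/") {
  # as.Date() fails early on input that is not a recognisable date.
  start <- format(as.Date(start), "%Y-%m-%d")
  end   <- format(as.Date(end), "%Y-%m-%d")
  # %3E= and %3C= are the percent-encoded forms of >= and <=.
  paste0(base, "cciea_OC_PDO.", file_type,
         "?&time%3E=", start, "&time%3C=", end)
}

# Reproduces the URL quoted above.
pdo_url("1900-01-01", "2021-09-01")
```

Leaving `end` at its default requests everything up to the current date, so the call does not need editing between runs. This still depends on the server URL staying stable.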

Hopefully this dataset can be made available via this package in the future.
Thanks!

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xml2_1.3.2      rerddap_0.7.6   rvest_1.0.0     curl_4.3.2      XML_3.99-0.6   
 [6] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_1.4.0    
[11] tidyr_1.1.3     tibble_3.1.2    ggplot2_3.3.5   tidyverse_1.3.1 httr_1.4.2     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.7.10  lattice_0.20-44   assertthat_0.2.1 
 [5] digest_0.6.27     utf8_1.2.1        R6_2.5.1          cellranger_1.1.0 
 [9] backports_1.2.1   reprex_2.0.0      pillar_1.6.2      rlang_0.4.11     
[13] readxl_1.3.1      rnoaa_1.3.4       rstudioapi_0.13   data.table_1.14.0
[17] vegan_2.5-7       Matrix_1.3-3      urltools_1.7.3    splines_4.1.0    
[21] triebeard_0.3.0   munsell_0.5.0     broom_0.7.9       compiler_4.1.0   
[25] modelr_0.1.8      pkgconfig_2.0.3   mgcv_1.8-35       tidyselect_1.1.1 
[29] gridExtra_2.3     httpcode_0.3.0    fansi_0.5.0       permute_0.9-5    
[33] hoardr_0.5.2      crayon_1.4.1      dbplyr_2.1.1      withr_2.4.2      
[37] rappdirs_0.3.3    MASS_7.3-54       crul_1.1.0        grid_4.1.0       
[41] nlme_3.1-152      jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.0  
[45] DBI_1.1.1         magrittr_2.0.1    scales_1.1.1      ncdf4_1.17       
[49] cli_3.0.1         stringi_1.6.2     fs_1.5.0          ellipsis_0.3.2   
[53] generics_0.1.0    vctrs_0.3.8       tools_4.1.0       glue_1.4.2       
[57] hms_1.1.0         parallel_4.1.0    colorspace_2.0-2  cluster_2.1.2    
[61] haven_2.4.3

rmendels commented 3 years ago

@reblake The data is indeed available; that server is just not one of those included in the search list.

library(rerddap)
myURL <- 'https://oceanview.pfeg.noaa.gov/erddap/'
info <- info('cciea_OC_PDO', url = myURL)
info
<ERDDAP info> cciea_OC_PDO 
 Base URL: https://oceanview.pfeg.noaa.gov/erddap 
 Dataset Type: tabledap 
 Variables:  
     PDO: 
         Range: -3.6, 3.51 
     time: 
         Range: -2.2089888E9, 1.627776E9 
         Units: seconds since 1970-01-01T00:00:00Z 
pdo <- tabledap('cciea_OC_PDO', url = myURL)
str(pdo)
Classes ‘tabledap’ and 'data.frame':    1460 obs. of  2 variables:
 $ time: chr  "1900-01-01T00:00:00Z" "1900-02-01T00:00:00Z" "1900-03-01T00:00:00Z" "1900-04-01T00:00:00Z" ...
 $ PDO : num  0.04 1.32 0.49 0.35 0.77 0.65 0.95 0.14 -0.24 0.23 ...
 - attr(*, "datasetid")= chr "cciea_OC_PDO"
 - attr(*, "path")= chr "/Users/rmendels/Library/Caches/R/rerddap/e1e9bdd384a02460fe2c8a60d873f00d.csv"
 - attr(*, "url")= chr "https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_OC_PDO.csv?"

You just have to point 'rerddap' to the correct server.
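
The same call can also carry the date limits from the original question. This is a sketch that assumes `tabledap()` accepts constraint strings such as 'time>=1900-01-01' as unnamed arguments; `fetch_pdo()` is a hypothetical wrapper, not a rerddap function.

```r
# Requires the rerddap package: install.packages("rerddap")

# Server that hosts the dataset (taken from the example above).
oceanview <- "https://oceanview.pfeg.noaa.gov/erddap/"

# Hypothetical wrapper (not part of rerddap): download PDO values
# between two dates, defaulting the end date to today.
fetch_pdo <- function(start,
                      end = format(Sys.Date(), "%Y-%m-%d"),
                      url = oceanview) {
  pdo <- rerddap::tabledap(
    "cciea_OC_PDO",
    paste0("time>=", start),  # constraints are passed as plain strings
    paste0("time<=", end),
    url = url
  )
  # time is returned as ISO 8601 text in the str() output above;
  # keep only the date part and convert it to Date.
  pdo$time <- as.Date(substr(pdo$time, 1, 10))
  pdo
}

# Network access is needed, so only run the download interactively.
if (interactive()) {
  pdo <- fetch_pdo("1900-01-01")
  print(head(pdo))
}
```

Because the end date defaults to the current date, rerunning the script picks up new monthly values without changing any code.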

reblake commented 3 years ago

Wow @rmendels, thanks! That's great news, and saves me 15 lines of code. I could not find this documented anywhere in the docs for the package at https://docs.ropensci.org/rerddap/articles/Using_rerddap.html, and I was careful to search before even considering posting an issue here. However, if there is documentation of this case that I missed, could you please post the link? Without my desired dataset listed as an available server in ed_search() or servers(), and without your above method documented in the package docs, how is a user supposed to understand what data is accessible via this package?
Thanks again.

rmendels commented 3 years ago

@reblake I am unclear on what you are saying. The 'tabledap()' function is indeed documented and one of the vignettes gives many examples of using it (https://docs.ropensci.org/rerddap/articles/Using_rerddap.html), including setting which ERDDAP the data is on (look at the IOOS Glider example). The vignettes are no longer included in the package because building them was creating too many problems with CRAN checks, but links to the vignettes are given.

There are something like 90-100 ERDDAPs worldwide, so it is difficult to document how to find a particular dataset in all of those ERDDAPs. There is the site http://erddap.com/#search but being included in that is voluntary, and if you search for PDO there it does not come up. But that is why we (ERD) provide extensive support, though I can't say that any of us know all of the data on all of the ERDDAPs.

If I am feeling motivated I may add a search capability that uses http://erddap.com/#search, but even more external access can make for difficulties with passing all of the CRAN tests. But can I point out that you did find the oceanview ERDDAP and that it had the PDO, so it was more a case of understanding how to use 'tabledap()'.

reblake commented 3 years ago

@rmendels, I'm sorry I wasn't clear. I found the PDO data via a web browser by pointing and clicking. I did not find it when searching using the rerddap package. So, no, my issue was not a case of "understanding how to use tabledap()".

Let me try to rephrase my question from above. How is a user supposed to figure out if the dataset they want to access is accessible via the rerddap package if it is not listed in either servers() or ed_search()?
Example: I looked for the PDO dataset using these functions, but didn't find it, so I concluded it wasn't accessible via the rerddap package and stopped there. There was no point in trying any other functions in the package (e.g., tabledap()) if the data weren't available via the package.

My second point about the documentation boils down to this. I couldn't find a mention of servers() or ed_search() returning incomplete results of what is accessible via this package in the package documentation here https://docs.ropensci.org/rerddap/index.html, nor could I find a method illustrated for finding datasets that don't appear in servers() or ed_search(). Your comment above was the first explanation and illustration of these two points I found.

Hope this is clearer.

rmendels commented 3 years ago

@reblake Thanks. As I said, there are some 90-100 ERDDAPs worldwide, each run by independent organizations. 'rerddap' will work with any ERDDAP if you have the URL.
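
Once a server URL is known, the package's search functions can be pointed at it as well. This sketch assumes `ed_search()` and `ed_datasets()` accept a `url` argument and that the search result carries an `info` data frame. `search_server()` is a hypothetical wrapper, and whether a query for "PDO" actually returns cciea_OC_PDO depends on that server's index.

```r
# Requires the rerddap package: install.packages("rerddap")

# An ERDDAP that is not in the package's built-in server list.
oceanview <- "https://oceanview.pfeg.noaa.gov/erddap/"

# Hypothetical wrapper (not part of rerddap): full-text search on one
# specific server, returning the table of matching titles and dataset IDs.
search_server <- function(query, url, which = "tabledap") {
  res <- rerddap::ed_search(query = query, which = which, url = url)
  res$info  # assumed: data frame of titles and dataset IDs
}

# Network access is needed, so only run the queries interactively.
if (interactive()) {
  print(search_server("PDO", url = oceanview))
  # List every tabular dataset the server offers.
  print(rerddap::ed_datasets(which = "tabledap", url = oceanview))
}
```

A dataset ID found this way can be passed to `info()` or `tabledap()` together with the same `url`.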

That of course raises the question of how to find the different ERDDAPs. The 'rerddap' functions that list the institutions and servers include only the ones we have permission to list. People have different reasons for not being included in the list; sometimes it is just because they haven't bothered to be included. But either way we will not list a server without explicit permission.

As for your statement "the data weren't available via the package", I am sorry, but that is a misleading statement. The package is not limited to the ERDDAPs listed in its functions; it works with any ERDDAP server, including the one at https://oceanview.pfeg.noaa.gov/erddap . That is a different question than how to find that ERDDAP. Clearly you did find it.

I don't see that this discussion is going anywhere so I am closing the issue.