noaa-onms / cinms

Channel Islands National Marine Sanctuary
https://noaa-onms.github.io/cinms
MIT License
3 stars 1 forks source link

duplication problem in sst data #57

Closed superjai closed 3 years ago

superjai commented 3 years ago

Hey @bbest - I have run into a strange duplication problem in the SST data. I thought I would make you aware of it.

In the SST data, here's the data file, I noticed that the February and March data for any given year are duplicates of each other. I thought at first that there was some bug in the nms4r code, but I went through the code carefully and I don't think it is a problem with the code we wrote. I think instead that it is either a data problem or a problem with the rerddap package. Using rerddap::griddap, I pulled the SST raster for the Channel Islands area and here is what it looks like. You can see that Feb and March are the same.

This one is for Feb 2003:

nc <-    rerddap::griddap(
  rerddap::info("jplMURSST41mday"),
  url = "https://coastwatch.pfeg.noaa.gov/erddap/",
  time = c("2003-02-01", "2003-02-28"),
  latitude = c(33.36241, 34.20707) , longitude = c(-118.9071, -120.6421),
  fields = "sst", fmt = 'nc')
r <- raster::raster(nc$summary$filename)
raster::plot(r)

feb_2003

This one is March 2003: march_2003

And here is April 2003: april_2003

However, if I check out the raster on the erddap data portal, it looks quite different.

Here is Feb 2003. Screen Shot 2020-11-01 at 7 34 21 PM

March 2003 Screen Shot 2020-11-01 at 7 34 49 PM

April 2003 Screen Shot 2020-11-01 at 7 35 07 PM

bbest commented 3 years ago

Deep dive debug by git cloning https://github.com/ropensci/rerddap into ~/github/rerddap and inserting browser() at line 47 of R/grid.R:

devtools::load_all("~/github/rerddap")

nc <-    rerddap::griddap(
  rerddap::info("jplMURSST41mday"),
  url = "https://coastwatch.pfeg.noaa.gov/erddap/",
  time = c("2003-02-01", "2003-02-28"),
  latitude = c(33.36241, 34.20707) , longitude = c(-118.9071, -120.6421),
  fields = "sst", fmt = 'nc')
r <- raster::raster(nc$summary$filename)
raster::plot(r)
Browse[1]> 
debug at /Users/bbest/github/rerddap/R/grid.R#48: resp <- erd_up_GET(url = sprintf("%sgriddap/%s.%s", url, d, fmt), 
    dset = d, args = args, store = store, fmt = fmt, callopts)
Browse[2]> sprintf("%sgriddap/%s.%s", url, d, fmt)
[1] "https://upwell.pfeg.noaa.gov/erddap/griddap/jplMURSST41mday.nc"
Browse[2]> d
[1] "jplMURSST41mday"
Browse[2]> args
[1] "sst[(2003-02-01):1:(2003-02-28)][(33.36241):1:(34.20707)][(-120.6421):1:(-118.9071)]"
Browse[2]> store
$store
[1] "disk"

$path
[1] "/Users/bbest/Library/Caches/R/rerddap"

$overwrite
[1] TRUE

Browse[2]> fmt
[1] "nc"
Browse[2]> callopts
list()

See Vignette generated by usethis::use_vignette("debug_cinms-issue57_erddap") in Articles of nms4r:

superjai commented 3 years ago

Hey @bbest - I did some digging and it looks like the data duplication problem is occurring somewhere on the ERDDAP server side of things. I have emailed Bob Simons at NOAA to let him know. The rerddap::griddap function seems to be forming urls properly.

Let me walk you through my process. Following your suggestion, I examined the URLs created by rerddap::griddap that are passed along to ERDDAP servers. With a date of February 2003, I tried the following:

nc <-    rerddap::griddap(
  rerddap::info("jplMURSST41mday"),
  url = "https://coastwatch.pfeg.noaa.gov/erddap/",
  time = c("2003-02-01", "2003-02-28"),
  latitude = c(33.36241, 34.20707) , longitude = c(-118.9071, -120.6421),
  fields = "sst", fmt = 'nc')

And rerddap::griddap created the following url to send along:

"https://upwell.pfeg.noaa.gov/erddap/griddap/jplMURSST41mday.nc?sst[(2003-02-01):1:(2003-02-28)][(33.36241):1:(34.20707)][(-120.6421):1:(-118.9071)]"

With a date of March 2003, I tried the following:

nc <-    rerddap::griddap(
  rerddap::info("jplMURSST41mday"),
  url = "https://coastwatch.pfeg.noaa.gov/erddap/",
  time = c("2003-03-01", "2003-03-31"),
  latitude = c(33.36241, 34.20707) , longitude = c(-118.9071, -120.6421),
  fields = "sst", fmt = 'nc')

And got the following response:

"https://upwell.pfeg.noaa.gov/erddap/griddap/jplMURSST41mday.nc?sst[(2003-03-01):1:(2003-03-31)][(33.36241):1:(34.20707)][(-120.6421):1:(-118.9071)]"

You can see that the URLs from griddap seem to be being formed correctly.

I then downloaded data files directly from the ERDDAP data access form for jplMURSST41mday. For Feb (February 1-28, 2003) and March (March 1-31, 2003), I downloaded the SST data (in csv form) for the rectangle defined by the latitudes and longitudes shown above, like in the example below: Screen Shot 2020-11-05 at 11 32 37 PM Both of these csv files are saved to the cinms/data/oceano folder for now (see commit just before this comment, where I uploaded the files).

If you check out the file for march, you'll see that it contains the data for february (despite the specified date range going only from March 1 - March 31). I am thinking this error has something to do with the short length of the month of Feb. When I pull the csv for March 2 - March 31, then I get the correct data(no feb data).

superjai commented 3 years ago

Hey @bbest - I heard from Bob Simons. This isn't a bug exactly, as far as he is concerned. Here is his response.


If you look all of at the time values for this dataset (which has monthly data) https://coastwatch.pfeg.noaa.gov/erddap/griddap/jplMURSST41mday.htmlTable?time%5B(2002-06-16):1:(2020-10-16)%5D you'll see that the time values are all the 16th of the month -- a stand-in for the middle of the month. So if you want data for March, ask for data from March 16.

When you ask for an axis (e.g., time) value from a gridded dataset in ERDDAP, ERDDAP returns the closest available value. See https://coastwatch.pfeg.noaa.gov/erddap/griddap/documentation.html#parentheses So if you ask for data for March 1, you will get data from Feb 16, which is closer than Mar 16.

I hope that makes sense.

superjai commented 3 years ago

Hey @bbest - okay I think I have now fixed this issue. I have updated the data csv files and interactive rmds.

It is strange that the ERDDAP data range that you ask for is not actually the data range that you necessarily get. At any rate, to make sure that this isn't a problem going forward, I changed the date slice called for by ply2erddap. Instead of the date slice being the entire month, I have narrowed the date slice down to be just the day of the month which is the data output for that dataset. Like so:

  if (erddap_id == "nesdisVHNSQchlaMonthly") {
    m_date <- lubridate::ymd(glue::glue("{year}-{month}-01"))
  } else if (erddap_id == "jplMURSST41mday" || erddap_id == "erdMWchlamday") {
    m_date <- lubridate::ymd(glue::glue("{year}-{month}-16"))
  } else {
    stop("Error in erddap_id: the function ply2erddap only currently knows how to
         handle the datasets nesdisVHNSQchlaMonthly, jplMURSST41mday, and erdMWchlamday")
  }
  m_dates <- c(m_date, m_date)

As I think I said above, I then reran the nms4r functions for calculating the data tables, rerendered the interactive rmd windows, and then pushed all of that up. So, I think that takes care of this issue.