ropensci / rerddap

R client for working with ERDDAP servers
https://docs.ropensci.org/rerddap

AVHRR night SST including date not in range #93

Closed schckngs closed 3 years ago

schckngs commented 3 years ago

Hi, I'm using rerddap to download AVHRR Pathfinder nighttime SST (nceiPH53sstn1day). I have been querying and saving the data in one-month chunks, and for some date ranges that start at the beginning of a year the result includes the last day of the previous year.

One example is Jan 2001:

library(rerddap)
sstInfo <- info("nceiPH53sstn1day")
avhrr <- griddap(sstInfo, latitude = c(10, 60), longitude = c(-160, -120), time = c("2001-01-01", "2001-01-31"), fields = c("sea_surface_temperature", "quality_level"))

unique(avhrr$data$time) returns:

[1] "2000-12-31T12:00:00Z" "2001-01-02T12:00:00Z" "2001-01-03T12:00:00Z" "2001-01-04T12:00:00Z"
 [5] "2001-01-05T12:00:00Z" "2001-01-06T12:00:00Z" "2001-01-07T12:00:00Z" "2001-01-08T12:00:00Z"
 [9] "2001-01-09T12:00:00Z" "2001-01-10T12:00:00Z" "2001-01-11T12:00:00Z" "2001-01-12T12:00:00Z"
[13] "2001-01-13T12:00:00Z" "2001-01-14T12:00:00Z" "2001-01-15T12:00:00Z" "2001-01-16T12:00:00Z"
[17] "2001-01-17T12:00:00Z" "2001-01-18T12:00:00Z" "2001-01-19T12:00:00Z" "2001-01-20T12:00:00Z"
[21] "2001-01-21T12:00:00Z" "2001-01-22T12:00:00Z" "2001-01-23T12:00:00Z" "2001-01-24T12:00:00Z"
[25] "2001-01-25T12:00:00Z" "2001-01-26T12:00:00Z" "2001-01-27T12:00:00Z" "2001-01-28T12:00:00Z"
[29] "2001-01-29T12:00:00Z" "2001-01-30T12:00:00Z" "2001-01-31T12:00:00Z"

This only happens for some years. I thought it was following leap years, but the same query for 2005 does not grab 2004-12-31. If it were a dateline issue, I wouldn't expect it to grab the whole image, since in this case 2001-01-01 has no data (or maybe this date is getting mislabelled?).

Cheers.

Session Info

```r
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/Los_Angeles
 date     2020-10-28
```

sckott commented 3 years ago

Thanks for the question @schckngs

@rmendels Do you have any insight here? It seems that at least in the case of this dataset there's no data on the 1st day of January of every year

rmendels commented 3 years ago

@sckott @schckngs First I checked that the data were there for, say, 2005-01-01. They are; see:

https://coastwatch.pfeg.noaa.gov/erddap/griddap/nceiPH53sstn1day.graph?sea_surface_temperature%5B(2005-01-01T12:00:00Z)%5D%5B(89.97917):(-89.97916)%5D%5B(-179.9792):(179.9792)%5D&.draw=surface&.vars=longitude%7Clatitude%7Csea_surface_temperature&.colorBar=%7C%7C%7C%7C%7C&.bgColor=0xffccccff

Then I looked at the output and saw that the times for this dataset are centered times, that is, 2001-01-31T12:00:00Z, not 2001-01-31T00:00:00Z. When you make a request where one or more values is not on the grid (in your request, neither endpoint is on the grid), ERDDAP changes each one to the "nearest" grid point. In this case there is a tie; ERDDAP has a rule for breaking ties, and it is not behaving as expected. For 2001, the data for 2001-01-01 are indeed missing (I just checked our files, and I will see if something went wrong in the downloading). So the closest date to the one you requested is indeed the 2000-12-31T12:00:00Z data.

If you can give me other years where there are problems, I will check, but I would also suggest using code such as:

library(rerddap)
sstInfo <- info("nceiPH53sstn1day")
avhrr <- griddap(sstInfo, latitude = c(10,60), longitude = c(-160,-120), time = c("2005-01-01T12:00:00Z", "2005-05-31T12:00:00Z"),  fields = c("sea_surface_temperature","quality_level"))

first and see if the problem persists, and then send me the years that are not working.

HTH
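
To illustrate the snapping described in this comment, here is a minimal base R sketch. It is not ERDDAP's actual implementation: the grid is a hypothetical set of centered daily time stamps with 2001-01-01 left out, `nearest_time()` is an illustrative helper that is not part of rerddap, and ERDDAP's tie-breaking rule is not modelled.

```r
# Hypothetical grid of centered (noon UTC) daily time stamps.
# 2001-01-01 is deliberately left out to mimic the missing file.
grid <- as.POSIXct(
  c("2000-12-30 12:00:00", "2000-12-31 12:00:00",
    "2001-01-02 12:00:00", "2001-01-03 12:00:00"),
  tz = "UTC"
)

# Illustrative helper (an assumption, not rerddap or ERDDAP code):
# return the grid time with the smallest absolute distance to the request.
nearest_time <- function(requested, grid) {
  gaps <- abs(as.numeric(difftime(grid, requested, units = "hours")))
  grid[which.min(gaps)]
}

# A date-only request such as "2001-01-01" corresponds to midnight UTC.
requested <- as.POSIXct("2001-01-01 00:00:00", tz = "UTC")
snapped <- nearest_time(requested, grid)
format(snapped, "%Y-%m-%dT%H:%M:%SZ")
```

With the 2001-01-01T12:00:00Z step absent, the requested start is 12 hours from 2000-12-31T12:00:00Z and 36 hours from 2001-01-02T12:00:00Z, so the earlier time is selected.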

rmendels commented 3 years ago

@sckott @schckngs

I should add that I just did a quick check. January 1 is missing for 2001 and 2003, and is there for the other years. Tomorrow I will try to figure out why those files are not there, but those should be the years where you see the behavior you described.

HTH

rmendels commented 3 years ago

@sckott @schckngs

Okay, I went back to the source. The Jan 01 2003 files were there, so I added them to our files; it may take some time for ERDDAP to recalibrate the times in the dataset. The Jan 01 2001 files, however, are also missing at the source (NCEI). I will try to notify them, but we usually don't have much sway on these matters.

HTH

schckngs commented 3 years ago

Thanks for the details and for checking. I confirmed today that this is only happening for 2001 and 2003. I will add a time stamp to the griddap() query in my script.

This dataset should be updated very soon, so I wonder if any missing images will be included in that.
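
One way to build such time stamps for monthly chunks is sketched below in base R. `month_time_range()` is a hypothetical helper, not part of rerddap, and the noon (T12:00:00Z) suffix assumes the centered times reported for this dataset earlier in the thread.

```r
# Hypothetical helper (not part of rerddap): build the first and last
# centered time stamps of a calendar month as ISO 8601 strings.
month_time_range <- function(year, month) {
  first <- as.Date(sprintf("%04d-%02d-01", year, month))
  # The first day of the following month, minus one day, is the month's last day.
  last <- seq(first, by = "month", length.out = 2)[2] - 1
  paste0(format(c(first, last), "%Y-%m-%d"), "T12:00:00Z")
}

jan2001 <- month_time_range(2001, 1)
jan2001
```

The resulting character vector could then be passed as the `time` argument of `griddap()`. If the first grid point itself is missing, the request can still snap to a neighbouring day, so checking the returned times remains worthwhile.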

rmendels commented 3 years ago

@sckott @schckngs

It looks like in the last couple of days they included data through September (we have it through June). I am in the midst of downloading the new data. It may take a few days for everything to happen that is needed to add the new data to the service.

schckngs commented 3 years ago

Great, I will look out for the update! If I notice any images that are missing in the ERDDAP dataset but exist on the NCEI site I can pass that along.

sckott commented 3 years ago

thanks for your help @rmendels

rmendels commented 3 years ago

@sckott - not a problem. We take seriously helping people get the data they want, and fixing things as quickly as possible when things aren't right.

rmendels commented 3 years ago

@schckngs @sckott

For whatever reason the downloads are taking forever; right now the server has data up through 9/19. My guess is that it should be through September by tomorrow morning. I will check then and force a reload if the files are there and haven't yet shown up on the server.

schckngs commented 3 years ago

@rmendels

Thanks, I will check again next week!

So as for the original issue, are we considering it closed? I suppose the original problem was the missing data, but regardless, the "centered times" make it possible to retrieve data from before the time period being searched.

It's easy to prevent or remove data from a date prior to what is being queried, but for gappier datasets / smaller regions I would think this is likely to pop up again.
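
A minimal base R sketch of removing such rows after the download is shown below. It assumes the returned data frame has a `time` column of ISO 8601 strings, as in the `unique(avhrr$data$time)` output earlier in the thread; `dat` is a small mock of that data frame with placeholder values, and `parse_utc()` is an illustrative helper.

```r
# Mock of avhrr$data: only the time column matters here,
# and the value column is a placeholder.
dat <- data.frame(
  time = c("2000-12-31T12:00:00Z", "2001-01-02T12:00:00Z", "2001-01-31T12:00:00Z"),
  value = NA_real_,
  stringsAsFactors = FALSE
)

# Illustrative helper: parse the ISO 8601 strings as UTC date-times.
parse_utc <- function(x) as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Keep only rows inside the month that was intended (January 2001).
start <- as.POSIXct("2001-01-01 00:00:00", tz = "UTC")
end <- as.POSIXct("2001-02-01 00:00:00", tz = "UTC")
t <- parse_utc(dat$time)
kept <- dat[t >= start & t < end, , drop = FALSE]
kept$time
```

Using a half-open interval (`>= start`, `< end`) means consecutive monthly chunks never share a time step.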

rmendels commented 3 years ago

@schckngs @sckott

I would say closed. I know that this can sound like just brushing you off, but there is always the option to get the times first, to check the ranges. The behavior in ERDDAP won't change. If you think about it, what to do when a request doesn't fall on a grid point is a difficult problem, and there are several choices you can make, each of which has problems. There are a lot of different reasons for not being on a grid point, and ERDDAP doesn't know the reason. Even more, it is difficult for a service to know that a data point is missing (though it is obvious to humans), since not all datasets are evenly spaced. And the behavior you are seeing will only happen if one of the (start, end) times involves the missing data.

I would add that the second rerddap vignette has a section at the end that clearly discusses what happens when you make a request. The idea is quite simple: find the "nearest" grid point on one end and start the extract there, find the "nearest" grid point on the other end and stop the extract there, and then extract every "stride" point, where "stride" is defined in array-space. (If "stride" were in coordinate space, you could have a situation where none or almost none of the requested points fall on a grid point, and then you have an interpolation problem that we feel the user should handle, because the choice of algorithm can vary so widely.)
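
The extract logic described here can be sketched in base R as follows. This is an illustration only, not ERDDAP's or rerddap's code: `axis` is a hypothetical, unevenly spaced coordinate axis, `extract_axis()` is an illustrative helper, and ties between equally near grid points are not handled the way ERDDAP handles them.

```r
# Hypothetical, unevenly spaced coordinate axis (values 3 and 8 are absent).
axis <- c(0, 1, 2, 4, 5, 6, 7, 9, 10)

# Illustrative helper (an assumption, not ERDDAP code):
# snap both ends to the nearest grid point, then step by array position.
extract_axis <- function(axis, start, stop, stride = 1) {
  i0 <- which.min(abs(axis - start))  # index of the grid point nearest to start
  i1 <- which.min(abs(axis - stop))   # index of the grid point nearest to stop
  axis[seq(i0, i1, by = stride)]      # stride counts array positions, not units
}

picked <- extract_axis(axis, start = 2.9, stop = 9.2, stride = 2)
picked
```

Here the request snaps to the grid values 2 and 9 (array positions 3 and 8), and a stride of 2 picks positions 3, 5, and 7, so the spacing of the returned coordinate values follows the grid rather than a fixed coordinate step.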

Given how long ERDDAP has been around, it will not change this choice, because that would make past extracts non-reproducible, and our ERDDAP alone gets between 500,000 and 1,000,000 requests a day.

And a PS to @sckott - I know you are probably very busy, but a nice feature to add to rerddap that I think I have mentioned before is to be able to extract selected coordinate values independent of the data. But then you'd have to re-submit to CRAN. I just got snarked by Ripley, but I hit back. I am 70, 14 years past when I could retire, I don't need to take that from anyone.

sckott commented 3 years ago

> extract selected coordinate values independent of the data.

not sure I follow. if there's not an open issue, open one up so we can discuss in a separate issue

rmendels commented 3 years ago

@sckott the issue that Andrea raised should be closed. rerddap was correct, and I explained to her why. I did suggest she could also check the times first, but right now you can't get just the coordinate values in rerddap. I just opened a new issue for that.

sckott commented 3 years ago

okay, thanks