pacificclimate / pdp

The PCIC Data Portal - Server software to run the entire web application
GNU General Public License v3.0

Accessing downscaled_gcms using NetCDF from R (or Python) gives unexpected results. #94

Open ConorIA opened 6 years ago

ConorIA commented 6 years ago

Hello. This may be the wrong place to post this, but I landed here via the FAQ in the user docs for the data portal.

I've been trying from both Python and R to access the downscaled_gcms catalogue programmatically. However, the values that I am getting out are incorrect. I might be missing an authentication step somewhere, but it seems the OpenID login page at https://data.pacificclimate.org/esg-orp/j_spring_openid_security_check.htm is no longer online.

The steps to reproduce (in R at least) are here:

library(ncdf4)
library(ncdf4.helpers)

# Open the remote file directly over OPeNDAP
nc <- nc_open('https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc')

time <- nc.get.time.series(nc, "tasmin")
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")

I am looking for data from 1981-01-01 from Toronto (43.67, -79.40)

# Indices of the target date and of the grid cell nearest Toronto
time_cell <- which(as.character(time) == "1981-01-01")
lon_cell <- which.min(abs(lon - -79.40))
lat_cell <- which.min(abs(lat - 43.67))

print(sprintf("Getting data starting at T = %i, X = %i, and Y = %i", time_cell, lon_cell, lat_cell))
#> [1] "Getting data starting at T = 11316, X = 740, and Y = 33"

tasmax_1 <- nc.get.var.subset.by.axes(nc, "tasmax", list(T = time_cell:length(time), X = lon_cell, Y = lat_cell))
plot(x = 1:length(tasmax_1), tasmax_1, ylim = c(-1, 1))

That definitely doesn't look right, so I'll try the whole time series for the cell.

tasmax_2 <- nc.get.var.subset.by.axes(nc, "tasmax", list(T = 1:length(time), X = lon_cell, Y = lat_cell))
plot(x = 1:length(tasmax_2), tasmax_2, ylim = c(-20, 40))

Still not right. This is not, however, the case when I use the online map to select data points. For instance (from bash):

hget "https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmax[0:55115][32:32][739:739]&"
mv 'pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmax[0:55115][32:32][739:739]&' 'pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc'
# back in R:
downloaded_nc <- nc_open("/home/conor/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc")

tasmax_3 <- ncvar_get(downloaded_nc, "tasmax")

plot(x = 1:length(tasmax_3), tasmax_3)

I must be missing something. Any advice would be greatly appreciated.

EDIT: I had mixed tasmin and tasmax in my original reprex.
EDIT2: Is it typical to get < 500 B/s download speeds from the data portal? I'm getting ~649 KB over 25 minutes.

corviday commented 5 years ago

Just to confirm, you resolved the original question?

As far as download speeds go, the data portal is under pretty heavy use right now, and downloads are currently horribly slow.

ConorIA commented 5 years ago

Thanks for your comment. I have not resolved the question; I just fixed the reprex so that I was consistent in the variables being queried. To summarize: can the netCDF files be accessed directly, or must they first be downloaded locally? Only the first few time steps return valid data when accessed directly from Python or R.
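
For reference, the symptom condenses to something like the following sketch (the cell indices are the ones computed in the reprex above; for this file, ncdf4 takes start/count in X, Y, T order):

library(ncdf4)

nc <- nc_open("https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc")

# The first few time steps come back as plausible temperatures...
head_vals <- ncvar_get(nc, "tasmax", start = c(740, 33, 1), count = c(1, 1, 10))

# ...but a request starting at the 1981-01-01 index does not
tail_vals <- ncvar_get(nc, "tasmax", start = c(740, 33, 11316), count = c(1, 1, 10))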

jameshiebert commented 5 years ago

I just spent a few minutes looking at this and something is definitely out of sorts. Your use case of accessing the datasets directly from R or Python over OPeNDAP is definitely one of our intended features, so this is a great bug report.

So the difference between accessing "directly" from R/Python and downloading over HTTP is which PyDAP responder gets invoked: the former uses the DODS responder, while the latter uses the NetCDF responder.
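
Concretely, the two paths look something like this (a sketch against the dataset from this thread; the constraint-expression indices are just the ones used above):

library(ncdf4)
library(curl)

# Path 1 (DODS responder): open the bare URL; ncdf4's OPeNDAP client then
# issues .dods requests behind the scenes. This is the path that breaks.
nc <- nc_open("https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc")

# Path 2 (NetCDF responder): append ".nc" plus a constraint expression and
# download the subset as an ordinary NetCDF file. This is the path that works.
dest <- tempfile(fileext = ".nc")
curl_download("https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmax[0:100][32:32][739:739]&", dest)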

Apparently there is a bug in the DODS responder. Watching the logs, I can see the responder get invoked and emit a lot of warnings; then at some point the backend crashes and gunicorn spins up another worker. The client, in turn, complains about not getting all of the data it was promised (presumably the Content-Length header doesn't match what was delivered).

On the client:

> nc = nc_open('https://services.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc')
> x <- ncvar_get(nc, 'pr', start=c(740, 33, 11316), count=c(1, 1, 100))
CURL Error: Transferred a partial file

On the server:

2019-01-31 20:31:15 [17] [INFO] 172.17.0.1 - - [31/Jan/2019:20:31:15 +0000] "GET /downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.dods?time HTTP/1.1" 200 441083 "-" "oc4.4.0"
/usr/local/lib/python2.7/dist-packages/gevent/builtins.py:96: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  result = _import(*args, **kwargs)
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:48 [8] [CRITICAL] WORKER TIMEOUT (pid:18)
2019-01-31 20:31:48 [8] [CRITICAL] WORKER TIMEOUT (pid:18)
2019-01-31 20:31:50 [18] [INFO] Worker exiting (pid: 18)
2019-01-31 20:31:50 [18] [INFO] Worker exiting (pid: 18)
2019-01-31 20:31:50 [18] [INFO] Sending a poison pill to the queue runner
2019-01-31 20:31:50 [18] [INFO] Sending a poison pill to the queue runner
2019-01-31 20:31:50 [44] [INFO] Consumer-2: Exiting
2019-01-31 20:31:50 [32] [INFO] Consumer-1: Exiting
2019-01-31 20:31:50 [67] [INFO] Booting worker with pid: 67
2019-01-31 20:31:50 [67] [INFO] Booting worker with pid: 67

This is definitely a serious bug, so we'll try to take some time to look into it further. In the meantime, your workaround is to download the data with the NetCDF handler by constructing an OPeNDAP URL. You can even do this programmatically, pretty easily:

> library(ncdf4)
> library(curl)
> dest <- tempfile(fileext = ".nc")  # destination for the downloaded subset
> curl_download("https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+inmcm4_historical+rcp85_r1i1p1_19500101-21001231.nc.nc?pr[0:100][112:112][158:158]&", dest)
> nc <- nc_open(dest)
> x <- ncvar_get(nc, 'pr')
> range(x)
[1]  0.0000 56.1875
> hist(x, plot=F)
$breaks
[1]  0 10 20 30 40 50 60

$counts
[1] 57 25  9  6  2  2

$density
[1] 0.056435644 0.024752475 0.008910891 0.005940594 0.001980198 0.001980198

$mids
[1]  5 15 25 35 45 55

$xname
[1] "x"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
> x
  [1]  9.50000 12.78125  0.59375  0.71875 21.62500 16.37500 13.12500 11.93750
  [9]  3.12500  1.21875 14.75000 15.50000 12.71875 29.09375  8.68750  6.65625
 [17] 13.50000 39.75000  7.21875 18.59375 30.06250  4.90625 46.93750 25.37500
 [25]  1.78125 12.90625  1.93750  8.43750  4.96875  0.00000  0.00000  0.03125
 [33]  0.00000  0.00000  0.00000  0.00000 11.40625  2.03125  0.00000  0.00000
 [41] 12.34375 11.46875 36.43750 16.81250  0.00000 15.46875 13.78125 14.31250
 [49] 23.46875  3.18750  8.65625  0.87500  0.00000  0.00000  0.00000  0.00000
 [57]  0.00000  0.00000 25.62500 34.93750 11.87500 37.06250 12.90625 56.18750
 [65] 17.34375 30.53125  0.50000 11.18750 15.21875  2.03125  8.37500  3.43750
 [73]  0.81250  5.50000 10.59375 22.09375 20.68750 46.28125  8.21875  3.90625
 [81] 11.75000  7.68750  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [89]  0.00000  0.00000  0.00000  0.00000  0.00000  7.87500 23.37500  2.68750
 [97]  1.71875 12.71875 55.87500  6.62500 23.78125
> file.remove(dest)
TRUE

In any case, thanks for the report, and sorry to take so long to get back to you.

Bonus: we've been looking at all of the 500 errors in the logs from the user-agent "oc4.4.0" and had been wondering what the heck that was. You've helped us solve the mystery (since it seems to be the DODS client coming from R).

ConorIA commented 5 years ago

Hi @jameshiebert, thanks so much for looking into this. I guess this is as good a place as any to mention: the download speeds also seem to have an interesting behaviour. A request of ~542 KiB flies for the first 432 KiB or so, and then crawls for hours on the last 110 KiB or so.

e.g. for https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CCSM4_historical+rcp45_r2i1p1_19500101-21001231.nc.nc?tasmin[0:55115][32:32][739:739]

Notice the unit inversion around 80%.

Working on  tasmin_day_BCCAQv2+ANUSPLIN300_CCSM4_historical+rcp45_r2i1p1_19500101-21001231.nc

  0%|          | 0.00/541 [00:00<?, ?KB/s]
  1%|          | 4.00/541 [00:00<00:16, 32.0KB/s]
  7%|▋         | 37.0/541 [00:00<00:11, 42.4KB/s]
 12%|█▏        | 63.0/541 [00:00<00:08, 53.7KB/s]
 16%|█▌        | 86.0/541 [00:00<00:07, 64.9KB/s]
 22%|██▏       | 118/541 [00:00<00:05, 79.9KB/s] 
 28%|██▊       | 150/541 [00:01<00:04, 95.1KB/s]
 34%|███▎      | 182/541 [00:01<00:03, 111KB/s] 
 40%|███▉      | 214/541 [00:01<00:02, 137KB/s]
 45%|████▌     | 246/541 [00:01<00:01, 148KB/s]
 51%|█████▏    | 278/541 [00:01<00:01, 176KB/s]
 56%|█████▌    | 301/541 [00:01<00:01, 172KB/s]
 63%|██████▎   | 342/541 [00:01<00:01, 185KB/s]
 72%|███████▏  | 390/541 [00:02<00:00, 203KB/s]
 72%|███████▏  | 390/541 [00:17<00:00, 203KB/s]
 80%|████████  | 435/541 [02:31<01:45, 1.00KB/s]
 81%|████████  | 436/541 [05:24<1:32:01, 52.6s/KB]
 81%|████████  | 437/541 [08:19<2:34:52, 89.4s/KB]
 81%|████████  | 438/541 [11:12<3:16:25, 114s/KB] 
 81%|████████  | 439/541 [14:11<3:47:20, 134s/KB]
 81%|████████▏ | 440/541 [17:05<4:05:52, 146s/KB]
etc. etc.
 99%|█████████▉| 537/541 [4:30:09<09:35, 144s/KB]
 99%|█████████▉| 538/541 [4:32:35<07:12, 144s/KB]
100%|█████████▉| 539/541 [4:34:56<04:46, 143s/KB]
100%|█████████▉| 540/541 [4:37:21<02:23, 144s/KB]
100%|██████████| 541/541 [4:39:46<00:00, 144s/KB]

I've also found that any non-zero start time index causes the download to fail.
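
For example, something like this minimal sketch reproduces it for me (same dataset as the progress log above; only the start index differs between the two requests):

library(curl)

base <- "https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CCSM4_historical+rcp45_r2i1p1_19500101-21001231.nc.nc"

# A request starting at time index 0 downloads fine...
curl_download(paste0(base, "?tasmin[0:100][32:32][739:739]"), tempfile(fileext = ".nc"))

# ...but the same request shifted to start at index 1 fails
try(curl_download(paste0(base, "?tasmin[1:101][32:32][739:739]"), tempfile(fileext = ".nc")))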

corviday commented 5 years ago

Thank you for bringing the issue with non-zero start time indices to our attention. I have opened a separate issue for it and am working on it now.

ConorIA commented 5 years ago

Thanks @corviday. I'll subscribe there.