ConorIA opened this issue 6 years ago (status: Open)
Just to confirm, you resolved the original question?
As far as download speeds, the data portal is under pretty heavy use right now, and download speeds are currently horribly slow.
Thanks for your comment. I have not resolved the question; I just fixed the reprex so that the variables being queried are consistent. To summarize: can the netCDF files be accessed directly, or must they first be downloaded locally? When accessed directly from Python or R, only the first few time steps return valid data.
I just spent a few minutes looking at this and something is definitely out of sorts. Your use case of accessing the datasets directly from R or Python over OPeNDAP is definitely one of our intended features, so this is a great bug report.
So the difference between accessing the data "directly" from R/Python and downloading it over HTTP is which PyDAP responder gets invoked: the former uses the DODS responder, while the latter uses the NetCDF responder.
Apparently there is a bug in the DODS responder. Watching the logs, I can see the responder get invoked, lots of warnings fly by, and then at some point the backend crashes and gunicorn spins up another worker. The client then complains about not getting all of the data that was promised (presumably the Content-Length header doesn't match what was delivered).
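As a rough illustration of the distinction: which responder serves a request is selected by the suffix appended to the dataset path. The suffixes below are inferred from the URLs seen elsewhere in this thread (`.dods` for the DODS responder, `.nc` for the NetCDF responder), so treat them as an assumption about the portal's PyDAP configuration rather than documented behaviour:

```python
# Sketch: the responder is chosen by the suffix appended to the dataset
# path. These helpers are purely illustrative, not part of any portal API.

BASE = ("https://services.pacificclimate.org/data/downscaled_gcms/"
        "pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_"
        "historical+rcp45_r1i1p1_19500101-21001231.nc")

def dods_url(base, query):
    """URL an OPeNDAP client (R/ncdf4, netCDF4-python) requests internally."""
    return f"{base}.dods?{query}"

def netcdf_url(base, query):
    """URL asking the NetCDF responder for a subset file over plain HTTP."""
    return f"{base}.nc?{query}"

# The dataset name already ends in .nc, so the NetCDF responder's URL
# ends up with the slightly odd-looking .nc.nc suffix.
print(dods_url(BASE, "time"))
print(netcdf_url(BASE, "pr[0:100][112:112][158:158]&"))
```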
On the client:
> nc = nc_open('https://services.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc')
> x <- ncvar_get(nc, 'pr', start=c(740, 33, 11316), count=c(1, 1, 100))
CURL Error: Transferred a partial file
On the server:
2019-01-31 20:31:15 [17] [INFO] 172.17.0.1 - - [31/Jan/2019:20:31:15 +0000] "GET /downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.dods?time HTTP/1.1" 200 441083 "-" "oc4.4.0"
/usr/local/lib/python2.7/dist-packages/gevent/builtins.py:96: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
result = _import(*args, **kwargs)
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute DIMENSION_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:19 [18] [WARNING] Failed to convert attribute REFERENCE_LIST
2019-01-31 20:31:48 [8] [CRITICAL] WORKER TIMEOUT (pid:18)
2019-01-31 20:31:48 [8] [CRITICAL] WORKER TIMEOUT (pid:18)
2019-01-31 20:31:50 [18] [INFO] Worker exiting (pid: 18)
2019-01-31 20:31:50 [18] [INFO] Worker exiting (pid: 18)
2019-01-31 20:31:50 [18] [INFO] Sending a poison pill to the queue runner
2019-01-31 20:31:50 [18] [INFO] Sending a poison pill to the queue runner
2019-01-31 20:31:50 [44] [INFO] Consumer-2: Exiting
2019-01-31 20:31:50 [32] [INFO] Consumer-1: Exiting
2019-01-31 20:31:50 [67] [INFO] Booting worker with pid: 67
2019-01-31 20:31:50 [67] [INFO] Booting worker with pid: 67
This is definitely a serious bug, so we'll try to take some time to look into it further. In the meantime, a workaround is to download the data via the NetCDF responder by constructing an OPeNDAP subset URL. You can even do this programmatically, pretty easily:
library(ncdf4)
library(curl)
> dest <- tempfile(fileext = ".nc")
> curl_download("https://data.pacificclimate.org/data/downscaled_gcms/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+inmcm4_historical+rcp85_r1i1p1_19500101-21001231.nc.nc?pr[0:100][112:112][158:158]&", dest)
> nc <- nc_open(dest)
> x <- ncvar_get(nc, 'pr')
> range(x)
[1] 0.0000 56.1875
> hist(x, plot=F)
$breaks
[1] 0 10 20 30 40 50 60
$counts
[1] 57 25 9 6 2 2
$density
[1] 0.056435644 0.024752475 0.008910891 0.005940594 0.001980198 0.001980198
$mids
[1] 5 15 25 35 45 55
$xname
[1] "x"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
> x
[1] 9.50000 12.78125 0.59375 0.71875 21.62500 16.37500 13.12500 11.93750
[9] 3.12500 1.21875 14.75000 15.50000 12.71875 29.09375 8.68750 6.65625
[17] 13.50000 39.75000 7.21875 18.59375 30.06250 4.90625 46.93750 25.37500
[25] 1.78125 12.90625 1.93750 8.43750 4.96875 0.00000 0.00000 0.03125
[33] 0.00000 0.00000 0.00000 0.00000 11.40625 2.03125 0.00000 0.00000
[41] 12.34375 11.46875 36.43750 16.81250 0.00000 15.46875 13.78125 14.31250
[49] 23.46875 3.18750 8.65625 0.87500 0.00000 0.00000 0.00000 0.00000
[57] 0.00000 0.00000 25.62500 34.93750 11.87500 37.06250 12.90625 56.18750
[65] 17.34375 30.53125 0.50000 11.18750 15.21875 2.03125 8.37500 3.43750
[73] 0.81250 5.50000 10.59375 22.09375 20.68750 46.28125 8.21875 3.90625
[81] 11.75000 7.68750 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[89] 0.00000 0.00000 0.00000 0.00000 0.00000 7.87500 23.37500 2.68750
[97] 1.71875 12.71875 55.87500 6.62500 23.78125
> file.remove(dest)
TRUE
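For Python users, a rough equivalent of the R workaround above might look like this. The `build_subset_url` helper is hypothetical (it just reproduces the URL pattern shown above), and the actual download line is commented out since it requires network access to the portal:

```python
# Python sketch of the same workaround: build a NetCDF-responder subset
# URL and download it as an ordinary file. build_subset_url is a
# hypothetical helper, not part of any portal API.
from urllib.request import urlretrieve  # stdlib downloader

def build_subset_url(base, var, slices):
    """Assemble e.g. base + '.nc?pr[0:100][112:112][158:158]&'."""
    idx = "".join(f"[{lo}:{hi}]" for lo, hi in slices)
    return f"{base}.nc?{var}{idx}&"

url = build_subset_url(
    "https://data.pacificclimate.org/data/downscaled_gcms/"
    "pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+inmcm4_"
    "historical+rcp85_r1i1p1_19500101-21001231.nc",
    "pr", [(0, 100), (112, 112), (158, 158)])
print(url)

# urlretrieve(url, "subset.nc")  # uncomment to actually download
```

The downloaded file is a plain netCDF subset, so any local reader (netCDF4-python, xarray, R's ncdf4) can open it without going through OPeNDAP.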
In any case, thanks for the report, and sorry to take so long to get back to you.
Bonus: we've been looking at all of the 500 errors in the logs from the user-agent "oc4.4.0" and had been wondering what the heck that was. You've helped us solve the mystery (since it seems to be the DODS client coming from R).
Hi @jameshiebert, thanks so much for looking into this. I guess this is as good a place as any to mention: the download speeds also show an interesting behaviour. A request of ~542 KiB flies for the first ~432 KiB, then crawls for hours over the last ~110 KiB.
Notice the unit inversion around 80%.
Working on tasmin_day_BCCAQv2+ANUSPLIN300_CCSM4_historical+rcp45_r2i1p1_19500101-21001231.nc
0%| | 0.00/541 [00:00<?, ?KB/s]
1%| | 4.00/541 [00:00<00:16, 32.0KB/s]
7%|▋ | 37.0/541 [00:00<00:11, 42.4KB/s]
12%|█▏ | 63.0/541 [00:00<00:08, 53.7KB/s]
16%|█▌ | 86.0/541 [00:00<00:07, 64.9KB/s]
22%|██▏ | 118/541 [00:00<00:05, 79.9KB/s]
28%|██▊ | 150/541 [00:01<00:04, 95.1KB/s]
34%|███▎ | 182/541 [00:01<00:03, 111KB/s]
40%|███▉ | 214/541 [00:01<00:02, 137KB/s]
45%|████▌ | 246/541 [00:01<00:01, 148KB/s]
51%|█████▏ | 278/541 [00:01<00:01, 176KB/s]
56%|█████▌ | 301/541 [00:01<00:01, 172KB/s]
63%|██████▎ | 342/541 [00:01<00:01, 185KB/s]
72%|███████▏ | 390/541 [00:02<00:00, 203KB/s]
72%|███████▏ | 390/541 [00:17<00:00, 203KB/s]
80%|████████ | 435/541 [02:31<01:45, 1.00KB/s]
81%|████████ | 436/541 [05:24<1:32:01, 52.6s/KB]
81%|████████ | 437/541 [08:19<2:34:52, 89.4s/KB]
81%|████████ | 438/541 [11:12<3:16:25, 114s/KB]
81%|████████ | 439/541 [14:11<3:47:20, 134s/KB]
81%|████████▏ | 440/541 [17:05<4:05:52, 146s/KB]
etc. etc.
99%|█████████▉| 537/541 [4:30:09<09:35, 144s/KB]
99%|█████████▉| 538/541 [4:32:35<07:12, 144s/KB]
100%|█████████▉| 539/541 [4:34:56<04:46, 143s/KB]
100%|█████████▉| 540/541 [4:37:21<02:23, 144s/KB]
100%|██████████| 541/541 [4:39:46<00:00, 144s/KB]
I've also found that any non-zero start time index causes the download to fail.
Thank you for bringing the issue with non-zero start time indices to our attention. I have opened a separate issue for it and am working on it now.
Thanks @corviday. I'll subscribe there.
Hello. This may be the wrong place to post this, but I landed here via the FAQ in the user docs for the data portal.
I've been trying, from both Python and R, to access the downscaled_gcms catalogue programmatically. However, the values that I am getting out are incorrect. I might be missing an authentication step somewhere, but it seems the OpenID login page at https://data.pacificclimate.org/esg-orp/j_spring_openid_security_check.htm is no longer online. The steps to reproduce (in R, at least) are here:
I am looking for data from 1981-01-01 from Toronto (43.67, -79.40)
That definitely doesn't look right, so I'll try the whole time series for the cell.
Still not right. This is not, however, the case when I use the online map to select data points. For instance (from bash):
I must be missing something. Any advice would be greatly appreciated.
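One thing worth double-checking in a situation like this is the time-index arithmetic. A minimal sketch, assuming a daily time axis starting 1950-01-01 on a standard (Gregorian) calendar; if the dataset actually uses a 365-day calendar the offset differs, which is an easy way to end up reading the wrong time steps:

```python
# Sketch: turning a target date into a 0-based index on a daily time axis.
# Assumes a Gregorian calendar with origin 1950-01-01; the dataset's
# actual calendar attribute should be checked before trusting the result.
from datetime import date

def time_index(target, origin=date(1950, 1, 1)):
    """0-based index of `target` along a daily axis starting at `origin`."""
    return (target - origin).days

idx = time_index(date(1981, 1, 1))
print(idx)  # 11323 under a Gregorian calendar
```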
EDIT: I had mixed tasmin and tasmax in my original reprex.
EDIT2: Is it typical to get < 500 bit/s download speeds from the data portal? I'm getting 649k over 25 minutes.