nikola-rados opened this issue 3 years ago
Examining the `snakeviz` output for a request of size 571 MB (this is the size reported by `Dataset.nbytes / 2`) we get a pretty clear picture of what is holding back the performance:
Note: Given the exact same parameters I've seen this time vary quite a bit, anywhere from the high 200s to the low 400s of seconds.
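For anyone who wants to reproduce the profile, here is a minimal sketch using the standard library (the `process.py` arguments are the ones from the `make performance` target shown further down):

```python
# Capture a profile of the whole run and open it in snakeviz:
#
#   python -m cProfile -o orca.prof scripts/process.py <args as below>
#   snakeviz orca.prof
#
# The hottest calls can also be listed without snakeviz:
import pstats

stats = pstats.Stats("orca.prof")
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
```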
The `Dataset.to_netcdf()` method takes up basically the entire runtime of the program. If we follow the call stack to the bottom, we see that the method is already using some threading to handle its execution:
Despite this, it doesn't seem to do things particularly quickly (at least it feels that way). @cairosanders and I have already tried to incorporate `asyncio` to load the individual requests simultaneously, but xarray's support for asynchronous tasks is pretty limited. And unfortunately the main bottleneck, `to_netcdf`, still remains.
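One alternative to `asyncio` that we could still try is a plain thread pool: the OPeNDAP reads spend most of their time waiting on the network, so threads should overlap them even under the GIL. A rough sketch, assuming the splits differ only along the time dimension (URLs are abbreviated and the output file name is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

import xarray as xr

# The two split URLs from the log below (abbreviated here).
urls = [
    "https://.../tasmax_...nc?tasmax[0:1:7500][0:1:91][0:1:206]",
    "https://.../tasmax_...nc?tasmax[7501:1:15000][0:1:91][0:1:206]",
]

def fetch(url):
    # .load() forces the download to happen inside the worker thread
    return xr.open_dataset(url).load()

with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    parts = list(pool.map(fetch, urls))

# The splits only differ along time, so concatenate on that dimension.
xr.concat(parts, dim="time").to_netcdf("tasmax_merged.nc")
```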
I don't know what the performance requirements/expectations are for orca, but I get the feeling this may be a little too slow. As such, I was hoping to open up some discussion about how we might go about speeding this up.
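To get the discussion going, one avenue that might be worth trying is pushing the whole pipeline through dask, so the download and the `to_netcdf` write can overlap and the scheduler/worker count is under our control instead of whatever the default threading does. A sketch, with the chunk size and worker count as pure guesses:

```python
import xarray as xr

# One of the split OPeNDAP URLs from below (placeholder).
url = "https://.../tasmax_...nc?tasmax[0:1:7500][0:1:91][0:1:206]"

# Chunking along time makes the dataset dask-backed, so reads happen
# lazily per chunk; the 1000-step chunk size would need tuning.
ds = xr.open_dataset(url, chunks={"time": 1000})

# compute=False returns a dask delayed write instead of running it
# eagerly, letting us pick the scheduler and worker count explicitly.
delayed_write = ds.to_netcdf("tasmax_part.nc", compute=False)
delayed_write.compute(scheduler="threads", num_workers=8)
```

Whether this actually helps depends on how much the reads and the write can overlap in practice, so it would need to be measured rather than assumed.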
To add some more detail, the results above were achieved by running `make performance`, which runs a test case that splits a single request into two. Here is a look at the parameters passed into the script (found in the link above):

```
scripts/process.py -u tasmax_day_BCCAQv2_bcc-csm1-1-m_historical-rcp26_r1i1p1_19500101-21001231_Canada -v tasmax[0:1:15000] -t [0:1:91] -n [0:1:206] -l DEBUG
```
The original request is split into these two requests:
```
'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:7500][0:1:91][0:1:206]'
'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[7501:1:15000][0:1:91][0:1:206]'
```
These are split in half along the time dimension such that both requests fall under the threshold.
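For what it's worth, the arithmetic behind the split is simple enough to sketch. This is a hypothetical helper, not orca's actual code; the per-step size is derived from the 571358088.0 figure in the log, and the threshold value is an assumption:

```python
import math

def split_time_bounds(n_steps, bytes_per_step, threshold):
    """Divide [0, n_steps) into contiguous index ranges whose estimated
    size stays under `threshold` bytes. Hypothetical helper, not the
    actual orca implementation."""
    n_splits = math.ceil(n_steps * bytes_per_step / threshold)
    step = math.ceil(n_steps / n_splits)
    return [(start, min(start + step, n_steps) - 1)
            for start in range(0, n_steps, step)]

# 15001 time steps at 38088 bytes/step gives the 571358088.0 reported
# below; an assumed ~500 MB threshold reproduces the two halves:
print(split_time_bounds(15001, 38088, 500_000_000))
# [(0, 7500), (7501, 15000)]
```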
Here is the full set of logs from the run:
```
2021-02-26 13:08:15 INFO: Processing data file request
2021-02-26 13:08:15 DEBUG: Starting db session
2021-02-26 13:08:15 DEBUG: Got filepath: /storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc
2021-02-26 13:08:15 DEBUG: Initial url: https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:15000][0:1:91][0:1:206]
2021-02-26 13:08:15 INFO: Downloading data file(s)
2021-02-26 13:08:16 DEBUG: Splitting, request over threshold: 571358088.0
2021-02-26 13:08:16 DEBUG: URL(s) for downloading: ['https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:7500][0:1:91][0:1:206]', 'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[7501:1:15000][0:1:91][0:1:206]']
2021-02-26 13:08:16 DEBUG: Downloading and merging 2 split files
2021-02-26 13:13:57 DEBUG: File writing complete
2021-02-26 13:13:57 INFO: Complete
```
While the script is working as intended thus far, its performance may become a concern for its viability. This issue is intended to seek out ways to improve the speed.