Closed by jameshiebert 1 year ago
An example download of many observations yielded an interesting result.
Summary: Error is OperationalError: (psycopg2.extensions.QueryCanceledError) canceling statement due to statement timeout
Select station near Vernon (lower left in image below; selected by rectangle):
Here's (part of) the metadata:
The resulting download is a zip file with the following contents:
Extracting the .nc file shows rather interesting (and not netCDF-type) contents:
$ cat ~/Downloads/23097.nc
Error {
code = -1;
message = "Traceback (most recent call last):
File \"/usr/local/lib/python2.7/dist-packages/pydap/handlers/lib.py\", line 73, in __call__
app = self.responses[response](dataset)
File \"/usr/local/lib/python2.7/dist-packages/pydap/responses/netcdf/__init__.py\", line 58, in __init__
n = len(seq)
File \"/usr/local/lib/python2.7/dist-packages/pydap/model.py\", line 435, in __len__
def __len__(self): return len(self.data)
File \"/usr/local/lib/python2.7/dist-packages/pydap/handlers/sql/__init__.py\", line 316, in __len__
data = conn.execute(*self.query)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py\", line 1154, in execute
bind, close_with_result=True).execute(clause, params or {})
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 945, in execute
return meth(self, multiparams, params)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py\", line 263, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 1053, in _execute_clauseelement
compiled_sql, distilled_params
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 1189, in _execute_context
context)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 1402, in _handle_dbapi_exception
exc_info
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py\", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 1182, in _execute_context
context)
File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py\", line 470, in do_execute
cursor.execute(statement, parameters)
OperationalError: (psycopg2.extensions.QueryCanceledError) canceling statement due to statement timeout
[SQL: 'SELECT RELATIVE_HUMIDITY1, STANDARD_SNOW, MINIMUM_AIR_TEMPERATURE, max_wnd_spd_10m_pst1hr, MEASURED_WIND_SPEED1, MEASURED_WIND_DIRECTION1, dwpt_temp, min_air_temp_snc_last_reset, DEW_POINT, ATMOSPHERIC_PRESSURE, avg_wnd_dir_10m_pst10mts, ACTUAL_WIND_DIRECTION, MAXIMUM_AIR_TEMPERATURE, snw_dpth, MAXIMUM_MEASURED_WIND_SPEED1, HEIGHT_OF_SNOW, pcpn_amt_pst24hrs, CURRENT_AIR_TEMPERATURE2, CURRENT_AIR_TEMPERATURE1, PRECIPITATION_NEW, max_air_temp_snc_last_reset, PRECIP_DETECTOR_RATIO, ACTUAL_WIND_SPEED, WIND_DIRECTION_STD_DEVIATION1, avg_wnd_spd_10m_pst10mts, air_temp, mslp, pcpn_amt_pst1hr, snwfl_amt_pst1hr, rel_hum, obs_time, stn_pres FROM (SELECT obs_time,MAX(CASE WHEN vars_id=445 THEN datum END) as ACTUAL_WIND_DIRECTION,MAX(CASE WHEN vars_id=444 THEN datum END) as ACTUAL_WIND_SPEED,MAX(CASE WHEN vars_id=721 THEN datum END) as air_temp,MAX(CASE WHEN vars_id=439 THEN datum END) as ATMOSPHERIC_PRESSURE,MAX(CASE WHEN vars_id=731 THEN datum END) as avg_wnd_dir_10m_pst10mts,MAX(CASE WHEN vars_id=729 THEN datum END) as avg_wnd_spd_10m_pst10mts,MAX(CASE WHEN vars_id=434 THEN datum END) as CURRENT_AIR_TEMPERATURE1,MAX(CASE WHEN vars_id=436 THEN datum END) as CURRENT_AIR_TEMPERATURE2,MAX(CASE WHEN vars_id=438 THEN datum END) as DEW_POINT,MAX(CASE WHEN vars_id=724 THEN datum END) as dwpt_temp,MAX(CASE WHEN vars_id=450 THEN datum END) as HEIGHT_OF_SNOW,MAX(CASE WHEN vars_id=720 THEN datum END) as max_air_temp_snc_last_reset,MAX(CASE WHEN vars_id=433 THEN datum END) as MAXIMUM_AIR_TEMPERATURE,MAX(CASE WHEN vars_id=449 THEN datum END) as MAXIMUM_MEASURED_WIND_SPEED1,MAX(CASE WHEN vars_id=730 THEN datum END) as max_wnd_spd_10m_pst1hr,MAX(CASE WHEN vars_id=447 THEN datum END) as MEASURED_WIND_DIRECTION1,MAX(CASE WHEN vars_id=446 THEN datum END) as MEASURED_WIND_SPEED1,MAX(CASE WHEN vars_id=722 THEN datum END) as min_air_temp_snc_last_reset,MAX(CASE WHEN vars_id=435 THEN datum END) as MINIMUM_AIR_TEMPERATURE,MAX(CASE WHEN vars_id=733 THEN datum END) as mslp,MAX(CASE WHEN 
vars_id=726 THEN datum END) as pcpn_amt_pst1hr,MAX(CASE WHEN vars_id=732 THEN datum END) as pcpn_amt_pst24hrs,MAX(CASE WHEN vars_id=443 THEN datum END) as PRECIP_DETECTOR_RATIO,MAX(CASE WHEN vars_id=441 THEN datum END) as PRECIPITATION_NEW,MAX(CASE WHEN vars_id=437 THEN datum END) as RELATIVE_HUMIDITY1,MAX(CASE WHEN vars_id=723 THEN datum END) as rel_hum,MAX(CASE WHEN vars_id=727 THEN datum END) as snw_dpth,MAX(CASE WHEN vars_id=728 THEN datum END) as snwfl_amt_pst1hr,MAX(CASE WHEN vars_id=451 THEN datum END) as STANDARD_SNOW,MAX(CASE WHEN vars_id=725 THEN datum END) as stn_pres,MAX(CASE WHEN vars_id=448 THEN datum END) as WIND_DIRECTION_STD_DEVIATION1 from obs_raw WHERE (history_id = 3433) AND vars_id IN (445,444,721,439,731,729,434,436,438,724,450,720,433,449,730,447,446,722,435,733,726,732,443,441,437,723,727,728,451,725,448) GROUP BY obs_time ORDER BY obs_time) as foo LIMIT 9223372036854775807 OFFSET 0']
Aside: What's with the double // in the URL?
https://services.pacificclimate.org/met-data-portal-pcds/api/data//pcds/......
The obvious suspicion is that the database connection is timing out, as we saw for crmprtd in March.
Therefore EY deployed a dev instance with a db connection string with keepalives, thus:
$ cat be-data.env
PCDS_DSN=postgresql://httpd:PASSWORD@db.pcic.uvic.ca/crmp?keepalives=1&keepalives_idle=300&keepalives_interval=300&keepalives_count=9
...
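As a sanity check on that DSN: the keepalive settings in the query string are standard libpq connection parameters, and psycopg2 accepts them as keyword arguments. A minimal sketch (the parse helper is ours, not part of the deployment) of how they map:

```python
# Hypothetical helper: extract the libpq keepalive options from a URL-style
# DSN query string, as found in the PCDS_DSN env var above.
from urllib.parse import urlsplit, parse_qsl

def keepalive_params(dsn):
    """Return the keepalive* options from a DSN's query string as a dict."""
    query = urlsplit(dsn).query
    return {k: int(v) for k, v in parse_qsl(query) if k.startswith("keepalives")}

dsn = ("postgresql://httpd:PASSWORD@db.pcic.uvic.ca/crmp"
       "?keepalives=1&keepalives_idle=300&keepalives_interval=300&keepalives_count=9")
# The same keys can be passed straight through, e.g.
#   psycopg2.connect(host=..., dbname="crmp", **keepalive_params(dsn))
print(keepalive_params(dsn))
```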
Here's a try with curl:
$ curl "https://services.pacificclimate.org/dev/met-data-portal-pcds/api/data//pcds/agg/?from-date=&to-date=&network-name=&input-vars=&input-freq=&input-polygon=MULTIPOLYGON%20(((-119.308252%2050.209165%2C%20-119.308252%2050.21286%2C%20-119.304226%2050.21286%2C%20-119.304226%2050.209165%2C%20-119.308252%2050.209165)))&only-with-climatology=&download-timeseries=Timeseries&data-format=nc" -o test-download.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 691 0 691 0 0 13 0 --:--:-- 0:00:50 --:--:-- 0
curl: (18) transfer closed with outstanding read data remaining
The downloaded file is gobbledygook as text, and is not recognized as a zip file or as a .nc file.
That 691-byte file size seems to crop up frequently when there is a problem, but that does not seem to advance our investigation here.
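For what it's worth, one can tell at a glance what kind of payload actually came back by checking the file's magic bytes; a small sketch (the function name is ours):

```python
def classify_download(blob: bytes) -> str:
    """Guess the payload type from its magic bytes: a good download here is
    a zip archive (PK\\x03\\x04); classic netCDF starts with b'CDF' and
    netCDF-4/HDF5 with b'\\x89HDF'. Anything else is likely an error
    payload, such as the 691-byte responses above."""
    if blob[:4] == b"PK\x03\x04":
        return "zip"
    if blob[:3] == b"CDF" or blob[:4] == b"\x89HDF":
        return "netcdf"
    return "unknown"

# e.g. the first few bytes of a healthy download:
print(classify_download(b"PK\x03\x04..."))  # zip
```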
A very similar result is obtained with curl on the URL provided by the production app:
$ curl "https://services.pacificclimate.org/met-data-portal-pcds/api/data//pcds/agg/?from-date=&to-date=&network-name=&input-vars=&input-freq=&input-polygon=MULTIPOLYGON%20(((-119.314056%2050.206113%2C%20-119.314056%2050.215456%2C%20-119.296609%2050.215456%2C%20-119.296609%2050.206113%2C%20-119.314056%2050.206113)))&only-with-climatology=&download-timeseries=Timeseries&data-format=nc" -o test-download-prod.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 691 0 691 0 0 13 0 --:--:-- 0:00:50 --:--:--
Note identical size and time. These are probably clues.
But the results are not quite identical:
$ cmp test-download.zip test-download-prod.zip
test-download.zip test-download-prod.zip differ: byte 11, line 1
The results are different when the Download Timeseries button is clicked in the prod and dev apps.
In the prod app, we get the result shown above (with the screenshots).
In the dev app we get the result shown below (top right); no file is actually downloaded, so its contents cannot be checked.
The SQL query you show above takes nearly 30 seconds to run, so a timeout issue sounds probable.
crmp=> SELECT RELATIVE_HUMIDITY1, ...
Time: 27513.105 ms (00:27.513)
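For reference, a sketch of the mechanism behind the QueryCanceledError in the traceback: PostgreSQL's statement_timeout cancels any statement that runs longer than the setting. The helper below just builds the commands one would issue (the 30 s threshold is an illustrative assumption, not the server's actual setting):

```python
# Hypothetical helper: wrap a query in a transaction with a per-transaction
# statement timeout. SET LOCAL confines the setting to this transaction, so
# other queries on the connection are unaffected.

def timeout_commands(timeout_ms, sql):
    """Commands to run `sql` under a statement timeout of `timeout_ms` ms."""
    return [
        "BEGIN",
        f"SET LOCAL statement_timeout = {int(timeout_ms)}",
        sql,
        "COMMIT",
    ]

# The pivot query above took ~27.5 s, so any statement_timeout below
# ~27500 ms would cancel it mid-flight with exactly this error.
for cmd in timeout_commands(30000, "SELECT ..."):
    print(cmd)
```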
Our production deployment does not have keepalives set in the database DSN...
Yes, but the dev deployment does, and it misbehaves too. :(
But it may be misbehaving differently. It's hard to tell. Still looking. Need a brief walk to clear head.
Possibly a very significant finding. I grew suspicious of the exact 50 s fail time on each download and thought: maybe HAProxy is doing something? As a test, I tried going directly to the server, and it downloaded a good file with very plausible contents:
$ curl --max-time 900 --connect-timeout 300 "http://docker-prod02.pcic.uvic.ca:30503/api/data/pcds/agg/?from-date=&to-date=&network-name=&input-vars=&input-freq=&input-polygon=MULTIPOLYGON%20(((-119.314056%2050.206113%2C%20-119.314056%2050.215456%2C%20-119.296609%2050.215456%2C%20-119.296609%2050.206113%2C%20-119.314056%2050.206113)))&only-with-climatology=&download-timeseries=Timeseries&data-format=nc" -o test-download-prod.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7257k 0 7257k 0 0 83184 0 --:--:-- 0:01:29 --:--:-- 1797k
Contents of downloaded file:
I think the --max-time and --connect-timeout options are unnecessary, but they are left over from a try against the public URL as follows:
$ curl --max-time 900 --connect-timeout 300 "https://services.pacificclimate.org/met-data-portal-pcds/api/data/pcds/agg/?from-date=&to-date=&network-name=&input-vars=&input-freq=&input-polygon=MULTIPOLYGON%20(((-119.314056%2050.206113%2C%20-119.314056%2050.215456%2C%20-119.296609%2050.215456%2C%20-119.296609%2050.206113%2C%20-119.314056%2050.206113)))&only-with-climatology=&download-timeseries=Timeseries&data-format=nc" -o test-download-prod.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 691 0 691 0 0 13 0 --:--:-- 0:00:50 --:--:-- 0
curl: (18) transfer closed with outstanding read data remaining
The problem was not the database connection proper timing out. It was, as I hinted at above, that HAProxy has a default connection (different sense of that word) timeout of 50 s. For large downloads from Met Data Portal there is a gap, presumably while it marshals and formats the data, that is longer than 50 s.
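That gap can be quantified if the client records when each response chunk arrives; the longest silent interval is what has to stay under HAProxy's server timeout. A small sketch (the helper is ours, not part of the portal):

```python
# Hypothetical helper to quantify the silent gap in a streamed download.
# If the longest interval between received chunks exceeds the proxy's idle
# timeout (50 s here), HAProxy closes the connection mid-transfer --
# exactly the curl (18) failures above.

def max_gap(chunk_times):
    """Longest silent interval between consecutive chunk-arrival timestamps."""
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return max(gaps, default=0.0)

# e.g. chunks arriving at t = 0 s, 10 s, 65 s, 70 s: the 55 s silence
# between the 2nd and 3rd chunks would trip a 50 s proxy timeout.
print(max_gap([0.0, 10.0, 65.0, 70.0]))  # 55.0
```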
Therefore I asked Tom to increase the timeout on the prod and dev MDP data backends to 5 min. This fixed the problem, which can be verified both with curl and by a live test with the app.
No change to code or deployment, only to HAProxy configuration. For those interested, Tom says:
[root@proxy01 haproxy]# diff haproxy.cfg haproxy.cfg.20230614.tk
677,680d676
< # hack for https://github.com/pacificclimate/pdp/issues/294#issuecomment-1592098943
< timeout server 5m
< timeout client 5m
< # end hack
686,689d681
< # hack for https://github.com/pacificclimate/pdp/issues/294#issuecomment-1592098943
< timeout server 5m
< timeout client 5m
< # end hack
I'm not sure I'd call that a "hack", but de gustibus ...
Thanks, Tom, for your help!
On behalf of Aidan LeBlanc aidan.w.leblanc@gmail.com:
I can confirm similar behaviour with a URL ending in &only-with-climatology=&download-timeseries=Timeseries&data-format=nc (obtained through the UI following Aidan's filters):