icatproject / topcat

GUI to work with multiple ICAT and IDS servers.
Other
1 stars 9 forks source link

StatusCheck can ignore poll delays when non-Topcat exceptions are thrown #449

Closed brianritchie1312 closed 5 years ago

brianritchie1312 commented 5 years ago

There are some pathological situations where StatusCheck will ignore poll delays and call IDS operations on every iteration (once per second, per download). This can happen if the prepareData() or isPrepared() requests on (Topcat's) IdsClient fail with something other than a TopcatException.

(A TopcatException results in the download status being set to EXPIRED; the download is then no longer considered for updates, though it is still shown in the user (and admin) Downloads view.)

The causes are slightly different:

The simplest solution would be to treat all Exceptions the same as TopcatExceptions, and mark the download as EXPIRED. This means that a temporary glitch in the connection to the IDS would cause all new and in-progress downloads to be (irrevocably) set to EXPIRED; but that's the case for TopcatExceptions anyway (and I presume this was considered both necessary and acceptable at some point in the past - see below). Allowing re-tries (at intervals) would be more complicated and I'm not sure it would be worthwhile.

Re. "at some point in the past": see:

performCheck(): issue #138, commit 568caf9 (25 Jan 2016)

prepareDownload(): commit https://github.com/icatproject/topcat/commit/03059f3f4cbcc16018cf32613e03c3a07ed0a681 (17 Jul 2017)

brianritchie1312 commented 5 years ago

Analysis of topcat logs over the past few weeks (to 19th August) shows no evidence of runaway IdsClient polling. However, the few instances of performCheck IOExceptions are all prepareData timeouts rather than instantaneous failures.

Both prepareDownload() and performCheck() have specific handlers for NotFoundException before handling its parent TopcatException, and both simply log the error then ignore it. I suspect this was to monitor the behaviour and then to prevent it from causing the scheduler thread to crash, rather than trying to handle the situation properly.

I still consider that the correct thing to do is to expire the download for any exceptions.

brianritchie1312 commented 5 years ago

Looking further back in the DLS production topcat logs, I've found instances of runaway performCheck() calls (where IdsClient.isPrepared is failing with IOException "connection refused" exceptions); on 2018-11-06 the first occurrance lasted for 20 minutes, one entry per second. Later occurrances seem to vary between IOException and BadRequestException, both for "connection refused", but the latter caused the download to be Expired. I haven't found any examples since 2018-11-09, so this appears not to have happened since 2.3.2.2 was superseded; but I suspect it could still happen.

brianritchie1312 commented 5 years ago

I've just rediscovered the following remark from the Release Notes for 2.4.0:

The change made here probably undoes that. I don't intend to do anything about this, but if any production instances start reporting that downloads are being expired for no obvious reason, this may be worth reconsidering.