rubin-dp0 / Support

Submit GitHub issues related to DP0
MIT License

[BUG] Intermittent lock-up of Qserv query processing #14

Closed: fritzm closed this issue 2 years ago

fritzm commented 2 years ago

Qserv query processing sometimes locks up, requiring manual intervention by a server operator to be "unstuck".

This has been reported in #11 and #13, as well as in the Community support forum and on Slack. This ticket is intended to collect information on further occurrences and to provide consolidated tracking of progress on the underlying issue.

frossie commented 2 years ago

For a longer explanation of this issue, see https://community.lsst.org/t/intermittent-qserv-tap-query-service-issue-2021-07-13/5653

ktlim commented 2 years ago

Qserv got stuck on a SELECT * FROM object query just now, so I restarted it.

A few other queries also appeared to be stuck, including an object/truth_match join and two others that I believe came from the testing system.

fritzm commented 2 years ago

Okay, we believe this to be fixed with deployment of Qserv release 2021.7.1-rc1 this past Friday. The root cause was a resource deadlock in query distribution and result-accumulation code, at the interface between the Qserv head node and the underlying XRootD messaging/streaming library. Thank you for your patience while we tracked this one down!
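The thread does not describe the actual Qserv or XRootD code paths, so the following is only a toy model of how a resource deadlock of this general kind can appear intermittently, that is, only under load. It assumes, hypothetically, a single shared pool of slots: each query holds one slot from dispatch until its results arrive, and accumulating those results needs one more slot. The function name `simulate` and the "reserve a slot for results" remedy are illustrative assumptions, not Qserv's actual design or fix.

```python
def simulate(num_queries: int, pool_size: int, reserve_result_slot: bool) -> str:
    """Toy model (hypothetical): returns "ok" if every query completes,
    or "deadlock" if no further progress is possible."""
    free = pool_size        # unused slots in the shared pool
    pending = num_queries   # queries not yet dispatched
    in_flight = 0           # dispatched queries holding a slot while awaiting results
    completed = 0
    while completed < num_queries:
        progressed = False
        # Dispatch: each new query grabs a slot and holds it until its results are in.
        # Optionally keep one slot in reserve so result handling can always proceed.
        floor = 1 if reserve_result_slot else 0
        while pending > 0 and free > floor:
            free -= 1
            pending -= 1
            in_flight += 1
            progressed = True
        # Accumulate: finishing a query needs one additional slot for result handling.
        if in_flight > 0 and free > 0:
            free -= 1       # take the result-handling slot
            in_flight -= 1
            completed += 1
            free += 2       # release both the dispatch slot and the result slot
            progressed = True
        if not progressed:
            # Every slot is held by a dispatched query waiting for a result slot.
            return "deadlock"
    return "ok"


print(simulate(3, 4, reserve_result_slot=False))  # light load: completes
print(simulate(8, 4, reserve_result_slot=False))  # pool saturated by dispatches
print(simulate(8, 4, reserve_result_slot=True))   # reserved capacity avoids the stall
```

In this model the unreserved scheduler only stalls once concurrent dispatches fill the whole pool, which is one way a bug like this can look intermittent in production while passing under light testing.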