vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: vttablet container OOMKilled at huge concurrent select query with consolidator #17243

Open jwangace opened 5 days ago

jwangace commented 5 days ago

Overview of the Issue

This is reproducible on Kubernetes deployments running vitess-operator with Vitess v16. When the consolidator is enabled and a select query is run concurrently at large scale through vtgate, the vttablet container gets OOMKilled.

Consolidated Query Wait Count (vttablet_waits_count)

Screenshot 2024-11-16 at 7 59 54 AM

OOMKilled Metrics

Screenshot 2024-11-16 at 7 58 15 AM

Reproduction Steps

To reproduce this issue more easily:
1) Set a relatively small memory limit for the vttablet container (1Gi, for example).
2) Craft a select query whose result set is relatively large (5Mi, for example).
3) Run that select query concurrently at large scale through vtgate (10,000 queries, for example).
4) Observe that vttablet gets OOMKilled.
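The steps above can be sketched as a small load driver. This is only an illustration under stated assumptions, not the reporter's actual test harness: `run_query` is a hypothetical callable that would issue the select through vtgate with whatever MySQL client is available, and `worst_case_bytes` assumes that every in-flight query holds its own copy of the result, which the report does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

MIB = 1024 * 1024
GIB = 1024 * MIB


def worst_case_bytes(num_queries: int, result_bytes: int) -> int:
    """Upper bound on result memory, ASSUMING one result copy per in-flight query."""
    return num_queries * result_bytes


def run_concurrently(run_query: Callable[[], int],
                     num_queries: int,
                     workers: int) -> Tuple[int, int]:
    """Submit num_queries calls of run_query to a pool of `workers` threads.

    run_query is a hypothetical stand-in: in a real test it would send the
    large select through vtgate and return the number of bytes received.
    Returns (succeeded, failed).
    """
    succeeded = 0
    failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_query) for _ in range(num_queries)]
        for future in futures:
            try:
                future.result()
                succeeded += 1
            except Exception:
                # Connection resets while the tablet restarts would land here.
                failed += 1
    return succeeded, failed


if __name__ == "__main__":
    # Numbers from the reproduction steps: 10,000 queries, ~5 MiB each, 1 GiB limit.
    need = worst_case_bytes(10_000, 5 * MIB)
    print(f"worst case: {need / GIB:.1f} GiB against a 1 GiB limit")

    # Placeholder query so the sketch runs without a cluster.
    print(run_concurrently(lambda: 5 * MIB, num_queries=100, workers=16))
```

Under that one-copy-per-query assumption, the example numbers exceed the 1Gi limit many times over, which would be consistent with the container being killed almost immediately.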

Binary Version

Vitess 16 and later versions.

/vt/bin$ ./vtgate --version
Version: 16.0.3-SNAPSHOT (Git revision 4335eaf8ce3fa328aacd36e66f4776bd5208c7c8 branch 'v16-hc-demonware') built on Tue Dec 12 18:02:03 UTC 2023 by vitess@buildkitsandbox using go1.20.5 linux/amd64

/vt/bin$ ./vttablet --version
Version: 16.0.3-SNAPSHOT (Git revision 4335eaf8ce3fa328aacd36e66f4776bd5208c7c8 branch 'v16-hc-demonware') built on Tue Dec 12 18:02:03 UTC 2023 by vitess@buildkitsandbox using go1.20.5 linux/amd64

Operating System and Environment details

kubernetes version: v1.27.11

Log Fragments

The OOM kill happens so quickly that no log output is produced beforehand.

shlomi-noach commented 4 days ago

where runs vitess-operator with vitess v16.

@jwangace thank you for the report! Seeing that v16 is unsupported, could you please clarify whether the bug still appears on supported versions (v19, v20, v21 at this time)?

jwangace commented 4 days ago

Hi @shlomi-noach, as you might have noticed, I also opened a PR with a proposed fix against the latest code. Unfortunately, we don't have any v22 deployments, so I did not reproduce the issue on v22. However, I cross-checked the related function (execSelect, which I propose to update), and I believe the bug is still present in the current code.

Do you think this is something PlanetScale can verify by following the Reproduction Steps?

shlomi-noach commented 3 days ago

@jwangace thank you, let us take a look!