Even after removing all workload from the server these two threads keep consuming 99.9% CPU.
Please show the following at this moment:
top
vmstat 5 during a minute
show threads option format=all
select * from @@system.sessions
show status
show table <name> status of your table(s)
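For anyone who needs to capture the same diagnostics, here is a minimal shell sketch, assuming the SphinxQL listener is on the default port 9306 and the mysql client is available (host, port, and the table name mytable are assumptions to adjust for your setup):

top -b -n 1 > top.txt
vmstat 5 12 > vmstat-5.txt    # 12 samples at 5-second intervals = one minute
mysql -h127.0.0.1 -P9306 -e 'show threads option format=all' > show-threads.txt
mysql -h127.0.0.1 -P9306 -e 'select * from @@system.sessions' > system-sessions.txt
mysql -h127.0.0.1 -P9306 -e 'show status' > show-status.txt
mysql -h127.0.0.1 -P9306 -e 'show table mytable status' > show-table-status.txt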
I have observed a similar failure. Workers would go to 100%, the connection to the client would break (the client receives no response).
They were processing Sphinx protocol requests querying a text field "all_childs", which can contain the words "child_1", "child_2", ... up to "child_18". These were the hanging queries I had to kill -9:
@(all_childs) child_4 | child_5
@(all_childs) child_4 | child_5 @(all_childs) child_4 | child_5
(yes, duplicate expression)
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18
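For reference, the same kind of query expressed in SphinxQL would look like this (a sketch only; the reports above came in over the Sphinx protocol, and the table name listings is hypothetical):

SELECT id FROM listings WHERE MATCH('@(all_childs) child_4 | child_5');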
RT indices were present, but the queries ran against a non-RT index.
strace, started from htop, showed no syscall activity on the crashed worker processes.
These unspecific queries match a good portion of the 230k documents. Other, more specific queries did not trigger the failure.
After reading this issue, I set max_threads_per_query = 4 to lower my threads per query. ~No failing workers so far.~
UPDATE: this setting did not fix the issue for me.
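For anyone trying the same workaround: the directive belongs in the searchd section of the config. A minimal sketch, using the value 4 tried above:

searchd {
    ...
    max_threads_per_query = 4
}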
Hardware: AMD Ryzen 9 5950X 16-Core Processor, 128 GB RAM
OS: Debian Bookworm within a KVM VM, 16 vcores assigned (hyperthreading is enabled, so this is 16 of 32 possible vcores) and 16 GB RAM
Config:
max_connections = 100
expansion_limit = 500
seamless_rotate = 1
collation_libc_locale = de_DE.UTF-8
network_timeout = 5m
qcache_max_bytes = 0
searchd: Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Here's some data: journalctl.txt, show-status.txt, show-threads.txt, system-sessions.txt, vmstat-5.txt
@Korkman if you can stably reproduce it by running one of the @(all_childs) queries, could you share your table files and your config with us by sending them to our write-only S3 storage - https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data ? If we can reproduce this issue on our side, we'll be able to fix it.
Could you try using the head of the dev version, as it has fixes for the CPU limit during full-text queries?
@tomatolog @sanikolaev 6.2.13 a2af06ca3@240110 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204)
seems to work fine.
@tomatolog Would a workaround be possible in 6.2.12 or can this only be fixed with the release of 6.2.13?
You could set max_threads_per_query for full-text queries with multiple OR terms to keep CPU under control on 6.2.12, or use 6.2.13 from the dev version, which will soon be released into the main repository.
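If changing the server-wide setting is not desirable, Manticore also accepts a per-query thread limit via OPTION threads, which is the per-query counterpart of max_threads_per_query. A hedged sketch (the table name listings is hypothetical):

SELECT id FROM listings WHERE MATCH('@(all_childs) child_4 | child_5') OPTION threads=4;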
seems to work fine.
Thanks. I'm closing this issue then.
@starinacool feel free to reopen in case it doesn't work for you in the dev version or the upcoming release.
Describe the bug
2 of 16 worker threads go to 99.9% CPU when I try to change max_threads_per_query from 10 to 12 on a 16-core box. Even after removing all workload from the server these two threads keep consuming 99.9% CPU. The server cannot be stopped with systemctl stop manticore. Only kill -9 helps.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All worker threads working normally.
Describe the environment:
Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Messages from log files:
[Sun Nov 26 06:31:34.042 2023] [634140] caught SIGTERM, shutting down
[Sun Nov 26 06:31:39.550 2023] [634140] WARNING: still 2 alive tasks during shutdown, after 5.508 sec
[Sun Nov 26 06:31:39.701 2023] [634153] rt: table listing_finished: ramchunk saved in 0.150 sec
Additional context
Config:
optimize_cutoff = 8
max_threads_per_query = 10
access_doclists = mmap
access_hitlists = mmap
network_timeout = 20
client_timeout = 300
seamless_rotate = 1
unlink_old = 1
max_packet_size = 64M
max_filter_values = 65535
listen_backlog = 255
max_batch_queries = 32
subtree_docs_cache = 16M
subtree_hits_cache = 32M
binlog_flush = 2
binlog_max_log_size = 128M
expansion_limit = 100
query_log_format = sphinxql
collation_server = utf8_general_ci
collation_libc_locale = ru_RU.UTF-8
query_log_min_msec = 200
predicted_time_costs = doc=64, hit=48, skip=2048, match=64