manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.97k stars 498 forks

Setting max_threads_per_query = 12 leads to 99.9 CPU load for two threads on 16 core box #1631

Closed starinacool closed 9 months ago

starinacool commented 10 months ago

Describe the bug: 2 of 16 worker threads go to 99.9% CPU when I change max_threads_per_query from 10 to 12 on a 16-core box. Even after removing all workload from the server, these two threads keep consuming 99.9% CPU. The server cannot be stopped with systemctl stop manticore. Only kill -9 helps.

To Reproduce Steps to reproduce the behavior:

  1. Set up a 16-core, 32 GB, SSD box with an RT index
  2. Load some data
  3. Change to max_threads_per_query = 12, restart
  4. Add workload

Expected behavior: All worker threads working normally.

Describe the environment: Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)

Messages from log files:
[Sun Nov 26 06:31:34.042 2023] [634140] caught SIGTERM, shutting down
[Sun Nov 26 06:31:39.550 2023] [634140] WARNING: still 2 alive tasks during shutdown, after 5.508 sec
[Sun Nov 26 06:31:39.701 2023] [634153] rt: table listing_finished: ramchunk saved in 0.150 sec
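The shutdown warning in the log above is the visible symptom of the stuck workers. A minimal Python sketch for pulling such warnings out of a searchd log follows. It assumes only the message format shown in this report; other Manticore versions may word the line differently, and the function name `stuck_shutdowns` is made up for illustration.

```python
import re

# Pattern based solely on the warning line quoted in this issue (assumption:
# other versions may format it differently).
ALIVE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<tid>\d+)\] WARNING: still (?P<tasks>\d+) "
    r"alive tasks during shutdown, after (?P<sec>[\d.]+) sec"
)

def stuck_shutdowns(log_text):
    """Return one dict per 'still N alive tasks during shutdown' warning."""
    return [
        {
            "timestamp": m["ts"],
            "thread_id": int(m["tid"]),
            "tasks": int(m["tasks"]),
            "waited_sec": float(m["sec"]),
        }
        for m in ALIVE_RE.finditer(log_text)
    ]
```

Running it over the three log lines above yields a single entry reporting 2 alive tasks after 5.508 seconds.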

Additional context: Config:
optimize_cutoff = 8
max_threads_per_query = 10
access_doclists = mmap
access_hitlists = mmap
network_timeout = 20
client_timeout = 300
seamless_rotate = 1
unlink_old = 1
max_packet_size = 64M
max_filter_values = 65535
listen_backlog = 255
max_batch_queries = 32
subtree_docs_cache = 16M
subtree_hits_cache = 32M
binlog_flush = 2
binlog_max_log_size = 128M
expansion_limit = 100
query_log_format = sphinxql
collation_server = utf8_general_ci
collation_libc_locale = ru_RU.UTF-8
query_log_min_msec = 200
predicted_time_costs = doc=64, hit=48, skip=2048, match=64

sanikolaev commented 10 months ago

Even after removing all workload from the server these two threads keep consuming 99.9 CPU.

Please show the following at this moment:

Korkman commented 9 months ago

I have observed a similar failure. Workers would go to 100% CPU and the connection to the client would break (the client receives no response).

They were processing sphinx protocol requests querying a text field "all_childs" which can contain the words "child_1", "child_2", ... up to "child_18". These were the hanging queries I had to kill -9:

@(all_childs) child_4 | child_5 @(all_childs) child_4 | child_5 @(all_childs) child_4 | child_5 (yes, duplicate expression) @(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18 @(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18 @(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18

RT indices were present, but the queries ran against a non-RT index.

strace started from htop showed no syscall activity on the crashed worker processes.
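A thread that burns CPU without making any syscalls is usually spinning in user space. To identify which threads those are without htop, a rough Linux-only sketch in Python can sample per-thread CPU ticks from /proc twice and report the threads that stayed busy. The function names, the one-second interval, and the 0.9 threshold are all illustrative choices, not anything Manticore provides.

```python
import os
import time

def parse_stat_line(line):
    """Return (comm, utime + stime in clock ticks) from a /proc stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    comm = line[line.index("(") + 1 : line.rindex(")")]
    rest = line[line.rindex(")") + 2 :].split()
    # rest[0] is field 3 (state), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return comm, int(rest[11]) + int(rest[12])

def thread_ticks(pid):
    """Map thread id -> (comm, cpu ticks) for every thread of a process."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                ticks[int(tid)] = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir() and open()
    return ticks

def busy_threads(pid, interval=1.0, threshold=0.9):
    """Return (tid, comm, cpu_share) for threads using >= threshold of a core."""
    hz = os.sysconf("SC_CLK_TCK")
    before = thread_ticks(pid)
    time.sleep(interval)
    after = thread_ticks(pid)
    busy = []
    for tid, (comm, t1) in after.items():
        if tid in before:
            share = (t1 - before[tid][1]) / hz / interval
            if share >= threshold:
                busy.append((tid, comm, share))
    return busy
```

Calling `busy_threads(pid)` with the searchd process ID while the server is otherwise idle would list the spinning workers, which can then be inspected further, for example with a debugger backtrace.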

These unspecific queries match a good portion of 230k documents. Other, more specific queries did not crash.

After reading this issue, I set max_threads_per_query = 4 to lower my threads per query. ~No failing workers so far.~ UPDATE: this setting did not fix the issue for me.

Hardware: AMD Ryzen 9 5950X 16-Core Processor, 128 GB RAM
OS: Debian Bookworm within a KVM VM, 16 vcores assigned (hyperthreading is enabled, so this is 16 of 32 possible vcores) and 16 GB RAM
Config:

max_connections = 100
expansion_limit = 500
seamless_rotate = 1
collation_libc_locale = de_DE.UTF-8
network_timeout = 5m
qcache_max_bytes = 0

searchd: Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)

Korkman commented 9 months ago

Here's some data: grafik (image), journalctl.txt, show-status.txt, show-threads.txt, system-sessions.txt, vmstat-5.txt

sanikolaev commented 9 months ago

@Korkman if you can stably reproduce it by running one of the @(all_childs) queries, could you share your table files and your config with us by sending them to our write-only s3 storage - https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data ? If we can reproduce this issue on our side, we'll be able to fix it.

tomatolog commented 9 months ago

Could you try the head of the dev version? It has fixes for the CPU limit during FT queries.

Korkman commented 9 months ago

@tomatolog @sanikolaev 6.2.13 a2af06ca3@240110 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204) seems to work fine.

@tomatolog Would a workaround be possible in 6.2.12 or can this only be fixed with the release of 6.2.13?

tomatolog commented 9 months ago

You could set max_threads_per_query for full-text queries with multiple OR terms to keep CPU under control on 6.2.12, or use 6.2.13, as the dev version will soon be released into the main repository.
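One way to read this advice is to cap threads only for the affected queries rather than server-wide. Manticore's SQL interface documents a per-query `OPTION threads=N` for this. The Python sketch below just builds such a statement as a string. The helper name `with_thread_cap` and the table name are hypothetical, and nothing in this thread confirms that the cap avoids the hang on 6.2.12 (the global setting of 4 did not help in the report above).

```python
def with_thread_cap(table, match_expr, threads):
    """Build a SELECT that limits the number of threads for this one query.

    `table` is a hypothetical table name supplied by the caller; only the
    MATCH() expression is escaped here, so do not pass untrusted table names.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    # Escape backslashes and single quotes for the SQL string literal.
    escaped = match_expr.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT * FROM {table} "
        f"WHERE MATCH('{escaped}') "
        f"OPTION threads={int(threads)}"
    )
```

The resulting statement can be sent over any MySQL-protocol client connected to searchd. Clients using the binary sphinx protocol, as in the report above, would need a different mechanism.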

sanikolaev commented 9 months ago

seems to work fine.

Thanks. I'm closing this issue then.

@starinacool feel free to reopen in case it doesn't work for you in the dev version or the upcoming release.