Hi. Are you using our docker image and the default config, or something else?
I'm using your build image for building, except it's alpine-3.7 instead of 3.6; then I take the binaries and move them into my own alpine image together with my config. Yes, I'm on Slack now if we can chat.
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Manticore 2.7.3 ad9b22d0@180928 release
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2001-2016, Andrew Aksyonoff
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2017-2018, Manticore Software LTD (http://manticoresearch.com)
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: using config file '/etc/sphinxsearch/sphinx.conf'...
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: WARNING: TCP fast open unavailable (can't read /proc/sys/net/ipv4/tcp_fastopen, unsuppo
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: listening on all interfaces, port=9312
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: listening on all interfaces, port=9212
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_0'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_0.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_0.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_1'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_1.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_1.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_2'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: crond: crond (busybox 1.27.2) started, log level 6
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_2.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_2.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_3'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_3.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_3.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precached 4 indexes in 0.007 sec
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'binlog_max_log_size' - nothing specified, using default value 268435456
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'read_unhinted' - nothing specified, using default value 0
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'read_buffer' - nothing specified, using default value 0
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: FATAL: failed to create rotation thread
Cured by
vm.overcommit_memory=1 instead of vm.overcommit_memory=2
https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
P.S. Be careful with memory usage: even if it says 90% free, sphinx tries to allocate a lot on start, I guess. The error message could be improved, though.
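For anyone hitting the same thing, this is a minimal sketch of the change on the EC2 host (the sysctl.d file name is just an example, pick your own):

# switch the overcommit policy at runtime
sysctl -w vm.overcommit_memory=1

# persist it across reboots
echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/99-overcommit.conf
sysctl --system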
What AWS instance type do you use? How much memory does it have? What are your config and index sizes?
AWS t2.small - VPC
total used free shared buff/cache available
Mem: 1993 218 1304 69 470 1418
Swap: 0 0 0
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
1e738da8XXXX 4.32% 144.6MiB / 1.947GiB 7.25% 0B / 0B 12.5MB / 71.2MB 4107
And this is a beta box: it uses 144 MiB while having 0% real load (not production). The production instance with a million rows uses only 11.29 MiB (docker ps), even under heavy load (4 CPUs, 8 GB RAM, ~50% load).
Changing workers from thread_pool to threads reduces memory usage to previous levels
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
53948e4ca4a8 2.19% 4.727MiB / 1.947GiB 0.24% 0B / 0B 32.8kB / 12.3kB 12
workers = thread_pool gives 4105 PIDs; workers = threads gives ~10 PIDs.
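For reference, the only change was the workers mode in the searchd section of sphinx.conf (a sketch with just the relevant line; the rest of my config is unchanged):

searchd
{
    ...
    # workers = thread_pool   # ~4105 PIDs, ~144 MiB at idle on this box
    workers = threads          # ~10 PIDs, ~5 MiB at idle
    ...
}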
Reproducing the memory issue
First with workers = threads, sysctl vm.overcommit_memory=2:
[root@ip-172-30-0-217 ec2-user]# free -m
total used free shared buff/cache available
Mem: 1993 131 1444 69 417 1554
indexes list
/var/lib/sphinx $ ls -lhS
total 248
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_0.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_1.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_2.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_3.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_delta_0.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_delta_1.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_delta_2.sph
-rw-r--r-- 1 dalek dalek 2.0K Oct 31 10:06 stream_test_delta_3.sph
-rw-r--r-- 1 dalek dalek 1.1K Oct 31 10:06 stream_test_0.spa
-rw-r--r-- 1 dalek dalek 968 Oct 31 10:06 stream_test_1.spa
-rw-r--r-- 1 dalek dalek 968 Oct 31 10:06 stream_test_2.spa
-rw-r--r-- 1 dalek dalek 834 Oct 31 10:06 stream_test_3.sps
-rw-r--r-- 1 dalek dalek 792 Oct 31 10:06 stream_test_3.spa
-rw-r--r-- 1 dalek dalek 773 Oct 31 10:06 stream_test_2.sps
-rw-r--r-- 1 dalek dalek 754 Oct 31 10:06 stream_test_0.sps
-rw-r--r-- 1 dalek dalek 720 Oct 31 10:06 stream_test_0.spm
-rw-r--r-- 1 dalek dalek 692 Oct 31 10:06 stream_test_2.spm
-rw-r--r-- 1 dalek dalek 616 Oct 31 10:06 stream_test_delta_2.spa
-rw-r--r-- 1 dalek dalek 588 Oct 31 10:06 stream_test_1.spm
total used free shared buff/cache available
Mem: 1993 129 1446 69 417 1556
Everything started OK
Now changing to workers = thread_pool, still with sysctl vm.overcommit_memory=2.
It immediately fails (while having 1.4 GB free), with the indexes taking only a few megabytes:
total used free shared buff/cache available
Mem: 1993 120 1454 68 418 1565
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Manticore 2.7.3 ad9b22d0@180928 release
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2001-2016, Andrew Aksyonoff
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2017-2018, Manticore Software LTD (http://manticoresearch.com)
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: using config file '/etc/sphinxsearch/sphinx.conf'...
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: listening on all interfaces, port=9312
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: listening on all interfaces, port=9212
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_0'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_0.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_0.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_1'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_1.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_1.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_2'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_2.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: crond: crond (busybox 1.27.2) started, log level 6
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_2.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_3'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_3.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_3.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precached 4 indexes in 0.006 sec
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'binlog_max_log_size' - nothing specified, using default value 268435456
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'read_unhinted' - nothing specified, using default value 0
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'read_buffer' - nothing specified, using default value 0
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: FATAL: failed to create rotation thread
Oct 31 10:12:01 ip-172-30-0-217.ec2.internal systemd[1]: Failed to start Docker
There is nothing running inside the container except searchd and cron (for reindexing). I suspect max_children=4096 might be related to the 4107 PIDs running.
max_children for thread_pool workers sets the number of workers the daemon creates in the thread pool. You should set a value there that matches your box, or comment the option out so the daemon auto-calculates it.
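Something like this (a sketch; the explicit value is just an illustration, not a recommendation):

searchd
{
    ...
    workers = thread_pool
    # either comment max_children out and let the daemon pick the pool size
    # max_children = 4096
    # or set a value that matches the box, e.g. a small multiple of the CPU count
    max_children = 8
    ...
}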
Main docker (4k PIDs, clients do not connect to it directly):
...
max_children = 4096
...
index stream_test
{
type = distributed
local = stream_test_0
local = stream_test_1
local = stream_test_2
local = stream_test_3
}
Relay docker (all clients connect via this instance; it's actually on the same 127.0.0.1 as the clients). 172.30.0.217 is the main docker:
...
max_children = 4096
...
index stream_test
{
type = distributed
agent_connect_timeout = 1500
agent_query_timeout = 3000
agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_0
agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_1
agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_2
agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_3
}
The PIDs issue is related to max_children.
max_children=4096 docker stats PIDS
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
89a37248b868 2.19% 138.4MiB / 1.947GiB 6.94% 0B / 0B 0B / 177kB 4105
max_children=1096 docker stats PIDS
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
34ecf5fac5bf 4.38% 40.46MiB / 1.947GiB 2.03% 0B / 0B 0B / 26.1kB 1107
But when doing ps aux, I don't see those PIDs running (neither on the host nor in the container):
Docker version 17.06.2-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
/ $ ps aux
PID USER TIME COMMAND
1 dalek 0:03 /usr/bin/searchd --nodetach --index stream_test --index stream_test_0 --index stream_test_1 --index stream_test_2 --index st
15 root 0:00 crond -l 6 -d 6 -b -L /dev/console
1292 dalek 0:00 /bin/sh
1296 dalek 0:00 ps aux
/ $ exit
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
34ecf5fac5bf 0.01% 40.63MiB / 1.947GiB 2.04% 0B / 0B 0B / 165kB 1105
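My guess, so treat it as an assumption: the PIDS column in docker stats comes from the pids cgroup, which counts kernel tasks, i.e. threads as well as processes, so the searchd worker threads are counted even though busybox ps only lists processes. A quick check from inside the container:

# thread count of the searchd process (PID 1 in this container)
grep Threads /proc/1/status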
@tomatolog not specifying max_children fixes both the memory issue on start and the PIDS count. Maybe this should be mentioned in the documentation; on start it probably tries to create all max_children threads (and runs out of memory with vm.overcommit_memory=2).
The documentation could be fixed, or maybe there is an issue in the code, in that it tries to create max_children threads on start.
For workers = thread_pool the daemon always creates N threads for the thread pool; N is taken from max_children when that option is set, and auto-calculated for the box otherwise. That is why, when you explicitly set that option, the thread pool gets started with 4096 threads (max_children = 4096), i.e. the daemon creates such a thread pool for you.
That is why I see no issue here: just comment out that option or set a value that fits your box.
You might also create a PR with a doc fix for that section https://manticoresearch.gitlab.io/dev/conf_options_reference/searchd_program_configuration_options.html#max-children to make this option clearer.
Created PR https://github.com/manticoresoftware/manticoresearch/pull/127
Thx, closing
Are you in the right place?
For general questions, post a question to the Forum or Slack.
If you open a GitHub issue, it must be a bug or feature request and you need to provide the information below.
Describe the environment
Amazon Linux 2
Manticore Search version: 2.7.3
OS version: Linux ip-172-30-0-217.ec2.internal 4.14.62-70.117.amzn2.x86_64 #1 SMP Fri Aug 10 20:14:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Describe the problem
When using workers = thread_pool I get this error; with workers = threads everything is fine. I can't reproduce it locally, only on AWS EC2 inside Docker. There are no other containers, and no Docker settings make any difference (ports, memory, host network).
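The relevant part of the config, as a sketch using only the options already mentioned in this report (everything else omitted):

searchd
{
    ...
    listen = 9312
    listen = 9212
    workers = thread_pool
    max_children = 4096
    ...
}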
Description of the issue:
Steps to reproduce:
Messages from log files: FATAL: failed to create rotation thread