manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Indexer crashes on merging main and delta indexes #1578

Open Grabien opened 10 months ago

Grabien commented 10 months ago

Describe the bug

I have two indexes: main, which contains data from a large number of transcripts, and delta, which only contains fresh data from the last 24 hours. A cron script merges the data from the delta index into the main index every night. Sometimes the merging process stops working because the indexer crashes during it. To fix this issue, I have to completely rebuild the main index. However, after 3-5 days, the crash occurs again.

To Reproduce

Steps to reproduce the behavior:

  1. Download transcripts and transcripts_delta indexes via the link below (warning, the size is 71.6 GB) https://cloud.grabien.com/s/JbbZMSqQrtNRDwf
  2. Run the command: /usr/bin/indexer --merge transcripts transcripts_delta --rotate

Describe the environment:

Messages from log files:

su - manticore -s /bin/bash -c "/usr/bin/indexer --merge transcripts transcripts_delta --rotate"
su: warning: cannot change directory to /home/manticore: No such file or directory
Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/etc/manticoresearch/manticore.conf'...
merging table 'transcripts_delta' into table 'transcripts'...
*** Oops, indexer crashed! Please send the following report to developers.
Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
-------------- report begins here ---------------
Current document: docid=0, hits=0
Current batch: minid=0, maxid=0
Hit pool start: docid=0, hit=0
-------------- backtrace begins here ---------------
Program compiled with Clang 15.0.7
Configured with flags: Configured with these definitions: -DDISTR_BUILD=bullseye -DUSE_SYSLOG=1 -DWITH_GALERA=1 -DWITH_RE2=1 -DWITH_RE2_FORCE_STATIC=1 -DWITH_STEMMER=1 -DWITH_STEMMER_FORCE_STATIC=1 -DWITH_NLJSON=1 -DWITH_UNIALGO=1 -DWITH_ICU=1 -DWITH_ICU_FORCE_STATIC=1 -DWITH_SSL=1 -DWITH_ZLIB=1 -DWITH_ZSTD=1 -DDL_ZSTD=1 -DZSTD_LIB=libzstd.so.1 -DWITH_CURL=1 -DDL_CURL=1 -DCURL_LIB=libcurl.so.4 -DWITH_ODBC=1 -DDL_ODBC=1 -DODBC_LIB=libodbc.so.2 -DWITH_EXPAT=1 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DWITH_ICONV=1 -DWITH_MYSQL=1 -DDL_MYSQL=1 -DMYSQL_LIB=libmariadb.so.3 -DWITH_POSTGRESQL=1 -DDL_POSTGRESQL=1 -DPOSTGRESQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/lib/manticore -DFULL_SHARE_DIR=/usr/share/manticore
Built on Linux x86_64 (bullseye) (cross-compiled)
Stack bottom = 0x0, thread stack size = 0x20000
Trying system backtrace:
begin of system symbols:
/usr/bin/indexer(_Z12sphBacktraceib+0x22a)[0x5561e90a48ca]
/usr/bin/indexer(_Z7sigsegvi+0xbb)[0x5561e8fa059b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f6458325140]
/usr/bin/indexer(_ZN14CSphDictReaderILb1EE4ReadEv+0xe)[0x5561e907f00e]
/usr/bin/indexer(_ZN13CSphIndex_VLN10MergeWordsI16DiskIndexQword_cILb1ELb0EES2_EEbPKS_S4_11VecTraits_TIjES6_P14CSphHitBuilderR10CSphStringR17CSphIndexProgress+0x6a8)[0x5561e905fc88]
/usr/bin/indexer(_ZN13CSphIndex_VLN7DoMergeEPKS_S1_PK10ISphFilterR10CSphStringR17CSphIndexProgressbb+0x642)[0x5561e8fc2ff2]
/usr/bin/indexer(_ZN13CSphIndex_VLN5MergeEP9CSphIndexRK11VecTraits_TI18CSphFilterSettingsEbR17CSphIndexProgress+0x110)[0x5561e8fc2280]
/usr/bin/indexer(_Z7DoMergeRK17CSphConfigSectionPKcS1_S3_RN3sph8Vector_TI18CSphFilterSettingsNS4_13DefaultCopy_TIS6_EENS4_14DefaultRelimitENS4_16DefaultStorage_TIS6_EEEEbb+0xb13)[0x5561e8f9fb13]
/usr/bin/indexer(main+0x31a1)[0x5561e8fa3e51]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f6458161d0a]
/usr/bin/indexer(_start+0x2a)[0x5561e8f94f0a]
Trying boost backtrace:
 0# sphBacktrace(int, bool) in /usr/bin/indexer
 1# sigsegv(int) in /usr/bin/indexer
 2# 0x00007F6458325140 in /lib/x86_64-linux-gnu/libpthread.so.0
 3# CSphDictReader<true>::Read() in /usr/bin/indexer
 4# bool CSphIndex_VLN::MergeWords<DiskIndexQword_c<true, false>, DiskIndexQword_c<true, false> >(CSphIndex_VLN const*, CSphIndex_VLN const*, VecTraits_T<unsigned int>, VecTraits_T<unsigned int>, CSphHitBuilder*, CSphString&, CSphIndexProgress&) in /usr/bin/indexer
 5# CSphIndex_VLN::DoMerge(CSphIndex_VLN const*, CSphIndex_VLN const*, ISphFilter const*, CSphString&, CSphIndexProgress&, bool, bool) in /usr/bin/indexer
 6# CSphIndex_VLN::Merge(CSphIndex*, VecTraits_T<CSphFilterSettings> const&, bool, CSphIndexProgress&) in /usr/bin/indexer
 7# DoMerge(CSphConfigSection const&, char const*, CSphConfigSection const&, char const*, sph::Vector_T<CSphFilterSettings, sph::DefaultCopy_T<CSphFilterSettings>, sph::DefaultRelimit, sph::DefaultStorage_T<CSphFilterSettings> >&, bool, bool) in /usr/bin/indexer
 8# main in /usr/bin/indexer
 9# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
10# _start in /usr/bin/indexer

-------------- backtrace ends here ---------------
Please, create a bug report in our bug tracker (https://github.com/manticoresoftware/manticore/issues)
and attach there:
a) searchd log, b) searchd binary, c) searchd symbols.
Look into the chapter 'Reporting bugs' in the manual
(https://manual.manticoresearch.com/Reporting_bugs)
Dump with GDB via watchdog

UPDATE 2024 Feb 13

MRE is here https://github.com/manticoresoftware/manticoresearch/issues/1578#issuecomment-1926477102

tomatolog commented 10 months ago

Could you check your data with the following commands

indextool --check transcripts
indextool --check transcripts_delta 

and provide the output of both commands?

Grabien commented 10 months ago

This command displays some errors for the "transcripts" index:

indextool --check transcripts

Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/etc/manticoresearch/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, wrong word-delta (pos=1, word=, len=0, begin=193, delta=112)
FAILED, empty word in dictionary (pos=1)
FAILED, wrong word-delta (pos=124, word=, len=0, begin=9, delta=1)
FAILED, empty word in dictionary (pos=124)
FAILED, wrong word-delta (pos=132, word=, len=0, begin=10, delta=1)
FAILED, empty word in dictionary (pos=132)
FAILED, wrong word-delta (pos=140, word=, len=0, begin=11, delta=1)
FAILED, empty word in dictionary (pos=140)
FAILED, wrong word-delta (pos=151, word=, len=0, begin=12, delta=2)
FAILED, empty word in dictionary (pos=151)
FAILED, wrong word-delta (pos=160, word=, len=0, begin=14, delta=2)
FAILED, empty word in dictionary (pos=160)
FAILED, wrong word-delta (pos=169, word=, len=0, begin=16, delta=1)
FAILED, empty word in dictionary (pos=169)
FAILED, wrong word-delta (pos=178, word=, len=0, begin=17, delta=1)
FAILED, empty word in dictionary (pos=178)
FAILED, wrong word-delta (pos=187, word=, len=0, begin=18, delta=3)
FAILED, empty word in dictionary (pos=187)
FAILED, wrong word-delta (pos=198, word=, len=0, begin=21, delta=3)
FAILED, empty word in dictionary (pos=198)
FAILED, wrong word-delta (pos=209, word=, len=0, begin=24, delta=2)
FAILED, empty word in dictionary (pos=209)
FAILED, wrong word-delta (pos=219, word=, len=0, begin=26, delta=6)
FAILED, empty word in dictionary (pos=219)
FAILED, wrong word-delta (pos=233, word=, len=0, begin=28, delta=4)
FAILED, empty word in dictionary (pos=233)
FAILED, wrong word-delta (pos=245, word=, len=0, begin=21, delta=1)
FAILED, empty word in dictionary (pos=245)
FAILED, wrong word-delta (pos=254, word=, len=0, begin=20, delta=14)
FAILED, empty word in dictionary (pos=254)
FAILED, wrong word-delta (pos=276, word=, len=0, begin=19, delta=1)
FAILED, empty word in dictionary (pos=276)
FAILED, wrong word-delta (pos=285, word=, len=0, begin=16, delta=17)
FAILED, empty word in dictionary (pos=285)
FAILED, wrong word-delta (pos=310, word=, len=0, begin=15, delta=1)
FAILED, empty word in dictionary (pos=310)
FAILED, wrong word-delta (pos=318, word=, len=0, begin=15, delta=4)
FAILED, empty word in dictionary (pos=318)
FAILED, wrong word-delta (pos=329, word=, len=0, begin=14, delta=2)
FAILED, empty word in dictionary (pos=329)
FAILED, wrong word-delta (pos=338, word=, len=0, begin=13, delta=3)
FAILED, empty word in dictionary (pos=338)
FAILED, wrong word-delta (pos=348, word=, len=0, begin=12, delta=6)
FAILED, empty word in dictionary (pos=348)
FAILED, wrong word-delta (pos=361, word=, len=0, begin=12, delta=1)
FAILED, empty word in dictionary (pos=361)
FAILED, wrong word-delta (pos=369, word=, len=0, begin=12, delta=6)
FAILED, empty word in dictionary (pos=369)
FAILED, wrong word-delta (pos=382, word=, len=0, begin=11, delta=4)
FAILED, empty word in dictionary (pos=382)
FAILED, wrong word-delta (pos=393, word=, len=0, begin=15, delta=4)
FAILED, empty word in dictionary (pos=393)
FAILED, wrong word-delta (pos=404, word=, len=0, begin=11, delta=9)
FAILED, empty word in dictionary (pos=404)
FAILED, wrong word-delta (pos=421, word=, len=0, begin=11, delta=13)
FAILED, empty word in dictionary (pos=421)
FAILED, wrong word-delta (pos=442, word=, len=0, begin=11, delta=17)
FAILED, empty word in dictionary (pos=442)
FAILED, wrong word-delta (pos=467, word=, len=0, begin=10, delta=1)
FAILED, empty word in dictionary (pos=467)
FAILED, wrong word-delta (pos=475, word=, len=0, begin=10, delta=18)
FAILED, empty word in dictionary (pos=475)
FAILED, wrong word-delta (pos=501, word=, len=0, begin=10, delta=12)
FAILED, empty word in dictionary (pos=501)
FAILED, wrong word-delta (pos=521, word=, len=0, begin=10, delta=12)
FAILED, empty word in dictionary (pos=521)
FAILED, wrong word-delta (pos=541, word=, len=0, begin=10, delta=8)
FAILED, empty word in dictionary (pos=541)
FAILED, wrong word-delta (pos=556, word=, len=0, begin=10, delta=5)
FAILED, empty word in dictionary (pos=556)
FAILED, wrong word-delta (pos=568, word=, len=0, begin=9, delta=1)
FAILED, empty word in dictionary (pos=568)
FAILED, wrong word-delta (pos=576, word=, len=0, begin=10, delta=2)
FAILED, empty word in dictionary (pos=576)
FAILED, wrong word-delta (pos=585, word=, len=0, begin=8, delta=1)
FAILED, empty word in dictionary (pos=585)
FAILED, wrong word-delta (pos=593, word=, len=0, begin=8, delta=4)
FAILED, empty word in dictionary (pos=593)
FAILED, wrong word-delta (pos=604, word=, len=0, begin=8, delta=2)
FAILED, empty word in dictionary (pos=604)
FAILED, wrong word-delta (pos=613, word=, len=0, begin=8, delta=1)
FAILED, empty word in dictionary (pos=613)
FAILED, wrong word-delta (pos=621, word=, len=0, begin=8, delta=8)
FAILED, empty word in dictionary (pos=621)
FAILED, wrong word-delta (pos=636, word=, len=0, begin=8, delta=3)
FAILED, empty word in dictionary (pos=636)
FAILED, wrong word-delta (pos=646, word=, len=0, begin=10, delta=4)
FAILED, empty word in dictionary (pos=646)
FAILED, wrong word-delta (pos=657, word=, len=0, begin=8, delta=4)
FAILED, empty word in dictionary (pos=657)
FAILED, wrong word-delta (pos=668, word=, len=0, begin=12, delta=16)
FAILED, empty word in dictionary (pos=668)
FAILED, wrong word-delta (pos=692, word=, len=0, begin=8, delta=8)
FAILED, empty word in dictionary (pos=692)
FAILED, wrong word-delta (pos=707, word=, len=0, begin=10, delta=6)
FAILED, empty word in dictionary (pos=707)
FAILED, wrong word-delta (pos=720, word=, len=0, begin=11, delta=5)
FAILED, empty word in dictionary (pos=720)
FAILED, wrong word-delta (pos=732, word=, len=0, begin=12, delta=4)
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check FAILED, 99 of 166363 failures reported, 716.0 sec elapsed

indextool --check transcripts_delta

Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/etc/manticoresearch/manticore.conf'...
checking table 'transcripts_delta'...
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 0.9 sec elapsed
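
For scripting, the two outcomes above can be told apart by the final summary line. Below is a minimal Python sketch, assuming the output format is exactly what appears in this thread (`check passed, ...` or `check FAILED, N [of M] failures reported, ...`). `summarize_check` and `CheckSummary` are hypothetical names, not part of Manticore.

```python
import re
from typing import NamedTuple


class CheckSummary(NamedTuple):
    passed: bool
    failures: int   # total failures according to the summary line
    details: list   # the individual "FAILED, ..." lines that were printed


# Summary formats seen in this thread (assumed, not taken from the manual):
#   check passed, 0.9 sec elapsed
#   check FAILED, 1 failures reported, 820.2 sec elapsed
#   check FAILED, 99 of 166363 failures reported, 716.0 sec elapsed
_SUMMARY_FAILED = re.compile(r"^check FAILED, (\d+)(?: of (\d+))? failures reported")


def summarize_check(output: str) -> CheckSummary:
    """Classify captured `indextool --check` output as passed or failed."""
    details = []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("FAILED, "):
            details.append(line)
        elif line.startswith("check passed"):
            return CheckSummary(True, 0, details)
        else:
            match = _SUMMARY_FAILED.match(line)
            if match:
                # "99 of 166363" means 99 shown out of 166363 in total.
                total = int(match.group(2) or match.group(1))
                return CheckSummary(False, total, details)
    # No summary line at all (for example, the tool aborted): not passed.
    return CheckSummary(False, len(details), details)
```

A nightly script could call this on the captured output and skip the merge, or raise an alert, whenever `passed` is false.
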
tomatolog commented 10 months ago

It seems your main index is invalid and should be reindexed from scratch.

Grabien commented 10 months ago

Yes, after completely rebuilding the "transcripts" index, the merging with the delta index works well for 3-5 days. But then the same issue happens again: the index is broken, merging is crashing, and I have to start over.

tomatolog commented 10 months ago

Then you need to change the pipeline to back up the indexes before the merge operation, and then run the merge as:

sudo -u manticore indexer --rotate --nohup --merge main delta
indextool --rotate --check main

This way indexer merges the data but does not send a signal to the daemon. Then indextool checks the main index and sends the signal to the daemon only if the index is valid.

That way, once main becomes invalid, you can provide the main and delta indexes you backed up, so we can investigate how the merge creates a bad index.
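
The backup step could look like the following Python sketch. It assumes that all files of a plain table share the `path` prefix from the config (for example `/var/lib/manticore/transcripts.sph`). `backup_table` and the backup directory are hypothetical, not a Manticore tool.

```python
import glob
import os
import shutil


def backup_table(path_prefix: str, backup_dir: str) -> list:
    """Copy every file named <path_prefix>.<ext> into backup_dir.

    Assumption: all files of one table share the prefix given as `path`
    in the config, so "transcripts" matches transcripts.sph but not
    transcripts_delta.sph.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(glob.escape(path_prefix) + ".*")):
        if os.path.isfile(src):
            # copy2 keeps timestamps, which helps when comparing backups later.
            copied.append(shutil.copy2(src, backup_dir))
    return copied
```

Running it for both tables right before the merge, and keeping the copies until the check passes, would preserve the exact inputs of a merge that produces a bad index.
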

Grabien commented 10 months ago

I'm sorry, I don't quite understand what my merging script should look like. I changed it to the following:

/usr/bin/indexer --merge transcripts transcripts_delta --rotate --nohup
/usr/bin/indextool --rotate --check transcripts

But when I run it, I receive these errors:

WARNING: Index header format is not json, will try it as binary...
WARNING: Unable to load header... Error failed to open /var/lib/manticore/transcripts.tmp.sph: No such file or directory
FATAL: table 'transcripts': prealloc failed: failed to open /var/lib/manticore/transcripts.tmp.sph: No such file or directory
tomatolog commented 10 months ago

There is an example of the --nohup option in the indexer CLI section of our manual, and it shows the same command sequence.

I need to check this case myself. I was sure it would work the way the manual describes.

Grabien commented 10 months ago

Yesterday, merging failed again. Below is the link to the full set of indexes: old main index, delta index, and temp files of an incomplete merge. I hope it will be useful to find the reason for this crashing. Please let me know if I can provide any other additional information. This bug is very annoying.

https://cloud.grabien.com/s/HrAgAMYA3iBDL5t (115 GB!)

sanikolaev commented 10 months ago

Thanks. I've started downloading the archive on our dev server.

sanikolaev commented 10 months ago

@PavelShilin89 pls try to reproduce the issue on dev2. Once downloaded (in a couple of hours), the archive will be at /home/snikolaev/indexes.zip.

sanikolaev commented 10 months ago

Below is the link to the full set of indexes: old main index, delta index, and temp files of an incomplete merge

Can we have your config too please?

Grabien commented 10 months ago

This is our configuration file for these indexes:

searchd
{
  listen = 9312:sphinx
  listen = 9306:mysql41
  listen = 9308:http
  log = /var/log/manticore/searchd.log
  query_log = /var/log/manticore/query.log
  pid_file = /var/run/manticore/searchd.pid
  query_log_format = sphinxql
  network_timeout = 30
}

indexer
{
  max_file_field_buffer = 16M
  mem_limit = 1024M
}

source database
{
  type = mysql
  sql_host = ...
  sql_user = ...
  sql_pass = ...
  sql_db = ...
  sql_query_pre = set names utf8
  sql_query_pre = set character set utf8
  sql_query_pre = set session long_query_time = 600
  sql_query_pre = set session wait_timeout = 600
}

source transcripts : database
{
  sql_query_pre = update sm_sphinxcounters set lastid = (select max(id) from sm_transcripts) where indexname = 'transcripts'
  sql_query_range = select min(id), max(id) from sm_transcripts
  sql_range_step = 5000
  sql_file_field = filename
  sql_query = select id, id as docid, title, concat('/mnt/mirror/media/transcripts/', lpad(floor(id / 1000), 4, '0'), '/', id, '.', format) as filename from sm_transcripts where id >= $start and id <= $end and status = 'Active'
  sql_attr_uint = docid
}

source transcripts_delta : database
{
  sql_file_field = filename
  sql_query = select id, id as docid, title, concat('/mnt/mirror/media/transcripts/', lpad(floor(id / 1000), 4, '0'), '/', id, '.', format) as filename from sm_transcripts where id > (select lastid from sm_sphinxcounters where indexname = 'transcripts') and status = 'Active'
  sql_attr_uint = docid
}

index transcripts
{
  source = transcripts
  path = /var/lib/manticore/transcripts
}

index transcripts_delta
{
  source = transcripts_delta
  path = /var/lib/manticore/transcripts_delta
}
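
The config above implements a main+delta scheme with a counter table: indexing `transcripts` first stores `max(id)` in `sm_sphinxcounters`, and `transcripts_delta` then selects only rows with a larger id. Below is a small Python sketch of that logic, using an in-memory SQLite database as a stand-in for MySQL. The sample rows and the helper names (`make_db`, `index_main`, `index_delta`) are invented for illustration, and the ranged fetch (`sql_query_range`, `sql_range_step`) is left out.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a tiny stand-in for the two MySQL tables used by the config."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table sm_transcripts (id integer primary key, title text, status text);
        create table sm_sphinxcounters (indexname text primary key, lastid integer);
        insert into sm_sphinxcounters values ('transcripts', 0);
        insert into sm_transcripts values (1, 'a', 'Active');
        insert into sm_transcripts values (2, 'b', 'Active');
        insert into sm_transcripts values (3, 'c', 'Inactive');
    """)
    return conn


def index_main(conn: sqlite3.Connection) -> list:
    """Mimic the `transcripts` source: bump the counter, then read active rows."""
    conn.execute(
        "update sm_sphinxcounters "
        "set lastid = (select max(id) from sm_transcripts) "
        "where indexname = 'transcripts'"
    )
    rows = conn.execute(
        "select id from sm_transcripts where status = 'Active' order by id"
    )
    return [r[0] for r in rows]


def index_delta(conn: sqlite3.Connection) -> list:
    """Mimic the `transcripts_delta` source: only rows newer than the counter."""
    rows = conn.execute(
        "select id from sm_transcripts "
        "where id > (select lastid from sm_sphinxcounters "
        "            where indexname = 'transcripts') "
        "and status = 'Active' order by id"
    )
    return [r[0] for r in rows]
```

After `index_main`, the counter holds the highest id seen so far, so rows inserted later are picked up only by `index_delta` until the main table is rebuilt.
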
sanikolaev commented 10 months ago

The delta table is ok:

snikolaev@dev2:~/115GB$ indextool -c manticore.conf --check transcripts_delta
Manticore 6.2.13 01c4e054a@231103 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/snikolaev/115GB/manticore.conf'...
checking table 'transcripts_delta'...
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 2.2 sec elapsed

But the larger one is corrupted:

snikolaev@dev2:~/115GB$ indextool -c manticore.conf --check transcripts
Manticore 6.2.13 01c4e054a@231103 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/snikolaev/115GB/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, invalid docs/hits (pos=12, word=00, docs=1756020, hits=-2134759148)
checking data...
FAILED, rowid out of bounds (wordid=0(0), rowid=6580323)
FAILED, hit entries sorting order decreased (wordid=0(0), rowid=0, hit=16777592, last=16780657)
FAILED, hit decreased (wordid=0(0), rowid=0, hit=376, last=3441)
FAILED, rowid out of bounds (wordid=0(0), rowid=16843009)
FAILED, hit entries sorting order decreased (wordid=0(0), rowid=16843009, hit=16778765, last=16781723)
FAILED, hit decreased (wordid=0(0), rowid=16843009, hit=1549, last=4507)
FAILED, hit entries sorting order decreased (wordid=0(0), rowid=16843009, hit=16784701, last=16786074)
FAILED, hit decreased (wordid=0(0), rowid=16843009, hit=7485, last=8858)

This is a likely reason why indexer --merge failed. Can you please:

?

Grabien commented 10 months ago

Okay, I will do it manually every day. If the merging fails, I will send you all the indexes again.

Grabien commented 10 months ago

I recreated the main index from scratch and ran indextool afterwards. I immediately see one failed item in the output. Does this mean that the index is already broken?

Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/etc/manticoresearch/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, invalid docs/hits (pos=20, word=00, docs=1769514, hits=-2126401355)
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check FAILED, 1 failures reported, 820.2 sec elapsed
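
The negative `hits` value is worth a closer look. One possible reading (an assumption, not something confirmed in this thread) is that a hit counter for the word `00` exceeded the signed 32-bit range and wrapped around. Reinterpreting the reported number as an unsigned 32-bit integer gives a value just above 2^31:

```python
INT32_MAX = 2**31 - 1  # 2147483647


def as_uint32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


reported_hits = -2126401355           # from the indextool output above
recovered = as_uint32(reported_hits)  # same bits, read as unsigned

print(recovered)               # 2168565941
print(recovered > INT32_MAX)   # True
```

If this reading is right, the word has roughly 2.17 billion hits, slightly more than a signed 32-bit counter can hold.
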
tomatolog commented 10 months ago

Yes, it seems main got broken during indexing.

Could you provide your source data along with config to reproduce issue here locally?

Grabien commented 10 months ago

Could you please send me an e-mail to max@grabien.com? I will send you the links to our data.

tomatolog commented 10 months ago

You could mail these to dev@manticoresearch.com, or you could upload the data as described in our manual: https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data

sanikolaev commented 10 months ago

@PavelShilin89 pls find @Grabien's email sent to dev@manticoresearch.com and prepare an MRE.

PavelShilin89 commented 9 months ago

I have run the indexer on both the dev version (6.2.13) and the release version (6.2.12); the crash does not reproduce. On the dev version 6.2.13 there are no errors and no crash, only warnings. On the release version 6.2.12 there is no crash, but an error occurs. I also increased the timeouts so that indexing could complete.

Here is my configuration file:

searchd
{
  listen = 59312:sphinx
  listen = 59306:mysql41
  listen = 59308:http
  log = /home/pavel/issue-1578/manticore/searchd.log
  query_log = /home/pavel/issue-1578/manticore/query.log
  pid_file = /home/pavel/issue-1578/manticore/searchd.pid
  query_log_format = sphinxql
  network_timeout = 600
}

indexer
{
  max_file_field_buffer = 16M
  mem_limit = 1024M
}

source database
{
  type = mysql
  sql_host = localhost
  sql_user = test
  sql_pass =
  sql_db = test
  sql_query_pre = set names utf8
  sql_query_pre = set character set utf8
  sql_query_pre = set session long_query_time = 3000
  sql_query_pre = set session wait_timeout = 3000
}

source transcripts : database
{
  sql_query_pre = update sm_sphinxcounters set lastid = (select max(id) from sm_transcripts) where indexname = 'transcripts'
  sql_query_range = select min(id), max(id) from sm_transcripts
  sql_range_step = 5000
  sql_file_field = filename
  sql_query = select id, id as docid, title, concat('/home/pavel/issue-1578/transcripts/', lpad(floor(id / 1000), 4, '0'), '/', id, '.', format) as filename from sm_transcripts where id >= $start and id <= $end and status = 'Active'
  sql_attr_uint = docid
}

source transcripts_delta : database
{
  sql_file_field = filename
  sql_query = select id, id as docid, title, concat('/home/pavel/issue-1578/transcripts/', lpad(floor(id / 1000), 4, '0'), '/', id, '.', format) as filename from sm_transcripts where id > (select lastid from sm_sphinxcounters where indexname = 'transcripts') and status = 'Active'
  sql_attr_uint = docid
}

index transcripts
{
  source = transcripts
  path = /home/pavel/issue-1578/transcripts/transcripts
}

index transcripts_delta
{
  source = transcripts_delta
  path = /home/pavel/issue-1578/transcripts/transcripts_delta
}

Logs on version 6.2.12:

pavel@dev2:~/issue-1578$ ./indexer -c manticore.conf --all
Manticore 6.2.12 dc5144d35@230822
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

WARNING: Error initializing columnar storage: daemon requires columnar library v21 (trying to load v24)
WARNING: Error initializing secondary index: daemon requires secondary library v10 (trying to load v13)
using config file '/home/pavel/issue-1578/manticore.conf'...
indexing table 'transcripts'...
ERROR: table 'transcripts': sql_fetch_row: Lost connection to MySQL server during query.
total 442699 docs, 41599553655 bytes
total 7692.191 sec, 5408023 bytes/sec, 57.55 docs/sec
indexing table 'transcripts_delta'...
collected 0 docs, 0.0 MB
total 0 docs, 0 bytes
total 4.738 sec, 0 bytes/sec, 0.00 docs/sec
total 442699 reads, 5505.621 sec, 91.7 kb/call avg, 12.4 msec/call avg
total 126230 writes, 40.445 sec, 333.9 kb/call avg, 0.3 msec/call avg

Logs on version 6.2.13:

WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728613.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728617.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728619.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728621.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728623.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728625.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728627.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728629.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728631.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728633.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728635.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728637.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728639.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728641.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728643.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728645.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728647.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728649.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728651.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728653.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728655.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728657.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728659.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728661.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728663.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728665.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728667.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728669.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728671.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728673.srt: No such file or directory
WARNING: failed to open /home/pavel/issue-1578/transcripts/3728/3728675.srt: No such file or directory
collected 1774552 docs, 96019.7 MB
creating secondary index
creating lookup: 1774.5 Kdocs, 100.0% done
sorted 20280.6 Mhits, 100.0% done
WARNING: table 'transcripts': failed to open /home/pavel/issue-1578/transcripts/3728/3728675.srt: No such file or directory.
total 1774552 docs, 96019756610 bytes
total 24149.946 sec, 3975982 bytes/sec, 73.48 docs/sec
indexing table 'transcripts_delta'...
collected 0 docs, 0.0 MB
creating secondary index
total 0 docs, 0 bytes
total 0.532 sec, 0 bytes/sec, 0.00 docs/sec
total 1803818 reads, 15395.620 sec, 77.4 kb/call avg, 8.5 msec/call avg
total 347138 writes, 183.746 sec, 437.2 kb/call avg, 0.5 msec/call avg
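
The paths in the warnings above follow from the `concat(...)` expression in `sql_query`. Below is a Python sketch of the same mapping. It mirrors MySQL `LPAD`, which pads short values and also cuts values longer than the target length. `transcript_path` is a hypothetical helper, not something used by indexer.

```python
def transcript_path(base: str, doc_id: int, fmt: str) -> str:
    """Rebuild the file name that sql_query produces for one row.

    Mirrors: concat(base, lpad(floor(id / 1000), 4, '0'), '/', id, '.', format)
    """
    bucket = str(doc_id // 1000)        # floor(id / 1000) for positive ids
    bucket = bucket.rjust(4, "0")[:4]   # lpad(..., 4, '0'): pad, or cut to 4 chars
    return f"{base}{bucket}/{doc_id}.{fmt}"


print(transcript_path("/home/pavel/issue-1578/transcripts/", 3728613, "srt"))
# /home/pavel/issue-1578/transcripts/3728/3728613.srt
```

A helper like this could be used to list missing files before indexing starts, since each missing file only produces a warning during the run.
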
Grabien commented 8 months ago

I tried to increase timeouts as in the config provided above, but unfortunately, there were no changes. The same crash in 2-3 days.

sanikolaev commented 8 months ago

@PavelShilin89 looks like you reproduced the issue, but didn't notice it, because you didn't run indextool --check:

snikolaev@dev2:/home/pavel/issue-1578$ indextool -c manticore.conf --check transcripts
Manticore 6.2.13 e80d505b9@240103 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, invalid docs/hits (pos=20, word=00, docs=1773341, hits=-2123101169)
checking data...

Please try to localize it now.

PavelShilin89 commented 7 months ago

@sanikolaev The problem really reproduces only on the full data volume; when the data volume is reduced, everything is correct. Also, after running the indexer you need to run indextool -c manticore.conf --check transcripts to see it. Logs:

pavel@dev2:~/issue-1578$ indextool -c manticore.conf --check transcripts
Manticore 6.2.13 978d5656c@24012517 dev (columnar 2.2.5 214ce90@240115) (secondary 2.2.5 214ce90@240115)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, invalid docs/hits (pos=20, word=00, docs=1760111, hits=-2131936544)
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check FAILED, 1 failures reported, 2358.5 sec elapsed
PavelShilin89 commented 7 months ago

I have verified that the problem is reproducible, but only with large amounts of data. When checking a certain part of the data or reducing the volume, everything works correctly.

MRE

Steps to reproduce:

  1. Log in to the remote server:
    ssh {yourname}@dev2.manticoresearch.com
  2. Go to the folder:
    cd /home/pavel/issue-1578
  3. Since it takes a long time to execute, you need to start a screen session:
    screen -x {name}
  4. Run the command:
    indexer -c manticore.conf --all
  5. Run the command:
    indextool -c manticore.conf --check transcripts

Logs:

pavel@dev2:~/issue-1578$ indextool -c manticore.conf --check transcripts
Manticore 6.2.13 978d5656c@24012517 dev (columnar 2.2.5 214ce90@240115) (secondary 2.2.5 214ce90@240115)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
FAILED, invalid docs/hits (pos=20, word=00, docs=1760111, hits=-2131936544)
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check FAILED, 1 failures reported, 2358.5 sec elapsed
Grabien commented 7 months ago

It seems that it is not related to the amount of data. Lately, the same issue is happening with another index, which is almost 40 times smaller. If it would simplify testing, I can also send all the files of this index.

sanikolaev commented 7 months ago

@Grabien it would be very helpful. Please do

Grabien commented 7 months ago

@sanikolaev I have just sent all the information to your email.

sanikolaev commented 7 months ago

@Grabien, I am unable to reproduce the issue with the new data files/tables. Running indextool -c manti.conf --check transcripts does not reveal any corruption in the tables, neither before nor after merging them. Could you provide more detailed instructions on how to reproduce the issue using the new files?

Grabien commented 7 months ago

@sanikolaev It's strange, but I was also not able to reproduce the issue with this data anymore. I will keep testing and let you know once I have any new information.

klirichek commented 5 months ago

Tried several times with different builds and hardware:

  - 6.2.12 (release, fresh build)
  - 6.2.12 (release, copied file which was reported as reproducible)
  - rev ddd7c3ed (master)

on:

  - x86_64
  - M2 (arm64)

No corruption revealed; nothing to fix. Used test data from https://github.com/manticoresoftware/manticoresearch/issues/1578#issuecomment-1926477102

klirichek commented 5 months ago

Tried current master, with the following results:

pavel@dev2:~/issue-1578$ time indexer -c manticore.conf transcripts
Manticore 6.2.13 7ecf541ab@24041615 dev (columnar 2.2.5 b4f7386@240405) (secondary 2.2.5 b4f7386@240405) (knn 2.2.5 b4f7386@240405)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
indexing table 'transcripts'...
collected 1409687 docs, 85066.8 MB
creating secondary index
creating lookup: 1409.6 Kdocs, 100.0% done
sorted 17942.5 Mhits, 100.0% done
total 1409687 docs, 85066857598 bytes
total 21945.242 sec, 3876323 bytes/sec, 64.23 docs/sec
total 1433399 reads, 13545.208 sec, 86.0 kb/call avg, 9.4 msec/call avg
total 91279 writes, 217.079 sec, 1463.8 kb/call avg, 2.3 msec/call avg

real    365m45.647s
user    132m1.250s
sys     3m49.842s
pavel@dev2:~/issue-1578$ indextool --check transcripts
Manticore 6.2.13 7ecf541ab@24041615 dev (columnar 2.2.5 b4f7386@240405) (secondary 2.2.5 b4f7386@240405)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 863.4 sec elapsed

The problem is still not reproduced, so the MRE no longer looks relevant.

BTW, for a real run it is necessary to run 'su pavel' first; otherwise the reported MRE fails at indexing because it can't access the .spl file.

sanikolaev commented 5 months ago

> The problem is still not reproduced, so the MRE no longer looks relevant.

@PavelShilin89 please prepare a better MRE or confirm the problem is solved.

klirichek commented 5 months ago

I've started one more run (still in progress) to be doubly sure, on the original host, dev2.

Comparing the 2 reports, I see that the original check ran for >2000s, but my latest one on the same hardware took <900s. Maybe it means the system was busy and the overall load somehow affected the result, but this is just a guess.
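
To compare reports like these without reading them by eye, the final summary line of each indextool log can be reduced to a status and an elapsed time. This is a minimal sketch that assumes the summary line always has the shape seen in this thread ("check passed, N sec elapsed" or "check FAILED, ..., N sec elapsed"); check_summary is a hypothetical helper name.

```shell
# Print "<status> <seconds>" for the summary line of an indextool --check log.
# Reads the files given as arguments, or stdin when called without arguments.
# check_summary is a hypothetical helper; the line format is assumed from the logs above.
check_summary() {
  awk '/^check (passed|FAILED)/ {
         status = $2
         sub(/,$/, "", status)            # drop the trailing comma
         for (i = 3; i <= NF; i++)
           if ($i == "sec") secs = $(i - 1)
         print status, secs
       }' "$@"
}

# Summary lines copied from the two reports in this thread.
printf 'check FAILED, 1 failures reported, 2358.5 sec elapsed\n' | check_summary   # prints: FAILED 2358.5
printf 'check passed, 863.4 sec elapsed\n' | check_summary                         # prints: passed 863.4
```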

klirichek commented 5 months ago

Last control check: indexing done (took ~6 hours), checking done, no problems revealed.

pavel@dev2:~/issue-1578$ indextool -c manticore.conf  --check transcripts
Manticore 6.2.13 7ecf541ab@24041615 dev (columnar 2.2.5 b4f7386@240405) (secondary 2.2.5 b4f7386@240405)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)

using config file '/home/pavel/issue-1578/manticore.conf'...
checking table 'transcripts'...
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 918.1 sec elapsed

One extra thing: hashes of the index files (it may be faster to compare them than to run indextool):

a172ffca50a20b20c4a76364913403a8 *transcripts.spa
1cd74f69280aa5aa7ef6476f62e88402 *transcripts.spd
80dd94cae7da4c1efbc10a549a3a52f1 *transcripts.spds
cc4ea1666238c44060f5f1f1452e4d53 *transcripts.spe
39d268632ac3fa39cc32c628e68978f5 *transcripts.sphi
609b897389df7708d58d63649646b2c6 *transcripts.spi
a31d1ae0a56473f262ebe1b5ef4e0bb6 *transcripts.spidx
b5a3c96a4ce2d9a9bba4945490730f9a *transcripts.spm
acd9f1dd531d59ae9ae77011881e03c7 *transcripts.spp
04845ecb3a17d4af6694c0116096dad7 *transcripts.spt

(.sph is excluded, since it has a timestamp inside, so its hash will be different on each attempt.) These hashes stay the same across all kinds of indexing (on dev, on another Intel host, on a Mac M2).
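
The hash list above can be produced along these lines. This is a sketch, not the exact command that was used: hash_index_files is a hypothetical helper, the paths in the usage comment are placeholders, and skipping the .spl lock file is an extra assumption on top of the .sph exclusion described above.

```shell
# Print md5 hashes of a table's files, skipping .sph (has a timestamp inside)
# and .spl (lock file; skipping it is an assumption, it is not in the list above).
# hash_index_files is a hypothetical helper name.
hash_index_files() {
  dir=$1
  table=$2
  for f in "$dir/$table".sp*; do
    case "$f" in
      *.sph|*.spl) continue ;;
    esac
    [ -f "$f" ] || continue
    # Hash by base name so lists from different directories are comparable.
    ( cd "$dir" && md5sum -b "$(basename "$f")" )
  done
}

# Usage (placeholder paths):
#   hash_index_files /path/to/run1 transcripts > run1.md5
#   hash_index_files /path/to/run2 transcripts > run2.md5
#   cmp run1.md5 run2.md5 && echo "index files are identical"
```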

PavelShilin89 commented 4 months ago

@sanikolaev I tried for a long time to reproduce the problem on dev2 and trigger the error, but never managed to. I compared the hashes of the index files many times, and they always matched too. As an experiment, I tried changing the step range to 50, 150, 200, 250, 300, this also did not give the error. I have no more ideas how one can deliberately affect the process and get an error.
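
Repeated attempts like these can be scripted so that each run leaves its logs behind. The sketch below uses only the indexer and indextool invocations already shown in this thread; repro_loop and the log file names are hypothetical, and detecting success by the "check passed" line is an assumption based on the logs above. Each iteration took about six hours on dev2, so it is best run inside screen.

```shell
# Rebuild and check a table up to N times, stopping at the first check that does not pass.
# repro_loop and the *.log file names are hypothetical.
repro_loop() {
  conf=$1
  table=$2
  attempts=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    indexer -c "$conf" "$table" > "indexer.$i.log" 2>&1
    indextool -c "$conf" --check "$table" > "check.$i.log" 2>&1
    # Assumption: a clean run ends with a line starting with "check passed".
    if ! grep -q '^check passed' "check.$i.log"; then
      echo "attempt $i: check did not pass, see check.$i.log"
      return 1
    fi
    echo "attempt $i: check passed"
    i=$((i + 1))
  done
  return 0
}

# Usage:
#   repro_loop manticore.conf transcripts 5
```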

sanikolaev commented 4 months ago

> changing the step range to 50, 150, 200, 250, 300, this also did not give the error. I have no more ideas

You originally reproduced it with step 5000 here https://github.com/manticoresoftware/manticoresearch/issues/1578#issuecomment-1869033529 (related comment https://github.com/manticoresoftware/manticoresearch/issues/1578#issuecomment-1882264575)

Try to replicate exactly what you did back then.

PavelShilin89 commented 4 months ago

@sanikolaev I have tested all ways to get this error, but have never been able to reproduce it. Maybe there are other ideas how to reproduce this error?