Open yharahuts opened 9 months ago
I can't reproduce a crash in 6.2.12 with the loading script based on your schema:
even with the higher concurrency of 8:
# php ~/load_1891.php 500 8 10000000 1
preparing...
100% querying...
finished inserting
Total time: 832.53445100784
12012 docs per sec
mysql> select count(*) from redacted_aaa;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (0.00 sec)
There was a somewhat similar issue https://github.com/manticoresoftware/manticoresearch/issues/1458#issuecomment-1790605768 which has already been fixed. I suggest you check if the crash persists in the latest dev version - https://mnt.cr/dev/nightly
You can also try modifying the script so that it reproduces the crash; then we can reproduce it on our end and fix it.
@sanikolaev it happens very randomly: I can load 100 GB of data without any problems at all, or hit problems on a 15 GB dataset at a random point.
It is just like your comment on that issue:
It's also very unstable: sometimes the provided script works fine for the whole night, sometimes it crashes in a minute after started.
I'm currently testing the manticoresearch/manticore:dev image - but will need some (rather long) time to test with various data and confirm it is a duplicate and it is fixed.
I don't know if it helps, but I managed to get into a similar state with two vector fields and the columnar engine in the same table.
@MirosOwners Do you mean in the same table as in the script here https://github.com/manticoresoftware/manticoresearch/issues/1891#issuecomment-1971707674 ?
It stopped crashing on this index with the dev version, but started to crash on another index. This time the logs are clear: Manticore just dies and starts again as if nothing happened.
Edit: as far as I can see, it just slowly overflows all available memory. Any ideas how to debug this?
I've tried adding flush ramchunk during inserts, but no luck.
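One way to confirm that memory grows steadily during inserts is to sample the resident set size of the searchd process while the load runs. The following is only a minimal sketch under stated assumptions: it is Linux-only, it reads `/proc/<pid>/status`, and the helper names `parse_vm_rss_kib` and `read_rss_kib` are hypothetical, not part of Manticore.

```python
import re
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="ascii") as fh:
        return parse_vm_rss_kib(fh.read())
```

Calling `read_rss_kib` with the searchd PID every few batches and logging the result next to the number of inserted rows should show whether memory climbs in step with the load or jumps at a particular point.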
Edit: as far as I can see, it just slowly overflows all available memory. Any ideas how to debug this?
@yharahuts So it doesn't crash in the dev version, but just an OOM occurs?
I've been dealing with this sporadic problem for weeks now. I finally found this thread and after review, one comment stood out:
@MirosOwners
I don't know if it helps, but I managed to get into a similar state with two vector fields and the columnar engine in the same table.
Although I cannot say precisely when the problem started, I do know that I somewhat recently (weeks ago) added vector fields to my table (3 of them, dim = 384, hnsw, l2). I cannot recall having this problem before doing so, although I am not positive.
The problem occurs sporadically during high-throughput indexing, whether through the bulk API or not: the server crashes with signal 11, and upon restart and replaying binlog, it also crashes (perpetual crash loop from there). I immediately implemented a sleep mechanism between the batches, which may help but does not solve the issue. The crash does not occur when indexing small amounts, and I can use these 3 vector fields at search time.
I seem to be able to reset it to a stable state with an rm -rf table_name followed by re-indexing smaller amounts or with sleeps added between calls, but it appears that I can re-introduce the bug just by throwing data at the instance for long enough (I re-index in my developer environment frequently and sometimes test large datasets).
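The "batches with a sleep in between" workaround described above can be sketched roughly as follows. This is an illustration only, not the loader actually used in this report: `send` is a hypothetical callable standing in for whatever performs the real insert (bulk API or SQL), and the batch size and pause are placeholder values.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def load_in_batches(
    rows: Iterable[T],
    send: Callable[[List[T]], None],  # hypothetical: performs the actual insert
    batch_size: int = 100,
    pause_s: float = 0.5,
) -> int:
    """Send rows batch by batch, sleeping between batches; return rows sent."""
    sent = 0
    for batch in batched(rows, batch_size):
        send(batch)
        sent += len(batch)
        if pause_s > 0:
            time.sleep(pause_s)
    return sent
```

Setting `pause_s=0` gives the unthrottled variant, which makes it easy to switch between the two modes when trying to reproduce the crash.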
My initial test, which I don't see as confirmation, but it inspired me to post this message with this detail:
I just removed the three vector fields from the RT CREATE TABLE statement and successfully ran a total reindex on my data set with no sleep between batches of 100, taking 450 seconds total. This is only 23k rows total, but some of the fields are large (~1 MB JSON file). I don't think anything spectacular is going on here; the dataset isn't even that big (250 MB). More importantly, regarding the vector fields, it does not matter whether or not I generate a value for the Manticore index to consume. In short, the existence of vector fields may actually have something to do with this, although I can't reason why that would be the case myself. I'm just reporting it with ~medium confidence so that a dev can filter through the above info.
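To go back and forth between the two table variants in a repeatable way, the CREATE TABLE statement can be generated with and without the vector fields. The sketch below is an assumption-laden illustration: the table and column names are hypothetical, and the `float_vector knn_type='hnsw' knn_dims='...' hnsw_similarity='l2'` clause reflects my understanding of the Manticore KNN syntax, so it should be checked against the manual for the version in use.

```python
from typing import Sequence


def create_table_sql(
    table: str,
    base_columns: Sequence[str],
    vector_fields: Sequence[str] = (),
    dims: int = 384,
) -> str:
    """Build a CREATE TABLE statement, optionally appending HNSW/L2 vector fields."""
    columns = list(base_columns)
    for name in vector_fields:
        # Assumed KNN column syntax; verify against the Manticore manual.
        columns.append(
            f"{name} float_vector knn_type='hnsw' knn_dims='{dims}' hnsw_similarity='l2'"
        )
    return f"CREATE TABLE {table} ({', '.join(columns)})"
```

Running the same load against a table created with `vector_fields=()` and one created with three vector field names keeps everything else identical, so any difference in stability can be attributed to the vector fields alone.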
if you have a crash loop
and upon restart and replaying binlog, also crashes (perpetual crash loop from there)
it would be better to upload your index files along with the binlog so that we can reproduce that crash loop here and fix the issue. You can upload your data as described in the manual: https://manual.manticoresearch.com/dev/Reporting_bugs#Uploading-your-data
If I just got lucky and the problem happens again, I can think about how to safely send anonymized data. I need to assess the feasibility for the rest of the fields, and there is still data that I do not wish to send.
Based on the details of my report: I just wanted to clarify again, although I can't make sense of it, that it points directly to the issue not being the data itself. I am utilizing the entire dataset fine after removing the 3 vector fields during table creation. Finally, in either case, the sporadic bug happens during indexing without even passing values for those vector fields.
I will test further and maybe run an even larger job and report back only if the problem begins again. If the bad state does not happen, since the only change made was not adding these fields, I can immediately pass over the create table statement, although it's just 384 dims, hnsw, l2 x3 vectors fields (and about ~25 other fields.)
@tomatolog or another: If you have a suspicion that the root cause is in fact the data, and that I am thus missing something crucial about the code / architecture itself, I'd appreciate you clarifying that as well.
All crashes that could be reproduced locally are already fixed or on their way into the master branch.
Our team does not have any clue what could cause such a crash, as we have neither data that reproduces the crash nor the crash log from searchd.log where the crash stack could be checked.
Completely understand.
Is there a single viable thesis on the addition of vector fields to the table? I can try to go back and forth to assert further certainty on this being it. Other than that, I am not sure I can help atm with submitting data; will think on that more.
It would be much easier if we had at least one of the following:
Without these, it may be hard to resolve the issue. It's best to have all of them, as this significantly improves the chances of finding a solution.
Describe the bug
Manticore crashes when inserting a large amount of data into an index.
Manticore is running in rt mode with the following tables:
Data is inserted via (rather large?) batches of 500 records per single insert, and the whole dataset contains about 100m rows split into 1-3 indexes. The crash happens randomly: data can be inserted without any problems at all, or it can crash at ~1-2% at a random line.
Since it is a prod instance, I'm afraid I cannot give you our datasets or test multiple (older?) Manticore versions.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should not crash.
Describe the environment:
searchd -v: Manticore 6.2.12 dc5144d35@230822
manticoresearch/manticore:6.2.12
Messages from log files:
docker logs shows the following:
After that it restarts with:
Additional context
While writing this issue, I came up with two ideas:
decrease batch size to maybe 50 rows per single insert; didn't help
I'll try both options, but since the crash happens randomly, I can't guarantee whether it will work or not.
Any advice is greatly appreciated.
indextool --check on both indexes returns: