My 50 cents of research: I have a pure Python library for reading Tantivy store files: https://github.com/izihawa/tantipy. I updated it yesterday so that it can read the new store format with skip_index. The library now takes the last layer in the skip_index, reads the compressed block offsets, and reads & unpacks the blocks.
So I took the library and opened the .store file. The file opened normally, so the last layer of the skip index and all the VInts required to decode it are fine. I hope this helps you dig further and figure out what's wrong.
TantiPy snippet:
import yaml
import tantipy
from tantipy import TantivyReader

store = open('e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.store', 'rb').read()
with open('schema.yaml') as file:
    coder = tantipy.TantivyCoder(yaml.safe_load(file.read()))
reader = TantivyReader(store, coder)
print(len(list(reader.documents())))  # prints 59
Update: an attempt to .get the last documents from this segment's store leads to a VInt decoding error. It looks like the skip_index is broken in some way.
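For context, a rough sketch of why a wrong byte offset typically surfaces as a decoding error. This assumes a generic LEB128-style varint (7 bits per byte, continuation flag in the high bit); tantivy's actual VInt layout may differ in which bit marks the end, so treat it as an illustration only:

def decode_varint(buf, pos):
    """Decode an unsigned varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")  # typical when the offset points past the data
        byte = buf[pos]
        result |= (byte & 0x7F) << shift
        pos += 1
        if byte & 0x80 == 0:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")  # typical when the offset lands in compressed garbage

Starting the decode at a shifted offset lands in the middle of unrelated bytes, which is exactly the kind of error seen here.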
Ok, thanks... I'll have a look at that today! This is very helpful.
The checkpoints seem broken (even those of the last layer)...
[src/store/index/skip_index.rs:80] checkpoint = (doc=[0..38), bytes=[0..6993))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[38..46), bytes=[6993..8399))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[46..61), bytes=[8399..10550))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[61..62), bytes=[10550..10749))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[62..63), bytes=[10749..11151))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[63..64), bytes=[11151..11341))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[64..65), bytes=[11341..11532))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[65..66), bytes=[11532..11874))
[src/store/index/skip_index.rs:80] checkpoint = (doc=[58..59), bytes=[11874..12084))
I suspect a bug in the merge logic when there are deletes.
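For reference, a minimal sketch that checks the contiguity of the checkpoints above, with the values transcribed from the debug output as (doc_start, doc_end, byte_start, byte_end) tuples:

# Checkpoints copied from the debug output above.
checkpoints = [
    (0, 38, 0, 6993),
    (38, 46, 6993, 8399),
    (46, 61, 8399, 10550),
    (61, 62, 10550, 10749),
    (62, 63, 10749, 11151),
    (63, 64, 11151, 11341),
    (64, 65, 11341, 11532),
    (65, 66, 11532, 11874),
    (58, 59, 11874, 12084),
]

for prev, cur in zip(checkpoints, checkpoints[1:]):
    # Both the doc ranges and the byte ranges must be contiguous and increasing.
    if cur[0] != prev[1] or cur[2] != prev[3]:
        print("broken checkpoint:", cur, "follows", prev)

This flags only the last checkpoint: doc=[58..59) follows doc=[65..66), while the byte ranges stay contiguous.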
I added a functional test stress-testing the store, but I haven't been able to reproduce it so far. Can you confirm that you did not build your index with a modified version of tantivy?
It would probably help to investigate the crime scene; here is the order of my actions.
In the first step I got 16-24 (edited: I give the exact number below) equally sized segments containing all the data. When I started serving, the merger decided to merge all of these large segments (could that explain the initial memory bloat? the memory bloat due to the failing merge is confirmed by the file timestamps below). New documents were being upserted at the same time, and these upserts definitely overwrote documents that were under merging. And I suspect that the broken segment appeared at this moment.
Can you confirm that you did not build your index with a modified version of tantivy?
The only difference now is the following (it is in my fork):
pasha@Pashas-MacBook-Air tantivy % git diff upstream/main origin/master
diff --git a/src/lib.rs b/src/lib.rs
index 33baf80d7..e480b3b86 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -157,7 +157,7 @@ pub use self::snippet::{Snippet, SnippetGenerator};
mod docset;
pub use self::docset::{DocSet, TERMINATED};
-pub use crate::common::HasLen;
+pub use crate::common::{BinarySerializable, HasLen};
pub use crate::common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
pub use crate::core::{Executor, SegmentComponent};
pub use crate::core::{Index, IndexMeta, Searcher, Segment, SegmentId, SegmentMeta};
Do you have an idea of how long it took to appear? Can I send you a PR to try to detect it at the right moment, in case you eventually reproduce it?
And I suspect that the broken segment appeared at this moment.
This segment is pretty small. It is very unlikely that it appeared during the initial large merge.
And I suspect that the broken segment appeared at this moment.
This segment is pretty small. It is very unlikely that it appeared during the initial large merge.
I meant that this segment was the result of upserting new documents while the large segments were being merged.
You know, I have CPU graphs and segment creation times; let me look at the details. I will probably find something.
I meant that this segment was the result of upserting new documents while the large segments were being merged.
Apart from defective RAM, it should not matter though. All the clues I have so far point to a logic error in tantivy in the way we merge stores... especially the stacking, but I cannot find the problem yet.
Here is some long bash output with probably unrelated highlights; not sure what will be useful, but here it is.
Initial index state. It is split into two files: store.tar
(*.store files) and index.tar
(all others)
(venv) pasha@host:~$ tar -tvf dump/store.tar | head
-rw-rw-r-- 1000/1000 414026422 2021-01-05 20:36 181f462a1dca4bc6b3e923dc38946afc.store
-rw-rw-r-- 1000/1000 433821910 2021-01-05 22:47 edf8a3aa77704c708a82ab17a10c0742.store
-rw-rw-r-- 1000/1000 405652611 2021-01-06 02:47 f3e0c3fa55e447bc9a56cedb9a91a379.store
-rw-rw-r-- 1000/1000 428915994 2021-01-05 21:24 cb1e90f9652a415fb767712fa9f7825b.store
-rw-rw-r-- 1000/1000 406850842 2021-01-06 01:59 42368d2b647242a1976e0ca8ae8ce6f9.store
-rw-rw-r-- 1000/1000 425311840 2021-01-05 23:21 4ece9c55151a418280aa7115ed403cf4.store
-rw-rw-r-- 1000/1000 429223477 2021-01-05 21:24 ee140cf366274069b9848a2ba5777eb2.store
-rw-rw-r-- 1000/1000 406520875 2021-01-06 01:58 d9cde8c9e643407282c3b086d19dda7c.store
-rw-rw-r-- 1000/1000 432718035 2021-01-05 22:18 e2e3d29374504338be5945b736f142ce.store
-rw-rw-r-- 1000/1000 414170229 2021-01-06 02:47 70f9e168447c4a05a009e3aee387c763.store
(venv) pasha@host:~$ tar -tvf dump/store.tar | wc -l
66
(venv) pasha@host:~$ tar -tvf dump/index.tar | head
-rw-rw-r-- 1000/1000 579024 2021-01-05 22:18 d7cd81b7ed724fab8ba7e5d830bee126.posidx
-rw-rw-r-- 1000/1000 24184884 2021-01-05 21:24 cb1e90f9652a415fb767712fa9f7825b.fieldnorm
-rw-rw-r-- 1000/1000 169514364 2021-01-05 22:18 c9de1254ccbe4888a07e9a0f117541a4.idx
-rw-rw-r-- 1000/1000 157613031 2021-01-05 23:20 386712fb611c450bb7e2fefc0afa3761.idx
-rw-rw-r-- 1000/1000 71887791 2021-01-05 23:21 4ece9c55151a418280aa7115ed403cf4.pos
-rw-rw-r-- 1000/1000 160869272 2021-01-05 19:33 7885d2b0bb7b4bd78b2085e43515965f.idx
-rw-rw-r-- 1000/1000 6189987 2021-01-06 01:58 0e492b6f62934fdb907471b953f0d825.fast
-rw-rw-r-- 1000/1000 27202835 2021-01-05 22:18 d7cd81b7ed724fab8ba7e5d830bee126.fieldnorm
-rw-rw-r-- 1000/1000 601296 2021-01-05 23:20 386712fb611c450bb7e2fefc0afa3761.posidx
-rw-rw-r-- 1000/1000 169509571 2021-01-05 22:18 076e754dae714274a871a9649cc9fd9a.idx
(venv) pasha@host:~$ tar -tvf dump/index.tar | wc -l
384
Currently the biggest files in the index (merging and upserting are disabled at the moment):
(venv) pasha@host:~$ ls -lS ...index | head -n 30
total 44384984
-rw-r--r-- 1 root root 26874313764 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.store
-rw-r--r-- 1 root root 10255159788 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.idx
-rw-r--r-- 1 root root 4641827492 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.pos
-rw-r--r-- 1 root root 1516353764 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.term
-rw-r--r-- 1 root root 1488021107 Jan 6 06:49 674af0595b314b84965eb0a7b742869f.fieldnorm
-rw-r--r-- 1 root root 465006709 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.fast
-rw-r--r-- 1 root root 38486536 Jan 6 07:21 674af0595b314b84965eb0a7b742869f.posidx
-rw-r--r-- 1 root root 15500382 Jan 6 15:16 674af0595b314b84965eb0a7b742869f.124037308.del
-rw------- 1 root root 648549 Jan 7 06:00 meta.json
-rw-r--r-- 1 root root 100867 Jan 6 15:08 0bf1f1e1a9d748028c1a0996e5e48f94.term
-rw-r--r-- 1 root root 47941 Jan 6 15:08 0bf1f1e1a9d748028c1a0996e5e48f94.store
-rw-r--r-- 1 root root 29749 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.term
-rw-r--r-- 1 root root 29118 Jan 6 09:08 667290ba18bc42eebb65a52a27361ee9.term
-rw-r--r-- 1 root root 28073 Jan 6 09:42 386e3a4734ca47b29dbe884c2f03f151.term
-rw-r--r-- 1 root root 20837 Jan 6 09:08 12b7f503172a4f399e0870e0f752c3f5.term
-rw-r--r-- 1 root root 19809 Jan 6 09:38 fa709e13f63643e5a65bd5b1dfcb3260.term
-rw-r--r-- 1 root root 18426 Jan 6 09:10 33869f79d25e45ae8c38979336c05c07.term
-rw-r--r-- 1 root root 17934 Jan 6 15:08 0bf1f1e1a9d748028c1a0996e5e48f94.idx
-rw-r--r-- 1 root root 17759 Jan 6 09:07 7df9ddfa8e8c45d2b7e3ea6ff335a954.term
-rw-r--r-- 1 root root 17637 Jan 6 09:31 eb4fce9772f64e2db04e63891b49641f.term
-rw-r--r-- 1 root root 15894 Jan 6 09:13 4dfb5a6890104ae98b82fe94717a8686.term
-rw-r--r-- 1 root root 15073 Jan 6 09:31 9cc3fc25ed954e359d051d0307a4dca0.term
-rw-r--r-- 1 root root 14850 Jan 6 09:04 94a28fd8f9b94bd096ecca7f3968c1a4.term
-rw-r--r-- 1 root root 14708 Jan 6 09:41 46657e98ed934e71abf1ad7d4bcaf1b1.term
-rw-r--r-- 1 root root 14543 Jan 6 09:41 0dd7f11064b94b489d2b1f87ffdf3c97.term
-rw-r--r-- 1 root root 14273 Jan 6 09:33 5b2391b28ea4423a8abeed3b93b150c2.term
-rw-r--r-- 1 root root 14132 Jan 6 09:28 e7f6dc9cce474437845db7bbb4fd3060.term
-rw-r--r-- 1 root root 13934 Jan 6 09:10 381ec8f9ce5244d7a10044784fc011f9.term
-rw-r--r-- 1 root root 13880 Jan 6 09:41 91279dc2077a4b398b3cbdb2f4efc9e2.term
Broken segment times:
(venv) pasha@host:~$ ls -l index | grep e6ece22e5bca4e0dbe7ce3e4dcbd5bbf
-rw-r--r-- 1 root root 173 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.124002266.del
-rw-r--r-- 1 root root 421 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.fast
-rw-r--r-- 1 root root 950 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.fieldnorm
-rw-r--r-- 1 root root 4107 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.idx
-rw-r--r-- 1 root root 1717 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.pos
-rw-r--r-- 1 root root 277 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.posidx
-rw-r--r-- 1 root root 12301 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.store
-rw-r--r-- 1 root root 29749 Jan 6 09:01 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.term
Memory bloat began right after 09:01, which confirms that it is linked to the failing merge.
I will add more here if I find anything else.
Do you have an idea of how long it took to appear? Can I send you a PR to try to detect it at the right moment, in case you eventually reproduce it?
It appeared rather fast: between a few minutes and 1-2 hours, at a rate of 100-1000 writes per minute. I have the initial dump and could start everything from scratch. I could apply any PR you send, but the issue looks probabilistic and is not guaranteed to reproduce.
I meant that this segment was the result of upserting new documents while the large segments were being merged.
Apart from defective RAM, it should not matter though. All the clues I have so far point to a logic error in tantivy in the way we merge stores... especially the stacking, but I cannot find the problem yet.
Yes, you are right. Checked against the timestamps: the merging of the initial large segments had ended before I started upserting new records.
And I finally found a proper graph. The broken segment was produced at 09:01 (I can't be more precise because it is a file timestamp), and I enabled the upserting that produced this segment right at 09:01:00-09:01:15. Not sure I have any more interesting graphs.
Update: found how to show the full time for files.
(venv) pasha@host:~$ ls -la --time-style=full-iso ./index | grep e6ece22e5bca4e0dbe7ce3e4dcbd5bbf
-rw-r--r-- 1 root root 173 2021-01-06 09:01:16.493195352 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.124002266.del
-rw-r--r-- 1 root root 421 2021-01-06 09:01:13.941154323 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.fast
-rw-r--r-- 1 root root 950 2021-01-06 09:01:13.865153102 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.fieldnorm
-rw-r--r-- 1 root root 4107 2021-01-06 09:01:13.945154388 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.idx
-rw-r--r-- 1 root root 1717 2021-01-06 09:01:13.949154453 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.pos
-rw-r--r-- 1 root root 277 2021-01-06 09:01:13.949154453 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.posidx
-rw-r--r-- 1 root root 12301 2021-01-06 09:01:13.953154516 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.store
-rw-r--r-- 1 root root 29749 2021-01-06 09:01:13.945154388 +0000 e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.term
Only a single bit seems corrupted... The md5 in the file footer is correct. I am starting to suspect your RAM is defective. Can you run a RAM check?
Here is a snippet to patch your file.
content = open("e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.store", "rb").read()
offset = 12103
# Flip bit 3 of the byte at this offset to undo the corruption.
patched_content = content[:offset] + bytes([content[offset] ^ (1 << 3)]) + content[offset + 1:]
open("e6ece22e5bca4e0dbe7ce3e4dcbd5bbf.store.patched", "wb").write(patched_content)
Only a single bit seems corrupted... The md5 in the file footer is correct. I am starting to suspect your RAM is defective. Can you run a RAM check?
That would be epic; it is Samsung RAM with ECC, lol. Sure, I will check, but I need time to offload this server first.
If you have ECC that is quite unlikely... :-S
Can you confirm the patch works for you?
I can confirm that the patch healed this segment, but it looks like merging has failed again. I relaunched the server with the debugging print enabled to find out where it is failing now. Will post the results here in a while.
putting on sherlock hat
putting on sherlock hat
evading hostile space rays
Sent you the next broken segment. Could you apply your inverted-bit detection magic there?
Ok, the error is super similar but it is not one bit... So the one-bit thing was just a coincidence.
canceling purchase of space radiation shield for server
Logging this for myself or future generations: checkpoints are delta-encoded, but in packs of eight. Here, in both cases, the first checkpoint's start document of the second block is off by 6 (by 8 for the previous faulty segment), hence shifting all of the subsequent checkpoints.
You don't have logs, by any chance? With RUST_LOG=info, tantivy logs which merges occur. I would like to confirm that these two segments were produced by a merge.
No. I know I should have enabled it when releasing from the master branch, but at the time I decided to make a brave move :) Nevertheless, I can enable it and start from the beginning, but it will take time.
You don't have logs, by any chance? With RUST_LOG=info, tantivy logs which merges occur. I would like to confirm that these two segments were produced by a merge.
A side note, IMO useful for future debugging: Tantivy could use a more purposeful algorithm for generating segment names, something like Twitter ULIDs, which include timestamps and clues pointing to the origin of a segment.
That is a good idea. Can you open a ticket?
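A toy illustration of that shift (this is not tantivy's actual on-disk encoding, just generic block-wise delta coding in packs of eight): if the first value of a block is stored slightly too low (here, by 6), every checkpoint reconstructed from that block comes out too low by the same amount.

def delta_encode_blocks(values, block_size=8):
    """Toy block-wise delta encoding: each block stores its first value plus deltas."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        deltas = [b - a for a, b in zip(block, block[1:])]
        blocks.append((block[0], deltas))
    return blocks

def delta_decode_blocks(blocks):
    values = []
    for first, deltas in blocks:
        current = first
        values.append(current)
        for d in deltas:
            current += d
            values.append(current)
    return values

doc_starts = list(range(0, 160, 10))   # 16 checkpoint start docs -> two blocks of eight
blocks = delta_encode_blocks(doc_starts)
first, deltas = blocks[1]
blocks[1] = (first - 6, deltas)        # corrupt only the second block's first value
print(delta_decode_blocks(blocks)[8:]) # the entire second block is shifted down by 6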
I think I found the bug!
To reproduce, a merge needs to occur with:
In that case, first_doc_in_block is not updated, which creates the bug.
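A simplified sketch of that failure mode (hypothetical toy code, not tantivy's actual writer): if first_doc_in_block is not reset after a checkpoint is emitted, the next checkpoint reports a stale start document, producing shifted ranges like the doc=[58..59) checkpoint seen in the debug output above.

class ToyStoreWriter:
    """Toy checkpoint writer illustrating a stale first_doc_in_block."""
    def __init__(self, buggy=False):
        self.buggy = buggy
        self.first_doc_in_block = 0
        self.checkpoints = []

    def flush_block(self, next_first_doc):
        # Record the doc range covered by the block that just ended.
        self.checkpoints.append((self.first_doc_in_block, next_first_doc))
        if not self.buggy:
            self.first_doc_in_block = next_first_doc  # the step the buggy path skips

writer = ToyStoreWriter(buggy=True)
for boundary in (38, 46, 61):
    writer.flush_block(boundary)
# Buggy output: [(0, 38), (0, 46), (0, 61)]
# Correct output (buggy=False): [(0, 38), (38, 46), (46, 61)]
print(writer.checkpoints)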
Can't wait to try it!
@ppodolsky Thank you very much for the patience and all of your help reporting the bug!
It was fun, thank you too!
Probably related to #897
Describe the bug
In continuation of https://gitter.im/tantivy-search/tantivy?at=5ff589c503529b296bd23728. First, I noticed memory bloat in the Tantivy application. After debugging, I found that the merging thread had been failing every time while merging the broken store of a particular segment. I sent you an example of the whole segment personally in Gitter.
Which version of tantivy are you using?
https://github.com/tantivy-search/tantivy/commit/a4f33d3823f1bad3ff7a59877f1608615acabe6e
What happened
I used poor man's debugging and launched Tantivy with a patched function (only a print added):
Stdout after the failed merge: