infiniflow / infinity

The AI-native database built for LLM applications, providing incredibly fast hybrid search of dense vector, sparse vector, tensor (multi-vector), and full-text
https://infiniflow.org
Apache License 2.0

[Bug]: cluster test shutdown, recover repeatedly #2288

Open. small-turtle-1 opened this issue 16 hours ago

small-turtle-1 commented 16 hours ago

Is there an existing issue for the same bug?

Version or Commit ID

d54db1a

Other environment information

No response

Actual behavior and How to reproduce it

Run:

python -m pytest -s -x --infinity_path=your/infinity/executable/path --infinity_dir=$PWD -k test_cluster_leader_shutdown_and_recover_in_follower
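For context, the test alternately kills the current leader mid-insert, verifies the data on the surviving follower, promotes that follower, and restarts the old leader as a follower. A rough sketch of that loop is below; every helper name on cluster (shutdown_while_inserting, verify_data, set_role, start_node) is hypothetical and stands in for the real test fixtures, it only mirrors the sequence visible in the pytest log that follows.

def leader_shutdown_and_recover(cluster, rounds: int = 10) -> None:
    # Sketch of the loop traced by the pytest log below; the cluster
    # interface here is assumed, not the actual fixture API.
    leader, follower = "node1", "node2"
    for test_i in range(rounds):
        cluster.shutdown_while_inserting(leader)  # "leader nodeX shutdown when insert"
        cluster.verify_data(follower)             # data must survive on the follower
        cluster.set_role(follower, "leader")      # promote the surviving node
        cluster.start_node(leader)                # restart the killed node ...
        cluster.set_role(leader, "follower")      # ... and rejoin it as a follower
        cluster.verify_data(leader)               # replicated data visible again
        leader, follower = follower, leader       # roles alternate each round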

The pytest log shows:

Minio container minio_host already exists.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3688718
Waiting for node node1 to open port 23821 for 0.00039696693420410156 seconds.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3688940
Waiting for node node2 to open port 23822 for 7.224082946777344e-05 seconds.
Set node node1 as leader.
Set node node2 as follower.
Uninit node node1
test_i: 0, leader node1 shutdown when insert, insert_line: 902
process 3688718 exited with code 0
test_i: 0, shutdown leader node1, insert_line: 902
test_i: 0, verify data, node_name: node2
Set node node2 as leader.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3690360
Waiting for node node1 to open port 23821 for 0.00010395050048828125 seconds.
Set node node1 as follower.
test_i: 0, verify data, node_name: node1
Uninit node node2
test_i: 1, leader node2 shutdown when insert, insert_line: 1625
process 3688940 exited with code 0
test_i: 1, shutdown leader node2, insert_line: 1625
test_i: 1, verify data, node_name: node1
Set node node1 as leader.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3691504
Waiting for node node2 to open port 23822 for 0.0001227855682373047 seconds.
Set node node2 as follower.
test_i: 1, verify data, node_name: node2
Uninit node node1
test_i: 2, leader node1 shutdown when insert, insert_line: 2604
process 3690360 exited with code 0
test_i: 2, shutdown leader node1, insert_line: 2604
test_i: 2, verify data, node_name: node2
Set node node2 as leader.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3692925
Waiting for node node1 to open port 23821 for 9.608268737792969e-05 seconds.
Set node node1 as follower.
test_i: 2, verify data, node_name: node1
Uninit node node2
test_i: 3, leader node2 shutdown when insert, insert_line: 3294
process 3691504 exited with code 0
test_i: 3, shutdown leader node2, insert_line: 3294
test_i: 3, verify data, node_name: node1
Set node node1 as leader.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3694008
Waiting for node node2 to open port 23822 for 0.00012302398681640625 seconds.
Set node node2 as follower.
test_i: 3, verify data, node_name: node2
Uninit node node1
test_i: 4, leader node1 shutdown when insert, insert_line: 4222
process 3692925 exited with code 0
test_i: 4, shutdown leader node1, insert_line: 4222
test_i: 4, verify data, node_name: node2
Set node node2 as leader.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3695381
Waiting for node node1 to open port 23821 for 9.989738464355469e-05 seconds.
Set node node1 as follower.
test_i: 4, verify data, node_name: node1
Uninit node node2
test_i: 5, leader node2 shutdown when insert, insert_line: 5057
process 3694008 exited with code 0
test_i: 5, shutdown leader node2, insert_line: 5057
test_i: 5, verify data, node_name: node1
Set node node1 as leader.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3696612
Waiting for node node2 to open port 23822 for 0.00014901161193847656 seconds.
Set node node2 as follower.
test_i: 5, verify data, node_name: node2
Uninit node node1
test_i: 6, leader node1 shutdown when insert, insert_line: 6025
process 3695381 exited with code 0
test_i: 6, shutdown leader node1, insert_line: 6025
test_i: 6, verify data, node_name: node2
Set node node2 as leader.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3698035
Waiting for node node1 to open port 23821 for 0.00011110305786132812 seconds.
Set node node1 as follower.
test_i: 6, verify data, node_name: node1
Uninit node node2
test_i: 7, leader node2 shutdown when insert, insert_line: 6742
process 3696612 exited with code 0
test_i: 7, shutdown leader node2, insert_line: 6742
test_i: 7, verify data, node_name: node1
Set node node1 as leader.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3699227
Waiting for node node2 to open port 23822 for 0.00010991096496582031 seconds.
Set node node2 as follower.
test_i: 7, verify data, node_name: node2
Uninit node node1
test_i: 8, leader node1 shutdown when insert, insert_line: 7554
process 3698035 exited with code 0
test_i: 8, shutdown leader node1, insert_line: 7554
test_i: 8, verify data, node_name: node2
Set node node2 as leader.
Start process for node node1: ['build/Debug/src/infinity', '--config=conf/leader.toml']
Launch node1 successfully. pid: 3700473
Waiting for node node1 to open port 23821 for 0.00010037422180175781 seconds.
Set node node1 as follower.
test_i: 8, verify data, node_name: node1
Uninit node node2
test_i: 9, leader node2 shutdown when insert, insert_line: 8308
process 3699227 exited with code 0
test_i: 9, shutdown leader node2, insert_line: 8308
test_i: 9, verify data, node_name: node1
Set node node1 as leader.
Start process for node node2: ['build/Debug/src/infinity', '--config=conf/follower.toml']
Launch node2 successfully. pid: 3701666
Waiting for node node2 to open port 23822 for 0.00012540817260742188 seconds.
Set node node2 as follower.
Uninit node node2
process 3701666 exited with code -6
Uninit node node1
process 3700473 exited with code 0
Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Infinity's stderr shows:

Set node node2 as follower.
libc++abi: terminating due to uncaught exception of type infinity::UnrecoverableException@infinity_exception: Unable to download object: infinity/01935349-e7fd-7b38-9d71-42e289365d46, reason: file /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46 already exists@src/storage/io/s3_client_minio.cpp:39

So I viewed the follower's log. The last few lines show:

[17:52:53.606] [3701808] [info] Open new wal file /var/infinity/follower/wal/wal.log
[17:52:53.606] [3701808] [debug] WAL Entry: 
======= WAL ENTRY =======
[HEADER]
txn id: 11306
commit ts: 22552
size: 72
[CHECKPOINT]
Checkpoint: 
catalog path: catalog/FULL.22552.json
max commit ts: 22552
is full checkpoint: 1
========================

[17:52:53.607] [3701677] [info] Downloading object FULL.22552.json from infinity to /var/infinity/follower/data/catalog/FULL.22552.json
[17:52:53.618] [3701677] [info] infinity/FULL.22552.json downloaded to /var/infinity/follower/data/catalog/FULL.22552.json successfully
[17:52:53.624] [3701808] [info] TxnManager is started.
[17:52:53.624] [3701808] [info] WAL manager is starting...
[17:52:53.624] [3701808] [info] Open wal file: /var/infinity/follower/wal/wal.log
[17:52:53.624] [3701808] [info] WAL manager is started.
[17:52:53.624] [3701808] [info] Background processor is started.
[17:52:53.624] [3701808] [info] Periodic trigger start ...
[17:52:53.625] [3701894] [info] Start in FOLLOWER role
[17:52:53.635] [3701677] [info] Downloading object 01935349-e7fd-7b38-9d71-42e289365d46 from infinity to /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46
[17:52:53.643] [3701677] [info] infinity/01935349-e7fd-7b38-9d71-42e289365d46 downloaded to /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46 successfully
[17:52:53.643] [3701677] [info] Downloading object 01935349-e7fd-7b38-9d71-42e289365d46 from infinity to /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46
[17:52:53.643] [3701677] [critical] Unable to download object: infinity/01935349-e7fd-7b38-9d71-42e289365d46, reason: file /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46 already exists
[17:52:53.652] [3701677] [critical] Error: Unable to download object: infinity/01935349-e7fd-7b38-9d71-42e289365d46, reason: file /var/infinity/follower/persistence/01935349-e7fd-7b38-9d71-42e289365d46 already exists
[17:52:53.652] [3701677] [critical] 0, build/Debug/src/infinity(___interceptor_backtrace+0x4a) [0x641797f719fa]
[17:52:53.652] [3701677] [critical] 1, build/Debug/src/infinity(+0x16c6539) [0x641798a38539]
[17:52:53.652] [3701677] [critical] 2, build/Debug/src/infinity(+0x16c75d5) [0x641798a395d5]
[17:52:53.652] [3701677] [critical] 3, build/Debug/src/infinity(+0x22c5b93) [0x641799637b93]
[17:52:53.652] [3701677] [critical] 4, build/Debug/src/infinity(+0x22ba99c) [0x64179962c99c]
[17:52:53.652] [3701677] [critical] 5, build/Debug/src/infinity(+0x22bc3ba) [0x64179962e3ba]
[17:52:53.652] [3701677] [critical] 6, build/Debug/src/infinity(+0x22bc355) [0x64179962e355]
[17:52:53.652] [3701677] [critical] 7, build/Debug/src/infinity(+0x22bc32d) [0x64179962e32d]
[17:52:53.652] [3701677] [critical] 8, build/Debug/src/infinity(+0x22bbff6) [0x64179962dff6]
[17:52:53.652] [3701677] [critical] 9, build/Debug/src/infinity(+0xc53a1d) [0x641797fc5a1d]
[17:52:53.652] [3701677] [critical] 10, /lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fe79fe9ca94]
[17:52:53.652] [3701677] [critical] 11, /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fe79ff29c3c]
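The last two download lines above show the follower requesting the same object, 01935349-e7fd-7b38-9d71-42e289365d46, twice in a row: the first request writes the local file, and the second aborts because that file now exists. A minimal sketch of that failure mode is below, in Python purely for illustration; the real guard is in the C++ MinIO client at src/storage/io/s3_client_minio.cpp, and all names here are hypothetical.

import os

class UnrecoverableError(RuntimeError):
    # Stand-in for infinity::UnrecoverableException.
    pass

def download_object(bucket: str, key: str, dst_path: str) -> None:
    # Sketch of the guard that produces the crash: refuse to overwrite an
    # existing local file instead of treating the repeated download as a no-op.
    if os.path.exists(dst_path):
        raise UnrecoverableError(
            f"Unable to download object: {bucket}/{key}, "
            f"reason: file {dst_path} already exists")
    with open(dst_path, "wb") as f:
        f.write(b"")  # placeholder for the real object bytes fetched from MinIO

# Same object requested twice, as in the follower log: the first call
# creates the file, the second one raises and terminates the process.
dst = "/tmp/01935349-e7fd-7b38-9d71-42e289365d46"
download_object("infinity", "01935349-e7fd-7b38-9d71-42e289365d46", dst)
download_object("infinity", "01935349-e7fd-7b38-9d71-42e289365d46", dst)  # raises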

I have attached the complete leader and follower logs. Leader log: infinity.log; follower log: infinity.log

Expected behavior

No response

Additional information

No response