Check to see how replace works.

donbing007 commented 4 years ago

Replace updates data that already exists and marks delete before inserting, right?

Does a query between the completion mark and the successful insertion of a new record cause the query to fail to hit?

tomatolog commented 4 years ago

not quite sure what do you want to check?

could you add queries that you want to check along its output?

barryhunter commented 4 years ago

I'm guessing the question is 'REPLACE' atomic? The delete+insert happen in one step?

... such that concurrent SELECT query, would that ever see the situation that the DELETE has happened, But the INSERT not yet? ie the select wouldn't see the old nor the new row in the result.

donbing007 commented 4 years ago

Yeah, that's what I'm asking.

Replace marks deletion before a new record has been inserted. Is the old record in a non-searchable state?

What is the state of the old record if it is currently between steps 1 and 2? The REPLACE transaction has not been committed.

If there are concurrent requests.

Old records will be searched normally.
The old record will no longer exist.

Which case is it?

klirichek commented 4 years ago

Insert/delete/replace rendered into one isolated block, that is 'transaction'. That is not visible until completed. That is like when you create 'delta' index which adds something and maybe deletes something. Such transaction is first written into binlog, and then came to replication provider. When it is distributed over all nodes, provider initiates applying transaction. In that moment transaction became just one more RAM segment of the index. That is - running queries see previous state of the index without new transaction (i.e. nothing inserted, nothing deleted). At the same time daemon makes lightweight modified state with new segment added. Finally previous state retired, and new queries see just new index with all modifications. At this point modifying query returned and you see how many rows affected. Transaction is stored in RAM, and also saved in binlog. When you flush rt index, all RAM segments stored into .ram file, and binlog cleaned. When size of RAM chunk (that is: collection of RAM segments) achieves rt_memlimit, or when you issue 'flush ramchunk', that data became converted into dedicated disk chunk which becames part of the index.

donbing007 commented 4 years ago

I don't know if I got it right.

Another concurrent transaction prior to the commit of the replace command is guaranteed to query the data before the replace command executes, right?

begin;
insert into test(id, test) values(1, 'value0'), (2, 'value1');
commit;

begin;
replace into test(id, test) (1, 'value00');
commit;

If the following query executes concurrently with replace, it should be possible to query a record with id 1, right?

select id from test where test='value0'

githubmanticore commented 4 years ago

➤ Aleksey N. Vinogradov commented:

There are 2 transactions here. parallel query will race for that value, since may came either between transactions, either after both.

donbing007 commented 4 years ago

So I can understand that the replace command is atomic, there's no intermediate state, right?

The other transaction sees either the old value or the new value, right?

sanikolaev commented 4 years ago

Let me comment on this. REPLACE is not guaranteed to be atomic as Manticore is not fully ACID compliant. Here's a simple test which shows that in case of permanent REPLACE'ing the document is unavailable in ~0.6% of cases:

Load:

snikolaev@dev:~$ while true; do mysql -P9306 -h0 -e "replace into t values(123,'abc')"; done;

Test:

snikolaev@dev:~$ while true; do mysql -P9306 -h0 -e "select * from t where match('abc');"|wc -l|grep -v 2; done;
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -P9306 -h0 -e "select * from t where match('abc');"|wc -l; done)|sort|uniq -c
      6 0
    994 2
snikolaev@dev:~$

There are ways to make REPLACE more atomic, but it's not easy and has never been a priority.

donbing007 commented 4 years ago

Let me comment on this. REPLACE is not guaranteed to be atomic as Manticore is not fully ACID compliant. Here's a simple test which shows that in case of permanent REPLACE'ing the document is unavailable in ~0.6% of cases:

Load:
snikolaev@dev:~$ while true; do mysql -P9306 -h0 -e "replace into t values(123,'abc')"; done;
Test:
snikolaev@dev:~$ while true; do mysql -P9306 -h0 -e "select * from t where match('abc');"|wc -l|grep -v 2; done;
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -P9306 -h0 -e "select * from t where match('abc');"|wc -l; done)|sort|uniq -c
      6 0
    994 2
snikolaev@dev:~$
There are ways to make REPLACE more atomic, but it's not easy and has never been a priority.

The scenario I'm using now may face this problem. Any Suggestions on how to deal with it?

At the beginning, our test may not be perfect, and we did not find this problem. Or a false sense of elasticSearch being the same.

sanikolaev commented 2 years ago

With a bigger batch size the situation looks worse:

Without REPLACE load:

snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
   1000 1000

Under the load:

snikolaev@dev:~$ while true; do (echo -e "replace into t values"; for n in `seq 1 999`; do echo -n "($n,'abc'),"; done; echo -n "(1000,'abc');")|mysql -P9306 -h0; done;

SELECT:

snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      3 0
    994 1000
      1 155
      1 241
      1 566
snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      3 0
    997 1000
snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      4 0
    994 1000
      1 31
      1 713
snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      4 0
    994 1000
      1 178
      1 496
snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      4 0
    995 1000
      1 527
snikolaev@dev:~$ (for n in `seq 1 1000`; do mysql -NB -P9306 -h0 -e "select count(*) from t where match('abc');"; done)|sort|uniq -c
      2 0
    994 1000
      1 215
      1 248
      1 310
      1 93

i.e. the chance is about the same - 0.6%, but you can miss the whole batch which is being replaced or a good part of it.

donbing007 commented 2 years ago

This problem is actually fatal to real-time indexing. Is there a plan to fix it?

Manticore has a real-time advantage over elasticsearch, but this problem will offset this advantage.

sanikolaev commented 2 years ago

The core team had a long discussion / brainstorming. In general there are 2 options:

smth like REPLACE ... OPTION atomic=1 - can be done fast, but the performance may be suboptimal. But it's worth testing.
copy-on-write, so each SELECT is kind of isolated in terms of the kill-lists data it has access to. Much more difficult to implement. Can cause extra RAM consumption.

manticoresoftware / manticoresearch

Check to see how replace works. #397