orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.73k stars 869 forks source link

Increase speed of incremental backup by using counting bloom filter #5149

Open andrii0lomakin opened 8 years ago

andrii0lomakin commented 8 years ago

In current implementation of incremental backup we iterate over pages to find which of them were changed, but for huge databases such process may take several hours. It is not really a problem for many cases because we do not block database on writes but sometimes backup should be done in very stiff terms. To avoid given problem we are going to introduce counting bloom filter. Just to clarify. By counting bloom filter I do not mean classic implementation of given data structure which can not be expanded in case of it reaches it's limit of capacity of registered items, but I mean family of modern algorithms with similar behaviour which allows to dynamically expand capacity of "bloom filter".

In nutshell algorithm is following:

  1. For each file we get amount of pages it contains.
  2. We do not load any page instead we merely filter indexes of pages through "bloom filter" which is matter of few binary operations for iteration.
  3. Fetch changed pages and backup them.
  4. During small "freeze" phase when we wait till ongoing atomic operations will be completed to copy database journal with changes happens to the data during backup phase we clear "bloom filter".

Actually that is almost speed limit for incremental backup because in 95% of cases we will load pages which are changed and that is action which we have to do any way.

Also as drawback part of database journal which have to be "cut" and backed up will be really small because of small period of time is needed to perform backup.

finid commented 8 years ago

Nice! Is this incremental backup feature going to be implemented in the automatic backup plugin, too?

andrii0lomakin commented 8 years ago

@finid I am responsible for storage engine but I bet that if you will create issue it will be implemented :-).

finid commented 8 years ago

Cool, I will, then!

lvca commented 8 years ago

I confirm it will be supported by automatic backup component. To track it I've created this issue: https://github.com/orientechnologies/orientdb/issues/5155

lvca commented 8 years ago

This means at every update/delete/insert we should update and save the bloom filter somewhere. What's this cost? Also, this could be seen as an improvement. I think we could stay with current solution, otherwise 2.2 will never reach the RC!

andrii0lomakin commented 8 years ago

This means at every update/delete/insert we should update and save the bloom filter somewhere. >What's this cost?

@lvca That is not true we do not need to save bloom filter anywhere.

Also cost of update of bloom filter is no more than addition time in hashmap . So I do not see any hidden costs which I did not describe.

lvca commented 8 years ago

So do want to keep the BF in RAM only?

andrii0lomakin commented 8 years ago

Sure we do not need them for long term.

andrii0lomakin commented 7 years ago

removed from backlog in favor of https://github.com/orientechnologies/orientdb/issues/7686