orientechnologies / orientdb-labs

OrientDB Labs hosts the latest development version of OrientDB.
Apache License 2.0

[OEP 7] Cluster with single page access in case of fetching record by rid #7

Open andrii0lomakin opened 8 years ago

andrii0lomakin commented 8 years ago

Reference:

https://github.com/orientechnologies/orientdb-labs/blob/master/OEP_7.md

Summary:

Create a cluster which allows accessing only a single page most of the time when we need to fetch a record by RID.

Goals:

Refactor the cluster data structure so that it allows us to:

  1. Fetch a single record from a cluster by loading only one disk cache page in most cases.
  2. Fetch a batch of records by RID using the theoretical minimum number of pages that need to be loaded from the disk cache.
  3. Reuse the space used to store the RID of a deleted record.

Non-Goals:

Create an optimization on the query layer which allows using the new cluster to improve full scan queries.

Success metrics:

  1. Improved speed of YCSB tests in the case when only 20% of the data is cached.
  2. Improved speed of "graphdb" tests, such as finding the shortest path between vertices, traversal of vertices, etc.

Motivation:

Most queries executed on a database may be split into two types:

  1. Search by index to fetch the RIDs of records which match the query conditions, then fetch those records from a cluster.
  2. Filter initial vertices in a graph database, then traverse vertex relationships by some criteria.

All those queries load a massive number of pages while loading records by RIDs. Instead of sorting the RIDs by the indexes of the pages which contain them and then loading all records that belong to a single page at once, we load two pages for each RID.

Description:

We have already implemented a data structure which may be used to overcome all of the issues listed above: the data structure used inside OPaginatedCluster itself. But we have never used the full power of this data structure.

Inside the paginated cluster data structure, each page contains its own rid-record_physical_position_map embedded inside it. So when we create a new record, we add the record to the page, and inside the page we create a link to the real physical position of the record on disk. There will be three types of links:

  1. An internal link, which will be used to refer to a record inside the page.
  2. An external link, which will be used to refer to a record outside of the page (more about this later).
  3. A tombstone, to indicate a deleted record.

When an update is performed, the following actions are done:

  1. If there is enough free space on the current page to put the record inside the current page, then the position inside the rid-position map embedded in the current page is updated.
  2. If there is not enough free space to store the record in the current page, a new free page is allocated and an external reference is created in the embedded rid-position map.

An entry of the embedded rid-position map consists of two items: an internal pointer, which takes 2 bytes to make references inside a 65k page, and a delversion (more details later), which takes 8 bytes and is positive if the link is not a tombstone and negative otherwise. So the total space consumed by a reference is 10 bytes. If we need to create an external reference (which should be a pretty rare case), we create an internal reference to a record which serves as an external reference to another page where the real record is allocated.
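
A minimal sketch of such a 10-byte entry (2-byte in-page pointer plus 8-byte delversion, with a negative delversion marking a tombstone); the class and method names are illustrative, not taken from the OrientDB code base:

```java
import java.nio.ByteBuffer;

// Illustrative layout of one embedded rid-position map entry: 10 bytes total.
final class RidMapEntry {
  static final int SIZE = Short.BYTES + Long.BYTES; // 2 + 8 = 10 bytes

  final short internalPointer; // offset of the record inside the 65k page
  final long delVersion;       // >= 0: live link, < 0: tombstone

  RidMapEntry(short internalPointer, long delVersion) {
    this.internalPointer = internalPointer;
    this.delVersion = delVersion;
  }

  boolean isTombstone() {
    return delVersion < 0;
  }

  void writeTo(ByteBuffer page, int offset) {
    page.putShort(offset, internalPointer);
    page.putLong(offset + Short.BYTES, delVersion);
  }

  static RidMapEntry readFrom(ByteBuffer page, int offset) {
    return new RidMapEntry(page.getShort(offset), page.getLong(offset + Short.BYTES));
  }
}
```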

When creating a record, if there is a tombstone inside the rid-position map, we reuse this tombstone as the RID for the new record. Such an approach allows us to reuse the space of removed records completely, as opposed to now. But RIDs should be unique, so to keep them unique a new 8-byte field called delversion is introduced: when a record is deleted we increment this field and then make it negative, and when the tombstone is reused we make this field positive again.
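
A sketch of just this delversion bookkeeping (delete: increment, then negate; reuse: make positive again), assuming the 8-byte signed field described above:

```java
// Delversion bookkeeping as described above; values are illustrative.
final class DelVersion {
  // Deleting a record: increment the delversion, then make it negative (tombstone marker).
  static long markDeleted(long delVersion) {
    return -(delVersion + 1);
  }

  // Reusing a tombstone for a new record: make the delversion positive again,
  // so the new RID (which embeds the delversion) differs from the deleted one.
  static long reuseTombstone(long tombstoneDelVersion) {
    return -tombstoneDelVersion;
  }
}
```

For example, a live record starts at 0; deletion yields -1; reusing the tombstone yields 1, so the new RID never collides with the old one.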

This implies that we need to change the format of the RID by adding a new field called delversion. After this change, the format of a RID will be the following:

  1. Cluster id, 4 bytes (because we planned to increase the number of bytes needed for the cluster id anyway).
  2. Cluster position, split as follows: the first 2 bytes will be reserved for the index of the position in the embedded rid map inside the page, and the last 6 bytes will be reserved for the index of the page inside the cluster.
  3. Delversion, 8 bytes.

If we create, let's say, 3 records inside a cluster with id=10, using the first page as the container of those records, their RIDs will look like the following: [#10:65537:0, #10:65538:0, #10:65539:0], so as you can see, adjacent records are placed on the same page.
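
The example values can be reproduced with the sketch below; the exact packing used here (page index in the upper 6 bytes, entry index in the lower 2 bytes) is inferred from the RIDs above and is an assumption, not taken from an existing implementation:

```java
// Illustrative packing of the proposed RID components.
final class NewRid {
  final int clusterId;        // 4 bytes
  final long clusterPosition; // 6 bytes page index + 2 bytes entry index
  final long delVersion;      // 8 bytes

  NewRid(int clusterId, long pageIndex, int entryIndex, long delVersion) {
    this.clusterId = clusterId;
    this.clusterPosition = (pageIndex << 16) | (entryIndex & 0xFFFF);
    this.delVersion = delVersion;
  }

  long pageIndex()  { return clusterPosition >>> 16; }
  int  entryIndex() { return (int) (clusterPosition & 0xFFFF); }

  @Override
  public String toString() {
    return "#" + clusterId + ":" + clusterPosition + ":" + delVersion;
  }

  public static void main(String[] args) {
    // Three records on page 1 of cluster 10 print as #10:65537:0, #10:65538:0, #10:65539:0.
    for (int entry = 1; entry <= 3; entry++) {
      System.out.println(new NewRid(10, 1, entry, 0));
    }
  }
}
```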

This cluster format gives the following advantages:

  1. If we fetch a record by RID, in most cases we will fetch it by loading only a single page from the disk cache.
  2. If we need to load a set of RIDs (this includes the cases when we traverse vertices by edges or when we load records from a query which uses indexes), we will sort the RIDs by their positions and will be able to load several records at once by loading only a single page (see the sketch after this list).
  3. We no longer waste the space of removed records but reuse it, which will resolve a few requests made by users and customers.
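
A sketch of the batch-loading idea from point 2: group the requested cluster positions by page index, so each page is loaded from the disk cache only once. The helper is hypothetical and reuses only the packing assumption from the previous sketch:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Group requested cluster positions by the page that contains them, so that a
// batch fetch touches each page once instead of once per record.
final class BatchPlanner {
  static Map<Long, List<Long>> groupByPage(List<Long> clusterPositions) {
    return clusterPositions.stream()
        .collect(Collectors.groupingBy(
            position -> position >>> 16, // page index (see packing assumption above)
            TreeMap::new,                // visit pages in ascending order
            Collectors.toList()));
  }
}
```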

Record preallocation functionality

Right now, to get the RID of a record in a graph database in case of recursive serialization, without having to store an empty record, we perform so-called record preallocation: we allocate a slot in the rid-position map but do not write real data. Then, when the record is serialized, we store the real record inside the data file.

To keep the same functionality, the following is proposed: because the only unknown value is the RID, we may calculate the size needed for a record before we know the real values of the RIDs, then book this space in the data file without writing real data. When we know the RIDs, we will write the real record content as usual, which effectively will be the same record preallocation procedure.
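
A minimal, self-contained sketch of this two-phase idea (book space first, write the content once the RIDs are known); the class and the file handling are purely illustrative and not part of the proposal:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical two-phase allocation: reserve space now, write content later.
final class PreallocatingFile {
  private final FileChannel channel;
  private final AtomicLong appendPointer = new AtomicLong();

  PreallocatingFile(Path path) throws IOException {
    channel = FileChannel.open(path, StandardOpenOption.CREATE,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
  }

  // Phase 1: book `size` bytes and return the reserved offset,
  // without writing any real data yet.
  long preallocate(int size) {
    return appendPointer.getAndAdd(size);
  }

  // Phase 2: once the RIDs are known and the record is serialized,
  // write the real content into the previously booked region.
  void writeAt(long offset, byte[] content) throws IOException {
    channel.write(ByteBuffer.wrap(content), offset);
  }
}
```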

Alternatives:

Keep the original cluster design.

Risks and assumptions:

  1. Because of the bigger space needed to store a record, the initial load may be slower, but that is not guaranteed, because to create a new record we will need one page instead of two, which will also give a good speedup in record creation.
  2. There is a risk of version compatibility issues if some users rely on the RID format.

Impact matrix

luigidellaquila commented 7 years ago

+1 Just a couple of notes about the RID representation with three numbers:

Thanks

Luigi

andrii0lomakin commented 7 years ago

Hi Luigi, let's put 6 bytes. I thought about 8 bytes because its limits ensure that it will never overflow for the whole lifetime of the universe )) But I think even 6 bytes is enough.

About the query format, I think it is possible to support the legacy format, but IMHO we should provide a warning about that.

lvca commented 7 years ago

This change is big, so let's evaluate if/when it makes sense to do it. On a modern HDD, loading an entire page takes about 5-8ms. But on a $300 1TB SSD you can load a 4k page in 0.086ms; let's say double that to load a 64k page: 0.172ms. So with the current structure, in the worst case we spend an extra 0.172ms for each record. Traversing 10k records in the worst case (all random, no data in the cache, we're very unlucky) costs 1.7 seconds more. In reality, PCM files have 12 bytes per record, so every page has 5333 record pointers. If I'm reading sorted by RID, the chance of having the RID in RAM is high.

This is just to say that the actual improvement depends on the use case. But the thing that really matters to me today is all the rest. If today for each traversed record we spend 2ms (loading of the page(s) + unmarshalling + security + hooks + all the rest), traversing 10k non-cached records takes 20 seconds.
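
For reference, the arithmetic behind these (explicitly approximate) estimates:

```latex
0.172\ \text{ms/record} \times 10{,}000\ \text{records} \approx 1.7\ \text{s (extra page loads only)}
2\ \text{ms/record} \times 10{,}000\ \text{records} = 20\ \text{s (total per-record cost today)}
```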

So IMHO it doesn't matter much if, with all the work of this proposal, today we reduce the 20 seconds by just 1.7 seconds, down to 18.3. I think it's much better to focus on reducing the 2ms time per record, and for this we already have many (smaller) issues to achieve this result.

So this is not to go against the proposal; it's just to avoid losing our focus on something that today can provide only small improvements, while in the same time we know we can improve our performance much more.

(NOTE: all these timings weren't measured in benchmarks, so they are not accurate)

andrii0lomakin commented 7 years ago

Hi @lvca. I have 2 questions:

Could you point me to the test on which you base this reasoning? I mean, you write that the test which fetches 10K non-cached records took 20 seconds and that the page fetch time in this test was 0.17 ms. Could you point me to the code so I can measure the page fetch time with YourKit? I think for this reasoning we should operate on precise numbers rather than approximate ones, which may lead us to quite different conclusions.

Meanwhile, I will repeat our graph benchmark for traversing graph items on a cold cache and publish measurements within a week or so, once I finish my current task.

The second question is: do we limit our database to the SSD-only case, or should we improve the speed of the HDD case too? Our HDD performance is also very important. Do you agree?

andrii0lomakin commented 7 years ago

Hi @lvca, if you remember, I have already created a POC for the feature which I described in this OEP. The letter is titled "Data reads 1.4 times faster with new cluster POC, that is not the end".

I will quote it here, because it is concrete proof that your reasoning is not correct. "Hi guys,

I have managed to create a POC which ensures that we access at least one page during a record read and at most 2 pages.

Just to mention: right now we have a cluster which always requires 2 random IOs on a cluster read.

Luigi, that is why we have such awful performance on a cluster scan with a cold cache: we always read several pages using a random IO approach during the cluster scan.

I have created several benchmarks. I created a cluster iteration benchmark for a single profile cluster; because we have 15 such clusters in the pokec database, I multiplied the result by 15 and got a number which is close to the performance of the group-by query. I created a POC of the new cluster with the guarantees listed above and migrated the profile cluster data to the new cluster.

So for a warm cache: the current cluster shows 2513 ms for a full scan, the new cluster shows 1845 ms for a full scan. So for a warm cache the new cluster is 1.4 times faster, and that is not the end.

On a cold cache: the current cluster shows 79950 ms for a full scan (an awful number :-) ), the new cluster shows 26327 ms for a full scan (which is much better :-) ).

So for a cold cache the new cluster is 3 times faster.

Still, that is not the end of all the optimizations.

And of course much more work needs to be done there. But I think the results are very good.

This code is in the https://github.com/orientechnologies/orientdb/tree/single-page-cluster branch. You may test it yourself, but before that you should fill the new cluster with data, so you should call ReadSpeedBenchmarks#createSinglePageCluster() and then run one of the tests singlePageClusterReadBenchmark() or clusterReadBenchmark(). You also should uncomment the whole code, because I commented it out to run the "cold cache" test."

Please pay attention to the words highlighted in bold; I think that is practical proof that your reasoning about the small outcome of this change is incorrect. Creation of this POC took 4-5 days. It is not a large amount of work, and I think I can repeat it. Sadly, I removed this branch.

I want to add that I tested on an HDD, but the warm cache gives results which are better than on an SSD, so I think this change is applicable to both types of drives, HDD and SSD. And it is really worth implementing a change which gives a 1.4 times, or 40%, speedup on a warm cache.

andrii0lomakin commented 7 years ago

Hi @lvca, you asked me in the letter: "Good results! But I don't understand what you changed in the cluster layout to achieve less reads. Could you please explain it better?"

and

Here is my next letter, which I sent to you (you may double-check it), and which proves that I tested exactly the same structure described in the OEP:

" There are following improvements in this area:

  1. Ensure that instead of layout when we have 2 page accesses for each record read , we created layout when we have 1 page access for newly created records for sure, and for updated records we have 1 page access with some probability (if records grow no more than on 1/4 of their size on average or shrink we for sure have 1 page access) and at most 2 page access at any case. To achieve this position-cluster map was removed and i's functionality was moved to main cluster file. So now this cluster contains following types of records:
  2. Record itself.
  3. Link to real position to the record, it is created if we updated record and it does not fit into page so it is moved to another position (external link in OEP).
  4. List of links to the record data it is created if record does not fit into the page during creation (it is really rare case record should be more than 48.7 kbytes to make it happens).
  5. Tombstone if record is removed. Minimization of interaction with direct memory , we have room there, I think we will have even better result. Changes of cluster API because current one really makes all things slow In OEP it is described on moe details. "
andrii0lomakin commented 7 years ago

@lvca, I think the results of the POC, which show a 40% speedup on a hot cache (the all-in-memory case) and a 300% speedup on a cold cache (the pages-on-disk case), prove that this change is worth including in releases ASAP. Also, I do not agree with your reasoning, because you rely on numbers taken not from benchmarks but from intuition, and the real benchmark numbers, as you can see and as we may double-check, are very different.

lvca commented 7 years ago

Your results are from a scan, right? So you could just put the RID inside the record (the format could be ) and you could already do a scan without the use of the PCM. This change is very simple (1-2 days), doesn't break anything, and it would have the same improvement as you reported, right?

andrii0lomakin commented 7 years ago

Hi @lvca, those results are not from a scan. I intentionally omitted the scan (because there we have the page prefetching optimization) and fetched all records directly, to prove that fetching by RID will be faster. You may call it a scan, but I did not fetch all records from a single page at once; I loaded them one by one, as if you were fetching them from indexes or, as suggested by Luigi, from a graph query.

And I think that the 300% speedup of the POC on a cold cache suggests that we should at least repeat the POC and measure performance again. And then, if the result is good, implement it as the new cluster.

About your question on a full scan: yes, it will obviously be faster, but it will require changes in the binary format of the cluster; will we add it in 2.2.x or in a later version? But please note that, in general, it is a non-goal of this optimization.

Also, my question about graph traversal: what test did you use when you wrote that, with a page load time of 0.17 ms, graph traversal takes 20 s on 10k pages? I think that this comparison of times is not completely correct, and I want to profile this case to double-check the page load time.

Also, about my question on HDD: should we improve HDD speed too, or should we base our reasoning only on SSD? Or, more precisely, if some case is fast on SSD, should we neglect the fact that on HDD it will be slow?

andrii0lomakin commented 7 years ago

The last question: if the time to load a page is so small, why in my tests did the cold cache take 79950 ms without the optimization, while the hot cache (also without the optimization) took 2513 ms; in other words, the cold cache required 32 times more time. It directly means that the time needed to load pages is about 30 times bigger than the time needed to execute code on these pages. So if we shrink this time by up to two times it will be noticeable, won't it? (And actually the result of the performance optimization on a cold cache with the new cluster is even better.)
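
For reference, the ratio being quoted:

```latex
\frac{79{,}950\ \text{ms (cold cache)}}{2{,}513\ \text{ms (hot cache)}} \approx 31.8 \approx 32\times
```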

I am definitely looking forward to profiling the test about which you write that the page fetch takes much less time than the code execution; it is a definitely perplexing fact for me.

lvca commented 7 years ago

This big time could be due not to the reading of the page, but to our code that executes the read. How many pages did you read in your test?

About HDD: obviously I care about it, but it's always a balance between costs and benefits. With such a low cost for an SSD today, the performance is orders of magnitude better. So improving performance on slow devices is not our priority now.

That said, if your earlier test spent 79950ms, how much was the page load cost? I mean the Java I/O call? And how much is the cost of the DiskCache, locks, etc.?

andrii0lomakin commented 7 years ago

Hi @lvca, in both tests I read the same number of records (the whole pokec database). So the same code time is spent to load the same number of records. How can it be slower to execute the same amount of code on the same number of records? The code execution times should be the same in both cases, shouldn't they?

Regarding how I did the test, let me explain:

  1. I rebooted the PC.
  2. Ran the test with the optimization.
  3. Rebooted the PC.
  4. Ran the test without the optimization.

And if I remember correctly, I double-checked and repeated the same sequence again, so as not to produce results which would give false hope (it was some time ago, so I am not sure how many times I repeated this sequence).

andrii0lomakin commented 7 years ago

That's said, if in your earlier test it spent 79950ms, how much was the page load cost? I mean the Java I/O call? And how much is the cost of DiskCache, locks, etc?

I did not measure the page load cost, but that is what I intend to do. Give me the test which you referred to, and I will profile it on cold and hot caches.

lvca commented 7 years ago

My question remains; I'll try to explain it better: if your earlier test spent 79950ms, how much was the page load cost? I mean the Java I/O call? And how much is the cost of inserting the loaded buffer into the DiskCache, managing locks, copying memory areas, etc.? I think that loading a page from I/O is X, and inserting that page into the DiskCache is Y, where Y is at least 3X. Reducing the number of loaded pages by half with this proposal would improve things less than improving how we store the loaded page in RAM, with zero compatibility problems and probably less effort on development and testing. This is just a theory; we need numbers.

andrii0lomakin commented 7 years ago

@lvca I did not measure the page load cost, but that is what I intend to do. Give me the test which you referred to, and I will profile it on cold and hot caches.

andrii0lomakin commented 7 years ago

I answered a few minutes ago ))

lvca commented 7 years ago

You could use the same test as you mention, adding a static AtomicLong counter that accumulates the nanoseconds of only the I/O API call, then print the total on closing.
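
A minimal sketch of that instrumentation; the wrapped channel read is a placeholder for whatever I/O call the test exercises, not a reference to actual OrientDB code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.atomic.AtomicLong;

// Accumulates the time spent in the raw I/O call only, as suggested above.
final class IoTimer {
  static final AtomicLong IO_NANOS = new AtomicLong();

  static int timedRead(FileChannel channel, ByteBuffer buffer, long position) throws IOException {
    long start = System.nanoTime();
    try {
      return channel.read(buffer, position); // the only thing being measured
    } finally {
      IO_NANOS.addAndGet(System.nanoTime() - start);
    }
  }

  // Call this when the storage is closed to report the accumulated I/O time.
  static void printTotal() {
    System.out.printf("raw I/O time: %.3f ms%n", IO_NANOS.get() / 1_000_000.0);
  }
}
```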

andrii0lomakin commented 7 years ago

OK, let me do it. I will do two tests then: one to measure fetching of records from a cluster (not a scan), and the second, for example, a "shortest path" test on Wikipedia data. Both on cold and hot caches. That will be interesting. Will keep you posted.

lvca commented 7 years ago

You can use the StressTester to execute a shortest path on an existing database between all the vertices and all the others:

stresstester.sh -m plocal -d /temp/mydb -c 8 -w GSP

To put a limit of a maximum of 10,000 vertices on the Shortest Path workload, do:

stresstester.sh -m plocal -d /temp/mydb -c 8 -w GSP:L10000

Eric24 commented 7 years ago

+1

Note: This proposal includes the ability to reuse space from deleted records, and other references to that specific enhancement request have been closed as a result. I just want to make the suggestion that if this proposal is not adopted, or is substantially delayed, and it's possible to implement the ability to reuse space from deleted records as a separate project, that feature would still be very valuable on its own.

smolinari commented 7 years ago

+1 to what Eric said! :smile:

Scott

rdelangh commented 7 years ago

+1 Different topics mention the requirement to have a neat (=fast) mechanism to 'containerize' groups of data so that

This topic discusses improvements to the clusters, among others the reuse of space from deleted records, but isn't there any corresponding change needed for the indexes?

My concern is that, when you trash (with DROP CLUSTER) a complete, single cluster which contains old records, I expect that the index entries of those records are also deleted in the same action. But the indexes are stored in different files, no? Will gaps occur in these index files that could/should be reused by new index entries? Is there any mechanism already in place that avoids growing index files because space from removed index entries is not reused?

Also, can it happen that the index files in a distributed setup (multiple servers) become inconsistent with the list of clusters when just one cluster is dropped? I refer to traditional relational databases with a 'partitioning' concept, where indexes are also stored in separate partitions. Dropping a data partition also drops the corresponding index partition, so that everything remains consistent.

andrii0lomakin commented 7 years ago

Hi @rdelangh, space in indexes is already reused, so if you drop some values, new items will reuse the same space inside the index.

tglman commented 7 years ago

Hi All,

for keeping the version information inside the record id position there is also an alternative solution: encode the version using a double, keeping the position as the integer part of the double and the version as the decimal part, to have the biggest possible reuse of the disk position for a given id.

This may trick the varint serialization of the RID a bit, but it would be easily solved by writing the 3 values as 3 separate varints, keeping the current advantages of the varint serialization.
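
A sketch of the "three separate varints" idea, using an unsigned LEB128-style encoding (a common varint scheme; the actual varint format used by OrientDB's serializer may differ):

```java
import java.io.ByteArrayOutputStream;

// Encode the three RID components as three separate varints, so each part
// stays small when its value is small (LEB128-style unsigned varint).
final class RidVarInt {
  static void writeVarLong(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow
      value >>>= 7;
    }
    out.write((int) (value & 0x7F));            // last byte
  }

  static byte[] encodeRid(long clusterId, long clusterPosition, long delVersion) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarLong(out, clusterId);
    writeVarLong(out, clusterPosition);
    writeVarLong(out, delVersion);
    return out.toByteArray();
  }
}
```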

Regards

andrii0lomakin commented 7 years ago

Hi @tglman, yes, that is an interesting solution.

Eric24 commented 7 years ago

Maybe I'm not understanding your suggestion, but why does it matter whether you encode the RID as C:R:V vs C:R.V (cluster/record/version, with the second example using the 'double' concept)? As I understand your second sentence, you are essentially advocating C:R:V, so what is the advantage of using a double internally?


lvca commented 7 years ago

@tglman I think you wrote double, but you meant to say "decimal", because a double by definition loses precision (IEEE 754 floating-point "double format" bit layout).

luigidellaquila commented 7 years ago

IMHO this solution does not make much sense as long as we use varints to represent the single parts of the RID.