andrii0lomakin opened 8 years ago
+1 Just a couple of notes about the RID representation with three numbers: "#12:0" in queries and new ORid(12, 0), without the delversion, would mean the record at that position, regardless of the delversion. Of course in the physical layer the delversion will remain, to guarantee that broken links are not wrongly traversed. Thanks,
Luigi
Hi Luigi, let's put 6 bytes. I thought about 8 bytes because its limits ensure that it will never overflow for the entire existence of the universe )) But I think even 6 bytes is enough.
About the query format, I think it is possible to support the legacy format, but IMHO we should emit a warning about it.
This change is big, so let's evaluate if/when it makes sense to do it. On a modern HDD loading an entire page takes about 5-8 ms. But on a $300 1TB SSD you can load a 4k page in 0.086 ms; let's say double that to load a 64k page: 0.172 ms. So with the current structure, in the worst case we spend an extra 0.172 ms for each record. Traversing 10k records in the worst case (all random, no data in cache, we're very unlucky) costs 1.7 seconds more. In reality, PCM files have 12 bytes per record, so every page holds 5,333 record pointers. If I'm reading sorted by RID, the chance of having the RID in RAM is high.
This is just to say that the actual improvement depends on the use case. But what really matters to me today is all the rest. If today for each traversed record we spend 2 ms (loading of the page(s) + unmarshalling + security + hooks + all the rest), traversing 10k non-cached records takes 20 seconds.
So IMHO it doesn't matter if today, with all the work of this proposal, we reduce those 20 seconds by just 1.7 seconds, down to 18.3. I think it's much better to focus on reducing the 2 ms per record, and for that we already have many (smaller) issues that would achieve this result.
So this is not to go against the proposal; it's just to avoid losing our focus on something that today can provide only small improvements, while with the same time we know we can improve our performance much more.
(NOTE: none of these timings were measured in a benchmark, so they are not accurate.)
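To make the arithmetic above explicit, here is a minimal sketch (Java, illustrative only; the constants are the rough figures quoted above, not measured values):

```java
// Back-of-the-envelope check of the figures above (not a benchmark).
public class TraversalEstimate {
    public static void main(String[] args) {
        final int records = 10_000;
        final double extraPageLoadMs = 0.172; // worst case: one extra 64k page load per record
        final double perRecordCostMs = 2.0;   // page load + unmarshalling + security + hooks + ...

        double savedSeconds = records * extraPageLoadMs / 1000.0; // ~1.7 s
        double totalSeconds = records * perRecordCostMs / 1000.0; // ~20 s
        System.out.printf("potential saving: %.1f s out of %.1f s total%n", savedSeconds, totalSeconds);
    }
}
```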
Hi @lvca. I have 2 questions:
Could you point me to the test on which you base this reasoning? I mean, you write that the test which fetches 10K non-cached records took 20 seconds and that the page fetch time in this test was 0.17 ms. Could you point me to the code so I can measure the page fetch time with YourKit? I think for this reasoning we should operate with precise numbers, not approximations, which may lead us to quite different conclusions.
Meanwhile, I will repeat our graph benchmark that traverses graph items on a cold cache and publish the measurements within a week or so, once I finish my current task.
The second question is: do we limit our database to the SSD-only case, or should we improve the speed of the HDD case too? Our HDD performance is also very important. Do you agree?
Hi @lvca, if you remember, I have already created a POC for the feature which I described in this OEP. The email is titled "Data reads 1.4 times faster with new cluster POC, that is not the end".
I will quote it here, because it is concrete proof that your reasoning is not correct. "Hi guys,
I have managed to create a POC which ensures that we access at least one page during a record read and at most 2 pages.
Just to mention, right now we have a cluster which always requires 2 random IO operations per read.
Luigi, that is why we have such awful performance on a cluster scan with a cold cache: we always read several pages using a random IO approach during the scan.
I have created several benchmarks: I created a cluster iteration benchmark for a single profile cluster; because we have 15 such clusters in the Pokec database, I multiplied the result by 15 and got a number which is close to the performance of the group-by query. I created a POC of the new cluster with the guarantees listed above, and migrated the profile cluster data to the new cluster.
So for a warm cache: the current cluster shows 2513 ms for a full scan, the new cluster shows 1845 ms for a full scan. So for a warm cache the new cluster is 1.4 times faster, and that is not the end.
On a cold cache: the current cluster shows 79950 ms for a full scan (an awful number :-) ), the new cluster shows 26327 ms for a full scan (which is much better :-) ).
So for a cold cache the new cluster is 3 times faster.
Still, that is not the end of all possible optimizations.
And of course much more work needs to be done there. But I think the results are very good.
The code is in the https://github.com/orientechnologies/orientdb/tree/single-page-cluster branch. You may test it yourself, but before that you should fill the new cluster with data: call ReadSpeedBenchmarks#createSinglePageCluster() and then run one of the tests, singlePageClusterReadBenchmark() or clusterReadBenchmark(). You should also uncomment the whole code, because I commented it out to run the "cold cache" test. "
Please pay attention to the words highlighted in bold; I think that is practical proof that your reasoning about the small outcome of this change is incorrect. Creating this POC took 4-5 days. It is not a large amount of work, and I think I can repeat it. It is a pity that I removed this branch.
I want to add that I tested on an HDD, but a warm cache gives results which are better than an SSD, so I think this change is applicable to both types of drives, HDD and SSD. And it is really worth implementing a change which gives a 1.4 times, or 40%, speed-up on a warm cache.
Hi @lvca, you asked me in the email: "Good results! But I don't understand what you changed in the cluster layout to achieve less reads. Could you please explain it better?"
and
My next email, which I sent you (you may double-check it), proves that I tested exactly the same structure as is described in the OEP:
"There are the following improvements in this area:
@lvca I think the results of the POC, which show a 40% speed-up on a hot cache (all-in-memory case) and a 300% speed-up on a cold cache (pages-on-disk case), prove that this change is worth including in releases ASAP. Also, I do not agree with your reasoning, because you rely on numbers taken not from benchmarks but from intuition, and the real benchmark numbers, as you can see and as we may double-check, are very different.
Your results are from a scan, right? So you could just put the RID inside the record (the format could be
Hi @lvca, those results are not from a scan. I intentionally avoided the scan (because there we have the page prefetching optimization) and fetched all records directly, to prove that fetch by RID will be faster. You may call it a scan, but I did not fetch all records from a single page at once; I loaded them one by one, as if you were fetching them from indexes or, as suggested by Luigi, from a graph query.
And I think the 300% speed-up of the POC on a cold cache suggests that we should at least repeat the POC and measure performance again. Then, if the result is good, implement it as the new cluster.
About your question on a full scan: yes, it will obviously be faster, but it will require changes in the binary format of the cluster. Will we add it in 2.2.x or in a later version? But please note that, in general, it is a non-goal of this optimization.
Also, my question about graph traversal: what test did you use when you wrote that, with a page load time of 0.17 ms, graph traversal takes 20 s on 10k pages? I think this comparison of times is not completely correct, and I want to profile this case to double-check the page load time.
Also, about my question regarding HDDs: should we improve HDD speed too, or should we base our reasoning only on SSDs? Or, more precisely, if some case is fast on an SSD, should we neglect the fact that it will be slow on an HDD?
The last question: if the time to load a page is so small, why in my tests did the cold cache take 79950 ms without the optimization, while the hot cache (also without the optimization) took 2513 ms, in other words about 32 times less? It directly means that the time needed to load pages is roughly 30 times bigger than the time needed to execute the code on these pages. So if we shrink that time by up to a factor of two it will be noticeable, won't it? (And actually the result of the performance optimization on a cold cache with the new cluster is even better.)
I definitely look forward to profiling the test about which you write that page fetch takes much less time than code execution; it is a definitely perplexing fact for me.
This big time could be due not to the reading of the page, but to our code that executes the read. How many pages did you read in your test?
About HDDs: obviously I care about them, but it's always a balance between cost and benefits. With such a low cost for an SSD today, the performance is orders of magnitude better. So improving performance on slow devices is not our priority now.
That said, if your earlier test spent 79950 ms, how much of that was the page load cost? I mean the Java I/O call? And how much is the cost of the DiskCache, locks, etc.?
Hi @lvca, in both tests I read the same number of records (the whole Pokec database). So the same code is executed to load the same number of records. How can it be slower to execute the same amount of code on the same number of records? Both code execution times should be the same, shouldn't they?
About how I did the test, let me explain.
And if I remember correctly I double-checked, repeating the same sequence again, so as not to produce results which would give false hope (it was some time ago, so I am not sure how many times I repeated the sequence).
That said, if your earlier test spent 79950 ms, how much of that was the page load cost? I mean the Java I/O call? And how much is the cost of the DiskCache, locks, etc.?
I did not measure the page load cost, but that is what I intend to do. Give me the test which you referred to; I will profile it on cold and hot caches.
My question remains; let me try to explain it better: if your earlier test spent 79950 ms, how much of that was the page load cost? I mean the Java I/O call? And how much is the cost of inserting the loaded buffer into the DiskCache, managing locks, copying memory areas, etc.? I think that if loading a page from I/O costs X, then inserting that page into the DiskCache costs Y, where Y is at least 3X. Cutting the number of loaded pages in half with this proposal would improve things less than improving how we store the loaded page in RAM, with zero compatibility problems and probably less development and testing effort. This is just a theory; we need numbers.
@lvca I did not measure the page load cost, but that is what I intend to do. Give me the test which you referred to; I will profile it on cold and hot caches.
I answered a few minutes ago ))
You could use the same test you mention, adding a static AtomicLong counter that accumulates the nanoseconds of only the I/O API call, then print the total on closing.
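A minimal sketch of that kind of instrumentation (the class and method names here are hypothetical, not OrientDB internals; it only illustrates timing the raw I/O call):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

public final class IOTimer {
    // Accumulates nanoseconds spent inside the raw I/O call only.
    public static final AtomicLong IO_NANOS = new AtomicLong();

    // Wrap only the Java I/O call, so the counter excludes DiskCache inserts, locks, etc.
    public static <T> T timed(Callable<T> ioCall) throws Exception {
        long start = System.nanoTime();
        try {
            return ioCall.call();
        } finally {
            IO_NANOS.addAndGet(System.nanoTime() - start);
        }
    }

    // Call this when the benchmark closes to print the accumulated pure I/O time.
    public static void printOnClose() {
        System.out.printf("pure I/O time: %.1f ms%n", IO_NANOS.get() / 1_000_000.0);
    }
}
```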
OK, let me do it. I will do two tests then: one to measure fetching of records from a cluster (not a scan), and a second one, for example a "shortest path" test on Wikipedia data. Both on cold and hot caches. That will be interesting. I will keep you posted.
You can use the StressTester to execute a shortest path on an existing database from all the vertices to all the others:
stresstester.sh -m plocal -d /temp/mydb -c 8 -w GSP
To limit the Shortest Path workload to a maximum of 10,000 vertices, do:
stresstester.sh -m plocal -d /temp/mydb -c 8 -w GSP:L10000
+1
Note: This proposal includes the ability to reuse space from deleted records, and other references to that specific enhancement request have been closed as a result. I just want to make the suggestion that if this proposal is not adopted, or is substantially delayed, and it's possible to implement the ability to reuse space from deleted records as a separate project, that feature would still be very valuable on its own.
+1 to what Eric said! :smile:
Scott
+1 Different topics mention the requirement to have a neat (=fast) mechanism to 'containerize' groups of data so that
This topic discusses improvements to the clusters, among others the reuse of space from deleted records, but isn't there any corresponding change needed on the indexes?
My concern is that when you trash (with DROP CLUSTER) a complete, single cluster which contains old records, I expect the index entries for those records to also be deleted in the same action. But the indexes are stored in different files, no? Will gaps occur in these index files that could/should be reused by new entries? Is there any mechanism already in place that prevents index files from growing because space from removed entries is not reused?
Also, can it happen that the index files in a distributed setup (multiple servers) become inconsistent with the list of clusters from which just one cluster was dropped? I'm referring to traditional relational databases with a 'partitioning' concept, where indexes are also stored in separate partitions. Dropping a data partition also drops the corresponding index partition, so that everything remains consistent.
Hi @rdelangh, space in indexes is already reused, so if you drop some values, new items will reuse the same space inside the index.
Hi All,
for keeping the version information inside the record id position there is also an alternative solution: encode the version using a double, keeping the position as the integer part of the double and the version as the decimal part, to allow the biggest possible reuse of the disk position for a given id.
This may trip up the varint serialization of the RID a bit, but that would be easily solved by writing the 3 values as 3 separate varints, keeping the current advantages of the varint serialization.
Regards
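A minimal sketch of the three-varints idea from the comment above (illustrative only; this is not OrientDB's actual RID serializer):

```java
import java.io.ByteArrayOutputStream;

public final class RidVarInt {
    // Standard unsigned LEB128-style varint: 7 bits per byte, high bit = continuation.
    static void writeVarInt(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Serialize the three RID components as three independent varints,
    // so small positions/versions keep their compact encoding.
    public static byte[] serialize(long clusterId, long position, long version) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(out, clusterId);
        writeVarInt(out, position);
        writeVarInt(out, version);
        return out.toByteArray();
    }
}
```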
Hi @tglman, yes, that is an interesting solution.
Maybe I'm not understanding your suggestion, but why does it matter whether you encode the RID as C:R:V vs C:R.V (cluster/record/version, with the second example using the 'double' concept)? As I understand your second sentence, you are essentially advocating C:R:V, so what is the advantage of using a double internally?
@tglman I think you wrote "double", but you meant to say "decimal", because a double by definition loses precision (IEEE 754 floating-point "double format" bit layout).
IMHO this solution does not make much sense as long as we use varints to represent the individual parts of the RID.
Reference:
https://github.com/orientechnologies/orientdb-labs/blob/master/OEP_7.md
Summary:
Create a cluster which allows accessing only a single page most of the time when we need to fetch a record by RID.
Goals:
Refactor the cluster data structure to allow:
Non-Goals:
Create an optimization on the query layer which uses the new cluster to improve full-scan queries.
Success metrics:
Motivation:
Most of the queries which are executed on the database may be split into two types:
All those queries load a massive number of pages while loading records by RIDs. Instead of sorting the RIDs by the indexes of the pages which contain them and then loading all records that belong to a single page at once, we load two pages for each RID. A sketch of this batching idea follows below.
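A sketch of grouping RIDs by the page that contains them, so each page is loaded once (the helper names and the position-to-page mapping are assumptions, based on the example RIDs shown later in this proposal):

```java
import java.util.*;

public final class RidBatcher {
    // Assumed encoding: the page index lives in the high bits of the record position.
    private static final long RECORDS_PER_PAGE = 1L << 16;

    // Groups record positions by page index so all records of a page can be read together.
    public static Map<Long, List<Long>> groupByPage(Collection<Long> positions) {
        Map<Long, List<Long>> byPage = new TreeMap<>();
        for (long position : positions) {
            long pageIndex = position / RECORDS_PER_PAGE;
            byPage.computeIfAbsent(pageIndex, p -> new ArrayList<>()).add(position);
        }
        return byPage; // load each page once, then read all of its records
    }
}
```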
Description:
We have already implemented a data structure which may be used to overcome all of the issues listed above: the data structure used inside OPaginatedCluster itself. But we have never used the full power of this data structure.
Inside the paginated cluster data structure, each page contains its own embedded rid-record_physical_position_map. So when we create a new record, we add the record to the page, and inside the page we create a link to the real physical position of the record on disk. There will be three types of links:
When an update is performed, the following actions are taken:
An entry of the embedded rid-position map consists of two items: an internal pointer, which takes 2 bytes (enough to reference any offset inside a 65k page), and a delversion (more details later), which takes 8 bytes and is positive if the link is not a tombstone and negative otherwise. So the total space consumed by a reference is 10 bytes. If we need to create an external reference (which should be a pretty rare case), we create an internal reference to a record which serves as an external reference to another page where the real record is allocated.
When creating records, if there is a tombstone inside the rid-positions map, we reuse this tombstone as the RID for the new record. This approach allows us to reuse the space of removed records completely, as opposed to now. But RIDs should be unique, so to keep them unique a new 8-byte field called delversion is introduced: when a record is deleted, we increment this field and then make it negative; when the tombstone is reused, we make the field positive again.
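A minimal sketch of the entry layout and the delete/reuse lifecycle described above (class and method names are hypothetical, not the actual OPaginatedCluster code):

```java
// Models one embedded rid-position map entry:
// 2-byte in-page pointer + 8-byte delversion; a negative delversion marks a tombstone.
public final class PositionMapEntry {
    private short inPagePointer; // offset of the record inside the 65k page
    private long delversion = 0; // >= 0: live link, < 0: tombstone

    boolean isTombstone() {
        return delversion < 0;
    }

    // Delete: increment the delversion and flip it negative to mark the slot as a tombstone.
    void markDeleted() {
        delversion = -(Math.abs(delversion) + 1);
    }

    // Reuse the tombstone for a new record: the slot (and RID position) is recycled,
    // but the incremented delversion distinguishes the new RID from the deleted one.
    void reuse(short newPointer) {
        if (!isTombstone())
            throw new IllegalStateException("entry is still live");
        inPagePointer = newPointer;
        delversion = Math.abs(delversion);
    }
}
```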
This implies that we need to change the format of the RID by adding a new field called delversion. After this change, the format of the RID will be the following:
If we create, let's say, 3 records inside the cluster with id=10, using the first page as the container for those records, their RIDs will look like the following:
[#10:65537:0, #10:65538:0, #10:65539:0]
so, as you can see, records with adjacent positions are placed on the same page. This cluster format gives the following advantages:
Record preallocation functionality
Right now, to avoid having to store an empty record just to obtain the RID of a record (which is needed in the graph database because of recursion during serialization), we perform so-called record preallocation: we allocate a slot in the rid-position map but do not write real data. Then, when the record is serialized, we store the real record inside the data file.
To keep the same functionality, the following is proposed: because the only unknown value is the RID value, we can calculate the size needed for a record before we know the real RID values, then book this space in the data file without writing real data. Once we know the RIDs, we write the real record content as usual, which is effectively the same record preallocation procedure.
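A rough sketch of that reserve-then-fill sequence (all names are hypothetical; this only illustrates the proposed flow, not an actual API):

```java
// Illustrative interface for the preallocation flow in the proposed cluster layout.
public interface SinglePageCluster {
    // Reserves `size` bytes in the data file plus a slot in the embedded rid-position map,
    // returning the new position without writing any record content yet.
    long preallocate(int size);

    // Writes the real content into the previously reserved slot once the RIDs are known.
    void fillPreallocated(long position, byte[] content);
}

// Usage sketch: the size is computable up front because only the RID values are unknown.
//   long pos = cluster.preallocate(serializedSizeOf(record));
//   ... serialize the record with the now-known RIDs ...
//   cluster.fillPreallocated(pos, bytes);
```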
Alternatives:
Keep original cluster design.
Risks and assumptions:
Impact matrix