orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0

Schema driven serialization #1890

Closed lvca closed 10 years ago

lvca commented 10 years ago

While we're designing the new OrientDB binary serialization format, a new idea came to mind. OrientDB can work in schema-free mode, but most of the time users have a schema: individual records may extend it with extra fields, but records are mostly schema-based. Example:

|id|employee|amount|quantity|total|

With the classic schema-less binary serialization proposed in #681, all the field names (metadata) go in the document. If the document is: { id : 2232324, employee : 'Luca', amount : 1000, quantity : 4, total : 4000 }

Using this example means we also store metadata as:

|id|employee|amount|quantity|total|
(2+8+6+8+5) characters * 2 bytes per character = 58 bytes

And the field values: id:4, employee:8, amount:8, quantity:4, total:8 = 32 bytes

So the metadata is bigger than the data itself! And this metadata is replicated across all the records even if all the records have the same structure. So this is the idea: use the schema's metadata for serialization, avoiding storing it in each document. The OProperty metadata will have 2 new fields:

In this way the record will be much smaller and the metadata doesn't have to be marshalled/unmarshalled every time.

How to manage variable-sized properties?

Store only the offset (an integer, 4 bytes) where the variable-sized property value starts. Example with the "employee" field:

fields  |      id|employee|amount|quantity|total|
offsets |       0|       4|     8|      16|   20|
length  |       4|       4|     8|       4|    8|
value   | 2232324|      28|  1000|       4| 4000|Luca

Note that the value in the employee slot is 28, the first position after the fixed-size structure (which is 28 bytes long).
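
To make the layout concrete, here is a minimal Java sketch of this schema-driven layout, assuming the variable-sized value is stored with a 4-byte length prefix; class and method names are illustrative, not the actual OrientDB serializer API:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: serializes the example record with fixed-size slots at schema-known
// offsets (28 bytes total) and the variable-sized "employee" value appended after them.
public class SchemaDrivenLayoutSketch {
  static final int FIXED_SIZE = 4 + 4 + 8 + 4 + 8; // id, employee offset, amount, quantity, total = 28

  public static byte[] serialize(int id, String employee, long amount, int quantity, long total) {
    byte[] name = employee.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(FIXED_SIZE + 4 + name.length);
    buf.putInt(id);          // offset 0
    buf.putInt(FIXED_SIZE);  // offset 4: pointer to where the employee value starts (28)
    buf.putLong(amount);     // offset 8
    buf.putInt(quantity);    // offset 16
    buf.putLong(total);      // offset 20
    buf.putInt(name.length); // offset 28: length prefix of the variable-sized value (an assumption)
    buf.put(name);           // the string bytes themselves
    return buf.array();
  }

  public static String readEmployee(byte[] record) {
    ByteBuffer buf = ByteBuffer.wrap(record);
    int valueOffset = buf.getInt(4);      // read the pointer stored in the fixed slot
    int length = buf.getInt(valueOffset); // read the length prefix
    return new String(record, valueOffset + 4, length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] record = serialize(2232324, "Luca", 1000, 4, 4000);
    System.out.println(readEmployee(record)); // prints "Luca"
  }
}

Note that no field name appears in the serialized bytes; the offsets come entirely from the schema.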

How to manage schema-mixed mode?

Append non-schema-based properties at the end of the fixed-length record, using the same schema-less binary serialization implementation.

Pros & Cons

Pros:

Cons:

If the structure is well defined with few exceptions, schema-driven serialization wins, unless the schema changes very frequently.

andrii0lomakin commented 10 years ago

It is a nice idea. But guys, what I propose is to create a full spec before the implementation. It will help us to see the full picture, and it is useful for users too.

lvca commented 10 years ago

Yes, this isn't an implementation; it's part of the spec.

lvca commented 10 years ago

In this way we could support partial serialization too. For example, if the user does:

document.field( "salary", document.field( "salary" ) + 100 );

We could write only that property (8 bytes in this example) without serializing all the content and writing all the data to disk. So at this point the storage API should be refactored to support partial updates with an offset + length.
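
As a rough illustration (hypothetical names, not the real storage API), a partial update could look like the sketch below: given the record's position and the property's schema offset, only that slice of the record is rewritten.

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch of a partial update: rewrite only the bytes of a fixed-size
// property at a schema-known offset instead of re-serializing the whole record.
public class PartialUpdateSketch {
  public static void updateDoubleProperty(RandomAccessFile file, long recordPosition,
                                          int propertyOffset, double newValue) throws IOException {
    file.seek(recordPosition + propertyOffset);
    file.writeDouble(newValue); // 8 bytes touched instead of the full record content
  }
}

An offset + length pair like this is essentially all the storage layer would need to expose for such updates.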

andrii0lomakin commented 10 years ago

@lvca The partial update approach leads to massive use of random I/O, which means that all updates will be slow. For a sequential update on disk it does not make much difference whether it is 100 bytes or 200 bytes, but there is a huge difference between a sequential update and a head seek. Quotation from IBM research: ". . . The time to complete a [read of a single page] could be estimated to be about 20 ms (assumes 10ms seek, 8.3ms rotational delay, 1.7ms read) . . . The time to perform a sequential prefetch read [of 64 contiguous pages] could be estimated to be about 125ms (assumes 10ms seek, 8.3ms rotational delay, 106.9ms read of 64 records [pages]), or about 2 ms per page." (Wodnicki, J.M. and Kurtz, S.C.: GPD Performance Evaluation Lab Database 2 Version 2 Utility Analysis, IBM Document Number GG09-1031-0). So sequential writes (or reads, it does not matter) are about 10 times faster; in other words, partial updates will be about 10 times slower.

Also, did you read my proposal about the lock-free cluster https://github.com/orientechnologies/orientdb/issues/1600#issuecomment-30494233, which allows sequential updates without the random I/O slowdown?

Also, you suppose that a partial update is only a write of, let's say, 10 bytes, but the disk cache or mmap (it does not matter which) operates on pages, which means that updating a single record requires writing the whole page to disk. So 90% of the write time will be wasted and 90% of the loaded page memory will be wasted, whereas with append-only updates we use the full page capacity, because the whole appended page is filled with updated records.

lvca commented 10 years ago

Partial update means serializing a fraction of the record, so it's obviously faster than serializing the entire record. On writes, it depends. We have a 64K page size by default, but many OSs use 4k only, which means we have 16 physical pages in memory per internal page. So if I change an integer, 4 bytes, the current approach writes 64k = 16 pages. With a partial update you have the chance to write only 1 page instead of 16, so it should be faster. You would be right if we used page size = OS page size; then it shouldn't make any difference.
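
A quick back-of-the-envelope of this argument, under the assumptions stated above (64 KB internal page, 4 KB OS pages, one 4-byte integer changed):

// Rough arithmetic for the page-size argument above.
public class PageWriteMath {
  public static void main(String[] args) {
    final int internalPageSize = 64 * 1024; // default internal page size mentioned above
    final int osPageSize = 4 * 1024;        // common OS page size

    int osPagesPerFullWrite = internalPageSize / osPageSize; // 16 OS pages dirtied
    int osPagesPerPartialWrite = 1; // a 4-byte integer fits in one OS page (assuming it doesn't straddle a boundary)

    System.out.println("Full-page write dirties " + osPagesPerFullWrite + " OS pages");
    System.out.println("Partial write dirties   " + osPagesPerPartialWrite + " OS page");
  }
}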

andrii0lomakin commented 10 years ago

@lvca That is true for in-memory updates, but what about random I/O? If a partial update takes 0.2 ms vs 0.8 ms, but we have to spend 20 ms on a disk seek, how will it be faster? Do you want a benchmark from me of random I/O updates of 10 bytes on a random access file vs append-only updates of a 500-byte record as proof? It is quite easy to do.

lvca commented 10 years ago

@laa You need a seek for the page too. The seek is the same in both cases.

andrii0lomakin commented 10 years ago

@lvca No, because of the write cache, which flushes all writes in one shot for append-only updates. Let's say we update 1000 pages. For partial updates we have: one disk seek to read each page and one disk seek for each page written, so 2000 disk seeks.

For append-only updates: one disk seek to read each page and only one disk seek for all pages written (because the update is sequential), so we have 1001 disk seeks.

Of course some of these disk seeks will be amortized by the disk cache anyway, but even if only 40% of the seeks are really performed, we still have about 400 more disk seeks than for append-only updates, or about 8000 ms slower.

Also, one more note: record updates will be scattered across pages, so we will load 1000 pages but update only a portion of them, say 200. I can provide proof of this with a prototype.
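
Spelled out, the seek arithmetic from the comments above, under their own assumptions (1000 updated pages, ~20 ms per seek, roughly 40% of seeks not absorbed by the cache):

// Seek-count arithmetic for the discussion above, using its own assumptions.
public class SeekMath {
  public static void main(String[] args) {
    final int pages = 1000;
    final double seekMs = 20.0; // ~20 ms per random access, per the IBM figures quoted earlier

    int partialUpdateSeeks = pages + pages; // one read seek and one write seek per page = 2000
    int appendOnlySeeks = pages + 1;        // one read seek per page, one sequential write = 1001

    int extraSeeks = (int) ((partialUpdateSeeks - appendOnlySeeks) * 0.4); // ~40% really hit the disk
    System.out.println("Extra seeks: " + extraSeeks + " (~" + (extraSeeks * seekMs) + " ms)"); // ~400 seeks, ~8000 ms
  }
}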

andrii0lomakin commented 10 years ago

I think we should continue after I write the documentation for the current caching system.

lvca commented 10 years ago

Append updates: 1 seek is true only if you have all 1000 pages in sequence, but that might not be true (see below).

However, updating all-sequential pages is quite rare, so you have to seek. Partial updates could work in memory, and the storage could keep the lower and higher bounds of the updates, then flush only the range between lower -> higher. When you're updating multiple pages (already ordered by disk offset), you can avoid the seek before a partial page update only when you have just written the page right before the current one.

But other considerations are: 1) file system fragmentation: it's not always true that pages are sequential, so with fragmented pages the seek occurs anyway; 2) this is the most important one. I could be wrong here because I don't know how it is implemented, but IMHO when you have an algorithm with multiple write calls, the OS could treat each write as atomic and serve concurrent reads/writes in between, so seeks happen again between the writes of sequential pages. Example:

for( Page p : orderedPages ){
  p.write();
  // HERE THE OS COULD DO OTHER I/O OPERATIONS!
}

Between each p.write() the OS could do other operations on the drive, so a seek is needed again. The only way to reduce seeks is to: 1) order the pages, like now; 2) write all the sequential pages in one shot, as if they were just one.
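
A sketch of that second option, assuming the dirty pages are already ordered by disk offset: coalesce physically adjacent pages into runs and issue one gathering write per run (illustrative only, not the actual OrientDB write cache).

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.List;

// Illustrative: group pages that are contiguous on disk and flush each run with a
// single write call, so the head seeks once per run instead of once per page.
public class RunCoalescingFlushSketch {
  static final int PAGE_SIZE = 64 * 1024;

  record DirtyPage(long pageIndex, ByteBuffer data) {}

  public static void flush(FileChannel channel, List<DirtyPage> orderedPages) throws IOException {
    int i = 0;
    while (i < orderedPages.size()) {
      int runStart = i;
      // Extend the run while the next page is physically adjacent to the current one.
      while (i + 1 < orderedPages.size()
          && orderedPages.get(i + 1).pageIndex() == orderedPages.get(i).pageIndex() + 1) {
        i++;
      }
      List<DirtyPage> run = orderedPages.subList(runStart, i + 1);
      ByteBuffer[] buffers = run.stream().map(DirtyPage::data).toArray(ByteBuffer[]::new);
      channel.position(run.get(0).pageIndex() * PAGE_SIZE);
      channel.write(buffers); // one gathering write for the whole contiguous run
      i++;
    }
  }
}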

andrii0lomakin commented 10 years ago

@lvca I already knew everything you wrote. And append-only updates do allow writing all 1000 pages in one shot, and this can be true. All popular DBMSs use append-only updates instead of in-place updates. Oracle, MySQL, PostgreSQL, Cassandra all use a similar approach; that is why they have "Compaction", "Purge", "Vacuum" background tasks. The tasks have different names but the same goal: do append-only updates and then perform defragmentation in the background.

Why don't you want to have proof? Let's just do in-place and append-only updates and see the performance difference in an MT environment. We can have 100,000 pages (6 GB) and, let's say, 1 GB of disk cache, then update page content in place (like 4 bytes) and do append-only updates (like 32 bytes, using your example), updating 10,000 pages (650 MB) with 8 threads and the same locking approach?

lvca commented 10 years ago

Append-only is good in general; this thread was about partial updates, which doesn't mean in-place updates, but only optimizing writes by avoiding full-page writes + faster serialization.

andrii0lomakin commented 10 years ago

This is a comment from shadders.del@gmail.com :+1: "This is probably going to be a stupid question because the solution seems so obvious that I must have missed something fundamental.

I found OrientDB when I gave up on MongoDB due to the issue of storing field names in every document (for a lot of my data the field names are larger than the data itself). I just came across issue #1890 and am happy to see that Orient considers this a priority, but I don't quite understand the need for such a complex approach.

Why not simply maintain an internal index of field names and store the index? It wouldn't really matter if you had different classes with the same field name, since the name is all you are interested in. To further compact things you could use a format like Google protobuf's 'varint' type. If you altered the varint format so the first byte 'grouping' was 16 bits rather than 8, you'd have 32k field names available before needing to expand (which would cover an awful lot of use cases).

The lookup would be as trivial as an array lookup, and any overhead would be more than offset by the benefit of being able to cache many more records in memory thanks to the space savings. Another potential advantage is that you only ever use one instance of each field name String, which vastly improves any map lookups done internally. If the current format writes the actual field name as a string, then every time a field is read a new string is created, so for every field of every record where a map lookup is required it must compute the hashcode and run a manual char-by-char equals(). That's 3 traversals of the string saved on the first lookup (1 for the hashcode and 1 for each of the two strings) and 2 saved on subsequent lookups.

On the client side I suppose there is the issue of whether the client should keep the entire lookup table in memory. It could be passed portions of it as needed and use something like a Trove map for lookups. Not quite as fast as an array lookup, but again I would imagine the savings in memory, bandwidth, etc. would more than offset the cost.

I must be missing something?"
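
For reference, a minimal sketch of the dictionary idea described in the quoted comment: distinct field names are interned once into a global table, and documents store only a small integer id, shown here with a plain protobuf-style varint (the 16-bit first-group variant mentioned above would just change the encoding). Class names are illustrative.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative field-name dictionary: each distinct name is stored once, and documents
// reference it by a varint-encoded id instead of repeating the full string.
public class FieldNameDictionarySketch {
  private final Map<String, Integer> idsByName = new HashMap<>();
  private final List<String> namesById = new ArrayList<>();

  public synchronized int idOf(String fieldName) {
    return idsByName.computeIfAbsent(fieldName, name -> {
      namesById.add(name);
      return namesById.size() - 1;
    });
  }

  public synchronized String nameOf(int id) {
    return namesById.get(id); // a plain array lookup, as suggested above
  }

  // Standard unsigned varint (7 data bits per byte), as used by protocol buffers.
  public static void writeVarInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }
}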

lvca commented 10 years ago

The pros will be huge:

So I'm pretty positive, and I expect a huge boost in performance with 2.0!

steffenyount commented 10 years ago

I was looking at Parquet ( http://parquet.io/ ) a while back, and they are doing some interesting things with schemas and graph serialization based on the algorithms worked out for Google's Dremel. These algorithms allow them to store an object tree/graph structure in an efficient column-oriented format.

I don't know if the algorithms would be compatible with OrientDB, but if they were it could be pretty amazing to see column-store levels of performance on compression and aggregation queries...

If you're not familiar with the algorithms here's a pretty good intro link: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

and the original Dremel paper: http://research.google.com/pubs/pub36632.html

andrii0lomakin commented 10 years ago

That is exactly what I meant when I proposed the MBSM storage format in https://github.com/orientechnologies/orientdb/issues/681#issuecomment-28466948: performance comparable to column-store levels on SQL queries. I will look closely at your links.

lvca commented 10 years ago

@steffenyount The column storage model is very different from the document/graph database models. With the column model it's as if you did a pivot of the classic table of records/fields. This would mean no longer loading record #13:4 as a whole, but rather loading all the fields of record #13:4 as separate records. Am I missing anything?

steffenyount commented 10 years ago

I think of the relationships between the following different types of data stores like this:

  Row Store  -> Column Store
     |                |
     V                V
Object Store -> Dremel Store

Some advantages that Column stores have over Row stores:

  1. Selecting a few columns from objects with many columns does not require loading data from the unreferenced columns and their data blocks.
  2. Column stores implicitly group data by type, providing increased compression opportunities with self-similar data values. This means more data fits in memory and less I/O is required to process the same number of field values.
  3. Table range scan/aggregation operations benefit significantly from contiguous column data, since the adjacent values loaded into CPU cache lines are typically the next values of interest.

How Dremel works:

To seek and iterate through object records in a Dremel store, a fleet of state machines is initialized (one per selected field); they are driven in parallel to seek the target record's offset within their own column data; from there the object records can be reassembled from the field values reported by each of these state machines in turn, then the next object, and the next, etc.
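
A heavily simplified sketch of that reassembly loop, for flat columns only (it ignores the repetition and definition levels Dremel uses for nested and repeated fields); names are illustrative:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Greatly simplified column reassembly: one reader per selected column, all positioned
// at the same record index and advanced in lockstep to rebuild each record.
public class ColumnReassemblySketch {
  static final class ColumnReader {
    final String field;
    final List<Object> values; // one value per record, stored in record order
    int position;

    ColumnReader(String field, List<Object> values) {
      this.field = field;
      this.values = values;
    }

    void seek(int recordIndex) { position = recordIndex; }

    Object next() { return values.get(position++); }
  }

  // Reassemble `count` records starting at `startRecord`, touching only the selected columns.
  static List<Map<String, Object>> assemble(List<ColumnReader> selected, int startRecord, int count) {
    for (ColumnReader reader : selected) {
      reader.seek(startRecord);
    }
    List<Map<String, Object>> records = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      Map<String, Object> record = new LinkedHashMap<>();
      for (ColumnReader reader : selected) {
        record.put(reader.field, reader.next());
      }
      records.add(record);
    }
    return records;
  }
}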

In the testing done for the Dremel paper (linked in previous post) they claim: "A natural question to ask is where the top and bottom graphs cross, i.e., record-wise storage starts outperforming columnar storage. In our experience, the crossover point often lies at dozens of fields but it varies across datasets and depends on whether or not record assembly is required"

Why I am interested in this:

One area where Column stores are known to shine is in Analytics/OLAP/Star Schema/Data Mining type applications, where you typically select some cut of your data (e.g: Yesterday, San Francisco, Women) and then aggregate (sum/average/count/min/max) over the data points in that selection. Another thing you'd typically want to do is look at how your selected cuts roll up into a bigger grouping (e.g: Yesterday, California, Women).

Hierarchical roll-ups are common, but the existing systems tend to get very clumsy when you start to work with ragged and arbitrary roll-up hierarchies. The classic bridge table solution for this requires using O(n^2) rows to include all pair-wise relationships found in an arbitrary tree hierarchy with n edges.

I wonder if it is possible to marry the graph DB concept with the columnar storage concept to get both efficient data selection in ragged, arbitrary hierarchical roll-ups and efficient aggregation of the selected field data.

andrii0lomakin commented 10 years ago

Hi guys, once again I want to write that I supported before, and still support now, the idea explained by @steffenyount. It is what I was writing about in an internal letter related to binary serialization. I will provide details of MBSM and will look closely at Dremel by the end of next week, plus append-only cluster support (we cannot separate the cluster design and the binary serialization at this stage, because such record organization affects the locking and durability subsystems; maybe later we will see how we can separate them). Whew, a lot of work.

lvca commented 10 years ago

@laa So could you explain how you'd like to manage records and fields? Let's use some graphical representation, ASCII art or even an uploaded hand-written picture.

andrii0lomakin commented 10 years ago

Yes, sure.

acmeguy commented 10 years ago

Hi,

I'm not sure if it's relevant for this discussion but I would like to add two things to consider.

  1. Immutable, append-only records would never require any oversize. If this new serialization could benefit from knowing about such entries, then please consider them.
  2. I'm using Snappy to compress verbose/additional event information and storing it in a Custom field. I see that some document stores (CouchDB with snappy) use fast compression for longer field values. I also found this quite interesting: https://www.google.is/url?sa=t&rct=j&q=&esrc=s&source=web&cd=26&cad=rja&uact=8&ved=0CEcQFjAFOBQ&url=http%3A%2F%2Fojs.academypublisher.com%2Findex.php%2Fjcp%2Farticle%2FviewFile%2F041212631274%2F1331&ei=j_gvU6zLI4ud7QaruYGQBg&usg=AFQjCNHhEugDAS1ybKw0eFHvkh_ujfS4nA&bvm=bv.62922401,d.ZGU

Very best regards, -Stefán

acmeguy commented 10 years ago

I guess my point is this.

If you are about to change the serialization to improve efficiency then please take it as far as possible/reasonable.

Wish I was in a position to help.

Best wishes, -Stefan

lvca commented 10 years ago

Hi Stefan, I'd prefer to start the binary serialization ASAP (because this is our biggest bottleneck right now!) by supporting a pluggable serializer + version. This way we don't have to wait until we find the best algorithm ever (that could take months and many POCs); we can start with something reasonably fast and then improve the algorithm in new versions while maintaining backward compatibility.
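
A hypothetical shape for such a pluggable, versioned serializer (interface and class names are illustrative, not the actual OrientDB classes): the first byte of every stored record identifies the format, so new formats can be introduced while older records stay readable.

import java.util.HashMap;
import java.util.Map;

// Illustrative registry of pluggable serializers keyed by a version byte stored
// as the first byte of each record, allowing the format to evolve over time.
public class SerializerRegistrySketch {
  public interface RecordSerializer {
    byte version();
    byte[] toBytes(Map<String, Object> document);
    Map<String, Object> fromBytes(byte[] content, int offset);
  }

  private final Map<Byte, RecordSerializer> serializersByVersion = new HashMap<>();

  public void register(RecordSerializer serializer) {
    serializersByVersion.put(serializer.version(), serializer);
  }

  public Map<String, Object> deserialize(byte[] stored) {
    RecordSerializer serializer = serializersByVersion.get(stored[0]); // version byte written first
    if (serializer == null) {
      throw new IllegalStateException("Unknown serializer version " + stored[0]);
    }
    return serializer.fromBytes(stored, 1);
  }
}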

acmeguy commented 10 years ago

Understood, thnx.

tglman commented 10 years ago

this is now dependent on: https://github.com/orientechnologies/orientdb/issues/2579

and can be implemented using the same format as the Record Schemaless Binary Serialization, replacing the header with a new one that uses a global property identifier.