[OEP 12] Go from "Clusters" to "Shards"

smolinari commented 7 years ago

Summary: From the start of ODB, the logical unit for separating and locating data has been called a "cluster". Unfortunately, this is the wrong term for the concept. If the analogy was taken from a disk operating system, where there are clusters, then the better term would actually have been "sector". But that term would make even less sense when discussing a database technology.

Within the distributed database concepts of today, a cluster is considered the entire distribution of the database.

https://www.techopedia.com/definition/17/clustering-databases

Definition - What does Clustering mean?

Clustering, in the context of databases, refers to the ability of several servers or instances to connect to a single database. An instance is the collection of memory and processes that interacts with a database, which is the set of physical files that actually store data.

Clustering offers two major advantages, especially in high-volume database environments:

Fault tolerance: Because there is more than one server or instance for users to connect to, clustering offers an alternative, in the event of individual server failure.
Load balancing: The clustering feature is usually set up to allow users to be automatically allocated to the server with the least load.

The proper term for the unit of data storage and location in a database scenario is a "shard".

Here is the definition:

https://en.wikipedia.org/wiki/Shard_(database_architecture)

A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.

Goals: Throughout the code and documentation, replace the term "cluster" with the term "shard", when referring to a single grouping of data and its location.

A class is made up of one or more shards.
A database is made up of a number of classes.
A database cluster is made up of 2 or more nodes, which can hold any number of shards.
A distributed database has sharding and shard balancing.
A distributed database can only be held within one cluster of nodes. The nodes, in turn, could be distributed globally.
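To make the proposed hierarchy concrete, here is a minimal sketch of the terms as plain data types. It is only an illustration of the vocabulary in the list above, not OrientDB code: the type names (`Shard`, `DbClass`, `Database`, `Node`, `Cluster`) and the node names are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shard:
    """A single grouping of data and its location (today's "cluster")."""
    name: str


@dataclass
class DbClass:
    """A class is made up of one or more shards."""
    name: str
    shards: List[Shard] = field(default_factory=list)


@dataclass
class Database:
    """A database is made up of a number of classes."""
    name: str
    classes: List[DbClass] = field(default_factory=list)


@dataclass
class Node:
    """A server node, which can hold any number of shards."""
    name: str
    shards: List[Shard] = field(default_factory=list)


@dataclass
class Cluster:
    """A database cluster is made up of 2 or more nodes."""
    nodes: List[Node]

    def __post_init__(self) -> None:
        if len(self.nodes) < 2:
            raise ValueError("a database cluster needs 2 or more nodes")


# Hypothetical example data, for illustration only.
usa = Shard("client_usa")
europe = Shard("client_europe")
client = DbClass("Client", [usa, europe])
db = Database("demo", [client])
cluster = Cluster([Node("node1", [usa]), Node("node2", [europe, usa])])
```

With this naming, "cluster" only ever refers to the group of nodes, and "shard" only ever refers to a unit of data.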

In the end, this change will raise the level of professionalism for OrientDB dramatically, because the docs and code logic will conform to the standard terms used for distributed database technology.

Non-Goals: None currently.

Success metrics: None currently.

Motivation: To conform better to today's terminology for distributed database technology, the term "cluster", as in "clusters make up a class", should be exchanged with the term "shard", as in "shards make up a class". This will help new DBAs and programmers find their way around the ODB architectural landscape faster. It will also make discussions around data separation, data locality and server architecture less ambiguous and confusing.

In the docs, the mixup is evident whenever the discussion involves both clusters as single units of grouped data and clusters of servers doing sharding.

Example:

When a query involves multiple shards (clusters), OrientDB executes the query against all the involved server nodes (Map operation) and then merge the results (Reduce operation).

The mention of (clusters) shouldn't be necessary.
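As a rough sketch of what that sentence from the docs describes, the query is run against each involved shard (the Map step) and the partial results are then merged (the Reduce step). The in-memory data and the function name `query_shards` below are assumptions for illustration; this is not how OrientDB implements distributed queries.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical in-memory stand-in for three shards of one class.
SHARDS: Dict[str, List[Record]] = {
    "client_usa": [{"name": "Ann", "age": 41}, {"name": "Bob", "age": 17}],
    "client_europe": [{"name": "Chiara", "age": 33}],
    "client_china": [{"name": "Wei", "age": 29}],
}


def query_shards(
    shards: Dict[str, List[Record]],
    predicate: Callable[[Record], bool],
) -> List[Record]:
    # Map: run the filter against every involved shard independently.
    partials = [
        [record for record in records if predicate(record)]
        for records in shards.values()
    ]
    # Reduce: merge the per-shard partial results into one result list.
    return [record for partial in partials for record in partial]


adults = query_shards(SHARDS, lambda record: record["age"] >= 18)
```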

See also the discussion with @luigidellaquila.

https://groups.google.com/forum/#!topic/orient-database/2_iTzne1eXo

Here is a good example of the mixup in terms: https://github.com/orientechnologies/orientdb/issues/6608

Description: See above.

Alternatives: None.

Risks and assumptions: Here is an example of ODB code with the term Cluster replaced with Shard: https://gist.github.com/smolinari/609335f498c456bd97c3ff86ad6136db

It took me all of 1 minute to replace the terms in this single file. I am not able to judge if the changes make sense directly, but it seems they would be ok.
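For what it's worth, a mechanical replacement like that can be sketched in a few lines. This is only an assumed approach (a case-preserving regex substitution), not the exact procedure used for the gist, and the identifiers in the usage note are illustrative.

```python
import re

# Match "cluster" in any casing, including inside longer identifiers.
_PATTERN = re.compile(r"cluster", re.IGNORECASE)


def _match_case(match: "re.Match[str]") -> str:
    """Return "shard" in the same casing style as the matched word."""
    word = match.group(0)
    if word.isupper():
        return "SHARD"
    if word[0].isupper():
        return "Shard"
    return "shard"


def rename_term(source: str) -> str:
    """Replace every occurrence of "cluster" with "shard", keeping case."""
    return _PATTERN.sub(_match_case, source)
```

For example, `rename_term("getClusterId CLUSTER_NAME clusters")` gives `"getShardId SHARD_NAME shards"`. A blind replacement like this would also rename any use of "cluster" that really means a cluster of servers, so the result still has to be reviewed by hand.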

Doing the same with the docs would be just as easy. However, the changes in the docs would need to be examined in more depth to be sure they still read correctly.

There are already two people willing to work on the docs, including myself.

Impact matrix

Note: I added two points to the matrix below.

[ ] Storage engine
[x] SQL
[ ] Protocols
[ ] Indexes
[ ] Console
[ ] Java API
[ ] Geospatial
[ ] Lucene
[ ] Security
[ ] Hooks
[ ] EE
[x] General Code Base
[x] Documentation

Eric24 commented 7 years ago

+1

diegomtassis commented 7 years ago

+1

However, you point to orientdb/issues/6608 (which I created) as a mixup of the 2 cluster word usages, but all the times the word is used there it has the semantics of shard.

luigidellaquila commented 7 years ago

Hi all

+1 to the general idea, the word cluster is definitely misleading. Anyway I don't like shard very much, as it has the same disadvantages of cluster: a shard is a (logical or physical) partition of data, so for example I can have a DB with four classes [A, B, C, D] and have two shards, one made of [A, B] and the other one made of [C, D]. Obviously in this case you won't have two files, but at least four. So calling a file shard will create the same exact confusion as for cluster.

What we are discussing here is the naming for files that contain data, IMHO the best thing is to give them a name that:

option 1: is specifically related to files, like datafile
option 2: is independent from the implementation (i.e. from the fact that we are talking about files) AND completely unrelated to the classical naming used in distributed architectures; a good candidate could be data segment; for the record, it was the original name of the current clusters

My 2 cents

Luigi

smolinari commented 7 years ago

@diegomtassis,

You did a great job describing the problem despite the terms. I just added it because I found the whole issue a good example of the terms being mixed in a problem that should actually only concern one or the other, nothing more.

@luigidellaquila,

Thanks for your insights. I see what you mean. "Data Segment" sounds good, but isn't the most elegant solution. Maybe just "segment"?

According to the sharding docs, for a distributed database, the clusters seem to be used just like shards. Is this true? Are there plans to use the current cluster concept and sharding differently, as in sharding as you explained it could be?

On the other hand, I personally don't think trying to relate the physical form of a cluster (i.e. files) to the logical form within ODB is necessary. It really doesn't matter how a cluster/data segment is stored physically to understand ODB's logical architectural structure.

The only other thing I can think of is "chunk", like MongoDB has, which is a smaller container under a shard. But that sounds even less elegant, and it's used by MongoDB. LOL! :smile:

How about just simply "partition"?

Scott

luigidellaquila commented 7 years ago

+1 for segment, also chunk is not bad, but I agree with you that it's less elegant.

I agree with you that relating the name to the implementation is not a good idea (btw, I think in the future we will have to re-think the 1-1 association of file-segment)

Thanks

Luigi

a-unite commented 7 years ago

+1 for segment

smolinari commented 7 years ago

Yeah, I am warming up to segment too. Let's try it in an example in the docs. :smile:

Segments

A Segment is a place where a group of records are stored. Like the Class, it is comparable with the collection in traditional document databases, and in relational databases with the table. However, this is a loose comparison given that unlike a table, segments allow you to store the data of a class in different physical locations.

To list all the configured segments on your system, use the SEGMENTS command in the console:

orientdb> SEGMENTS

SEGMENTS:
-------------+------+-----------+-----------+
 NAME        | ID   | TYPE      | RECORDS   |
-------------+------+-----------+-----------+
 account     | 11   | PHYSICAL  |      1107 |
 actor       | 91   | PHYSICAL  |         3 |
 address     | 19   | PHYSICAL  |       166 |
 animal      | 17   | PHYSICAL  |         0 |
 animalrace  | 16   | PHYSICAL  |         2 |
 ....        | .... | ....      |      .... |
-------------+------+-----------+-----------+
 TOTAL                                23481 |
--------------------------------------------+

Understanding Segments

By default, OrientDB creates one segment for each Class. Starting from v2.2, OrientDB automatically creates multiple segments per class (the number of segments created is equal to the number of CPU cores available on the server) to improve parallelism. All records of a class are stored in the same segment, which has the same name as the class. You can create up to 32,767 (or, 2^15 - 1) segments in a database. Understanding the concepts of classes and segments allows you to take advantage of the power of segments in designing new databases.

While the default strategy is that each class maps to one segment, a class can rely on multiple segments. For instance, you can spawn records physically in multiple locations, thereby creating multiple segments.

Not too shabby. But this is the place in the docs that has always bothered me the most.

Sharding

NOTE: Sharding is a new feature with some limitations. Please read them before using it.

OrientDB supports sharding of data at class level, by using multiple segments per class, where each segment has its own list of servers where data is replicated. From a logical point of view, all the records stored in segments that are part of the same class are records of that class.

The following example splits the class “Client” into 3 segments:

Class Client -> Segments [ client_usa, client_europe, client_china ]

This means that OrientDB will consider any record/document/graph element in any of such segments as “Clients” (Client class relies on such segments). In Distributed-Architecture each segment can be assigned to one or multiple server nodes.

Shards, based on segments, work against indexed and non-indexed class/segments.

Multiple servers per segment

You can assign each segment to one or more servers. If more than one server is listed, the records will be copied to all of the servers. This is similar to what RAID does for disks. The first server in the list will be the master server for that segment.

Hmmmmm......?

Reading that, I feel the term should be shard, unless the sharding within ODB is going to change. If that is the case, then this chapter would have to be rewritten anyway.

Let me try and rewrite that same part with shard as the term.

Sharding

NOTE: Sharding is a new feature with some limitations. Please read them before using it.

OrientDB supports sharding of data at class level, by using multiple shards per class, where each shard has its own list of servers. The difference from conventional shards is that, in ODB, shards could also be duplicated across different servers. From a logical point of view, all the records stored in shards that are part of the same class are records of that class.

The following example splits the class “Client” into 3 shards:

Class Client -> Shards [ client_usa, client_europe, client_china ]

This means that OrientDB will consider any record/document/graph element in any of such shards as “Clients” (Client class relies on such shards). In Distributed-Architecture each shard can be assigned to one or multiple server nodes.

Shards, based on shards, work against indexed and non-indexed class/shards.

Multiple servers per shard

You can assign each shard to one or more servers. If more than one server is listed, the records will be copied to all of the servers. This is similar to what RAID does for disks. The first server in the list will be the master server for that shard.
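As a small illustration of the "multiple servers per shard" paragraph above, the assignment can be pictured as a mapping from each shard to an ordered list of servers, where the first entry is the master. The server names and helper functions are hypothetical and only sketch the idea; they are not OrientDB's configuration format or API.

```python
from typing import Dict, List

# Hypothetical assignment: shard name -> ordered list of servers.
# The first server in each list is treated as the master for that shard.
SHARD_SERVERS: Dict[str, List[str]] = {
    "client_usa": ["usa1", "usa2"],
    "client_europe": ["europe1"],
    "client_china": ["china1", "usa2"],
}


def master_of(shard: str) -> str:
    """Return the master server, i.e. the first server in the shard's list."""
    return SHARD_SERVERS[shard][0]


def copies_of(shard: str) -> List[str]:
    """Return every server that holds a copy of the shard's records."""
    return list(SHARD_SERVERS[shard])


def shards_on(server: str) -> List[str]:
    """Return the shards stored on a given server, sorted by name."""
    return sorted(
        shard for shard, servers in SHARD_SERVERS.items() if server in servers
    )
```

Here `master_of("client_usa")` is `"usa1"`, and `shards_on("usa2")` lists both `client_china` and `client_usa`, since one server can hold copies of several shards.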

This sentence in the rewritten version makes no sense:

Shards, based on shards, work against indexed and non-indexed class/shards.

That would need to be reworked. I also have no idea what it means in the original form. :smile:

Scott

orientechnologies / orientdb-labs