Default Clustering in OrientDB 2.2b

orientechnologies / orientdb

Default Clustering in OrientDB 2.2b #5773

Closed healiseu closed 8 years ago

healiseu commented 8 years ago

Hi, I have downloaded OrientDB 2.2b. I have a problem with the clustering of records. I am reading the manual of 2.2 and it says

By default, OrientDB creates one cluster for each Class. Starting from v2.2, OrientDB automatically creates multiple clusters per each class (the number of clusters created is equals to the number of CPU's cores available on the server) to improve using of parallelism. .... While the default strategy is that each class maps to one cluster, a class can rely on multiple clusters.

I have created a new Graph Database. Then I checked the schema in OrientDB studio and I found that each class is mapped to two clusters. This was not the default case in previous versions of OrientDB.

Since I am playing a lot with RIDs, it is important to have the cluster ID constant because it also represents a Class. Could you please instruct me how to achieve the default clustering that I used to have in OrientDB 2.1.x ?
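To illustrate the concern, here is a minimal Python sketch of how a RID splits into a cluster ID and a record position, and why the cluster ID alone stops identifying a class once a class owns more than one cluster. The class names and the `CLUSTER_TO_CLASS` mapping are hypothetical values made up for illustration, not output from a real database, and the parser only handles persistent RIDs of the form `#<cluster>:<position>`.

```python
import re
from typing import NamedTuple


class RID(NamedTuple):
    cluster_id: int
    position: int


# Matches persistent record IDs such as "#12:0" (temporary negative IDs are not handled).
_RID_PATTERN = re.compile(r"#([0-9]+):([0-9]+)")


def parse_rid(text: str) -> RID:
    """Split a RID string into its cluster ID and record position."""
    match = _RID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a record ID: {text!r}")
    return RID(int(match.group(1)), int(match.group(2)))


# Hypothetical schema: with two clusters per class, two cluster IDs map to one class.
CLUSTER_TO_CLASS = {11: "Account", 12: "Account", 13: "Invoice", 14: "Invoice"}


def class_of(rid: RID) -> str:
    """Look up the class through the cluster-to-class mapping."""
    return CLUSTER_TO_CLASS[rid.cluster_id]
```

With such a mapping, `#11:5` and `#12:0` both resolve to the same class even though their cluster IDs differ, so code that treats the cluster ID as a class ID needs a lookup step like `class_of`.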

Thank you

PS1: Although it is difficult to explain now, I think it would be better if you had a separate identifier for each class instead of depending on cluster numbers.

PS2: OrientDB 2.2b is faster !

healiseu commented 8 years ago

OK, this is what I tried in order to solve my problem:

OrientDB Studio / Database Management (DB) / Configuration

I manually set clusterSelection to Default and clicked SAVE. I then tried to create a new class, but it was still created with round-robin.

What I would like to do is to set this with an SQL command such as ALTER DATABASE CLUSTERSELECTION Default

The command executed successfully, but the default cluster selection strategy for a newly created class remained round-robin.

healiseu commented 8 years ago

Solution

I had to compare it with OrientDB 2.1.x. The configuration parameter that needs updating is minimumClusters. This can be done with the following SQL command at the database level.

ALTER DATABASE MINIMUMCLUSTERS 1

BTW: This Create Class manual page has a reference to this.

By default OrientDB creates 1 cluster per class, but this can be changed by setting the property minimumclusters at database level.

Nevertheless, the default in 2.2b is that OrientDB creates 2 clusters per class.

lvca commented 8 years ago

Using multiple clusters on multi-core machines increases read/write performance.

healiseu commented 8 years ago

Sure, but as I said, I am playing with RIDs, and you made them dependent on these cluster numbers. Ideally I would like to have a constant CLASS number to represent my class. See my work with AtomicDB and the 4D vector that represents ANY DATA ITEM in the database without names!

DeividasJackus commented 8 years ago

@lvca distributed mode and splitting data into new clusters does indeed assign new RIDs to records. Should this not be expressly noted in the docs?

The current conceptual description of RIDs gives the impression that they are permanent, which is not the case. In fact, this might cause severe trouble if RIDs are exposed outside the application (e.g. users sharing links containing a RID that might change).

healiseu commented 8 years ago

@sinfex I think this is a good reason why I suggested that you should also have independent class numbers that can be related to cluster numbers. You may find many cases where you need this logic. In fact, in my metadata framework, S3DM/R3DM, it is fundamental to have a constant integer to represent each class.

smolinari commented 8 years ago

I personally would never allow users outside of ODB (or outside of the application) to see RIDs. It is clearly an ODB internal id and it would be an architectural mistake to have them "out in the open" or to base any kind of logic on them. They belong to ODB and at most, can only be used as what they are, references to data records.

Scott

DeividasJackus commented 8 years ago

@smolinari what about when you decide to develop a CRUD API? Why not have your application make RIDs url-safe (e.g. transform #12:0 into 12_0) so you could work on records directly?

GET /accounts/11_1234/invoices/12_2525 seems clean, concise and allows the fastest record access possible. If you're developing an API using JSON, I'm unable to see this being an architectural mistake.
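The RID-to-URL transformation described above could look roughly like the following Python sketch. The function names `rid_to_slug` and `slug_to_rid` are made up for illustration; the strict patterns are an assumption about how input might be validated before it reaches a query.

```python
import re

# Strict ASCII-digit patterns: anything else is rejected before it reaches the database.
_RID = re.compile(r"#([0-9]+):([0-9]+)")
_SLUG = re.compile(r"([0-9]+)_([0-9]+)")


def rid_to_slug(rid: str) -> str:
    """Turn a RID such as '#12:0' into a URL-safe slug such as '12_0'."""
    match = _RID.fullmatch(rid)
    if match is None:
        raise ValueError(f"not a record ID: {rid!r}")
    return f"{match.group(1)}_{match.group(2)}"


def slug_to_rid(slug: str) -> str:
    """Turn a URL slug such as '12_0' back into the RID '#12:0'."""
    match = _SLUG.fullmatch(slug)
    if match is None:
        raise ValueError(f"not a RID slug: {slug!r}")
    return f"#{match.group(1)}:{match.group(2)}"
```

For the URL above, `slug_to_rid("12_2525")` yields `#12:2525`, while a path segment containing anything other than two digit groups raises `ValueError` and can be answered with a client error.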

Validate record IDs, classes and access. Validate input, use CONTENT/MERGE in CREATE/DELETE VERTEX and UPDATE calls, filter output. Leanest layer between OrientDB and an SPA/app possible?

a-unite commented 8 years ago

can only be used as what they are, references to data records.

To support what @sinfex said: if you use a synthetic ID, it adds overhead and requires an extra index lookup to address records, yet you are still exposing data outside. The only difference is that RIDs are incremental, so it is easy to guess the ID of another record. In addition to using strong access-rights rules, what we do to reduce the temptation to probe random RIDs is hash them in every API response, so URLs look like GET /data/NkJs4/

smolinari commented 8 years ago

@sinfex - yes, you can and should use the IDs for references (as I said). But, like a-unite said, I too believe the RIDs should be obfuscated. So, we also hash the RID. A good hashing system is also hashids.

http://hashids.org/

It can take both numbers of the RID and hash them.

We are also working with an 11 letter hash. It allows for 32K clusters and around 7 trillion records (before the algorithm jumps to making hashed Ids with 12 letters). The only reason we have gone with 11, instead of starting with a smaller number, is we want uniformity in the hashed Id.
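As a rough illustration of encoding both RID numbers into one fixed-width token, here is a standard-library Python sketch. It is not the hashids algorithm and does not reproduce the capacity figures quoted above: the bit widths, the XOR mask, and the base-62 alphabet are all assumptions chosen for the example, and the result is reversible obfuscation with no security value.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 symbols
WIDTH = 11            # fixed token length, as in the comment above
CLUSTER_BITS = 15     # assumption: cluster IDs below 32768
POSITION_BITS = 48    # assumption: record positions below 2**48
TOTAL_BITS = CLUSTER_BITS + POSITION_BITS
# Arbitrary mask standing in for a "salt"; it only scrambles, it does not protect.
MASK = 0x5A5A5A5A5A5A5A5A & ((1 << TOTAL_BITS) - 1)


def obfuscate(cluster_id: int, position: int) -> str:
    """Pack both numbers into one integer, scramble it, and encode it in base 62."""
    if not 0 <= cluster_id < (1 << CLUSTER_BITS):
        raise ValueError("cluster_id out of range")
    if not 0 <= position < (1 << POSITION_BITS):
        raise ValueError("position out of range")
    number = ((cluster_id << POSITION_BITS) | position) ^ MASK
    chars = []
    for _ in range(WIDTH):  # always emit WIDTH digits so every token has the same length
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def deobfuscate(token: str) -> tuple[int, int]:
    """Reverse obfuscate(); raises ValueError for malformed tokens."""
    if len(token) != WIDTH:
        raise ValueError("wrong token length")
    number = 0
    for char in token:
        number = number * len(ALPHABET) + ALPHABET.index(char)
    if number >= (1 << TOTAL_BITS):
        raise ValueError("token out of range")
    number ^= MASK
    return number >> POSITION_BITS, number & ((1 << POSITION_BITS) - 1)
```

Because 2^63 is smaller than 62^11, every packed value fits in 11 characters, which gives the uniform length mentioned above. Anyone who sees the code or enough tokens can reverse it, so access checks are still required.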

Scott

DeividasJackus commented 8 years ago

@smolinari the only reason I did not reference HashIDs is that it is not a hashing system; it's an obfuscation library, and its obfuscation can be reversed rather easily without knowing the salt. I would therefore disagree with your suggestion to use it for absolutely anything but ID uniformity. Uniformity, which to an extent RIDs can give you all on their own.

Your application needs to make sure the requesting entity has access to ID aazwsx1234 prior to performing anything on it just like it would have to verify access were the requested ID equal to 12_345. From where I stand, unless you use proper asymmetric encryption, all you're gaining from your approach is CPU overhead and eye candy. In fact, if I'm a perpetrator getting a whiff of hashids, I'm now only more intrigued to hack your system because I know you're trying to hide something in an insecure manner.

I remain unconvinced that there's real application to attempting to hide your RIDs unless you intend to harden the process of establishing a rough size estimate of your data set. If you're going for security through obscurity, you might as well just return 404 Not Found instead of 401 Unauthorized when someone attempts to access a RID they're not allowed to.
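The point about checking access regardless of ID format, and answering 404 for records the caller may not see, can be sketched in Python as follows. The `RECORDS` store, the `owner` field, and the `fetch` function are hypothetical stand-ins for a real database and permission model.

```python
from http import HTTPStatus

# Hypothetical in-memory store keyed by URL-safe record ID.
RECORDS = {"12_345": {"owner": "alice", "total": 100}}


def fetch(record_id: str, user: str):
    """Return (status, record); hide the difference between 'missing' and 'not yours'."""
    record = RECORDS.get(record_id)
    if record is None or record["owner"] != user:
        # Same answer in both cases, so a caller cannot probe which IDs exist.
        return HTTPStatus.NOT_FOUND, None
    return HTTPStatus.OK, record
```

Here `fetch("12_345", "bob")` and `fetch("99_1", "alice")` return the same 404 response, so guessing sequential IDs reveals nothing about which records exist.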

smolinari commented 8 years ago

If a hacker wants to find the real RID, fine. Completely encrypting the RID isn't the intention of hashids. It is only obfuscation. Obviously, there has to be some form of access controls above and beyond the RID obfuscation too.

Scott