[OEP_14] Namespaces

SDIPro commented 7 years ago

Summary:

We need a way to isolate user data on a large scale while still being able to use class inheritance and polymorphism. Think multi-tenancy.

A namespace is the current way to describe the concept, but we're open to better names/concepts.

Goals:

Isolate user data so that all related clusters, indices, etc. can be moved together, efficiently, and can reside in a different physical location (disk partition).
Have a way to limit CRUD operations and queries to the records residing in the namespace only.
Support class inheritance of classes residing within the namespace from classes in the top-level domain.
Support separate indices created against classes residing within the namespace, storing them inside the namespace, to not pollute the top-level indices.
Support users and roles per namespace.

Motivation:

The primary motivator is to support multi-tenancy efficiently. Currently, we support multi-tenancy in three different ways:

Use a separate database for each client.
Use a separate cluster for each client's class.
Use an identifier within each record to mark it as belonging to a specific client.

Separate Database

There are multiple problems with using separate databases for this, though it does provide good data isolation as well as better security:

Even a small number of clients (100) would make maintaining so many different databases a huge nuisance. Many customers have 1000s of clients.
Making a small schema change would require updating all databases.
Accessing a client's data would require opening a new database connection for every client. This is not efficient and prevents using a connection pool.

Separate Cluster

Today, we can store each client's data in a separate named cluster. The primary problems with this are:

Querying via cluster name loses the advantage of using polymorphic queries on a base class.
Limiting records to a single cluster prevents the storage engine from spreading the load across multiple clusters when running with multiple threads.
Likewise, limiting records to a single cluster per class affects the distributed system's ability to partition data efficiently.
It's currently not possible to place an index against an individual cluster.

Unique Identifier

Using a unique identifier property to separate client records is a common pattern. However, there are multiple problems with it as well:

If a common class is used by all clients and each client has millions of records of data, a single class could end up storing billions and billions of records. This ultimately reduces efficiency.
That unique identifier must be indexed, and with billions of records, retrieving the records for a single client or adding new records to the index can take longer than it should.
If all client data is stored in the same class clusters, there is no way to isolate the records on a per-client basis.

Description:

Implementation

One possible solution is to create quasi sub-databases under the main directory for a database, each representing a namespace:

/databases/MyDatabase
    /American
    /Lufthansa
    /Southwest

Each sub-directory representing a namespace could contain all the clusters and indices, etc., for a specific namespace without polluting the top-level database files (making them grow extremely large).

You could still use the top-level database classes (especially with inheritance) but your CRUD ops and queries could be isolated to just the namespace content.

Clusters would never be shared between namespaces nor between a namespace and the top-level database files. What this means is that if cluster #20 is used in the Lufthansa namespace, #20 is not also used in the top-level database nor by any other namespace.

But you could still link between records that reside in a namespace and records in the top-level database.
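A minimal sketch of the "clusters are never shared" rule, assuming a single database-wide counter hands out cluster ids and records which namespace owns each one. The `ClusterRegistry` class and its methods are hypothetical names for illustration and are not part of OrientDB.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


class ClusterRegistry:
    """Hypothetical allocator: cluster ids are unique across the whole database."""

    def __init__(self, db_root: str) -> None:
        self.db_root = PurePosixPath(db_root)
        self._next_id = 0
        # cluster id -> owning namespace (None means the top-level database)
        self._owner: Dict[int, Optional[str]] = {}

    def allocate(self, namespace: Optional[str] = None) -> int:
        """Reserve the next free id for one namespace (or the top level)."""
        cluster_id = self._next_id
        self._next_id += 1
        self._owner[cluster_id] = namespace
        return cluster_id

    def directory_for(self, cluster_id: int) -> PurePosixPath:
        """Directory that would hold the files of the given cluster."""
        owner = self._owner[cluster_id]
        return self.db_root if owner is None else self.db_root / owner


registry = ClusterRegistry("/databases/MyDatabase")
top = registry.allocate()             # top-level cluster
luf = registry.allocate("Lufthansa")  # lives under /Lufthansa
amr = registry.allocate("American")   # lives under /American
```

Because every id comes from the same counter, an id used by Lufthansa can never also appear at the top level or in another namespace, which is what keeps record links between the two unambiguous.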

Queries

The syntax needs to be chosen (Lufthansa:MyClass, Lufthansa.MyClass, or Lufthansa::MyClass or whatever), but "select from Lufthansa:MyClass where Age > 50" would only search on the MyClass class contained within the 'Lufthansa' namespace directory.

Using the normal "select from MyClass..." would only return the records contained in the top-level directory.
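To make the target rule concrete, here is a small sketch that assumes the `Namespace:Class` form is the one chosen (the separator is still undecided above). `parse_target` and `Target` are hypothetical helpers, not an existing OrientDB API.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    namespace: Optional[str]  # None means the top-level database
    class_name: str


def parse_target(target: str) -> Target:
    """Split a query target such as 'Lufthansa:MyClass' into its parts."""
    namespace, sep, class_name = target.rpartition(":")
    if not sep:
        # No separator: the class is looked up in the top-level database only.
        return Target(None, target)
    if not namespace or not class_name:
        raise ValueError(f"malformed target: {target!r}")
    return Target(namespace, class_name)


qualified = parse_target("Lufthansa:MyClass")
unqualified = parse_target("MyClass")
```

One thing to weigh when picking the separator: OrientDB SQL already accepts colon-prefixed targets such as cluster:<name>, so a single colon would need to be disambiguated from those.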

Inheritance

A class created in a namespace should be able to inherit from a class that's been created in the top-level database. The advantage to this is being able to produce schema that's used by all clients. Modifying the schema happens in only one place instead of in 1000 different databases.
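The "modify the schema in one place" property can be sketched as follows, assuming a namespace class resolves its properties through its superclass at lookup time. `SchemaClass` is a hypothetical model, and `VipCustomer`, `loungeAccess`, and the other property names are made up for the example.

```python
from typing import Dict, Optional


class SchemaClass:
    """Hypothetical schema entry; namespace=None means top-level."""

    def __init__(
        self,
        name: str,
        superclass: Optional["SchemaClass"] = None,
        namespace: Optional[str] = None,
    ) -> None:
        self.name = name
        self.superclass = superclass
        self.namespace = namespace
        self.own_properties: Dict[str, str] = {}

    def properties(self) -> Dict[str, str]:
        """Own properties merged over everything inherited from superclasses."""
        inherited = self.superclass.properties() if self.superclass else {}
        return {**inherited, **self.own_properties}


# Shared schema defined once in the top-level database.
customer = SchemaClass("Customer")
customer.own_properties["name"] = "STRING"

# Namespace-specific class deriving from the shared one.
vip = SchemaClass("VipCustomer", superclass=customer, namespace="Lufthansa")
vip.own_properties["loungeAccess"] = "BOOLEAN"

# A later change to the shared class is visible in every namespace.
customer.own_properties["email"] = "STRING"
```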

Data Isolation

Data isolation is easily supported since all data related to a namespace resides in its own directory.

The namespace could even be on a different drive if the sub-directory is linked to a different partition.

Thinking of the distributed mode, another advantage to using a namespace is that all client data could reside on a single node (not including replication).

Security

It would be nice if each namespace could have its own set of users and roles, especially when using ORestricted.

Alternatives:

We do nothing.

Risks and assumptions:

Since implementing this concept properly will involve almost all modules, the risks are the time required to do it properly and then missing a piece and either introducing bugs as a side-effect or missing support in a critical area.

Impact matrix

In one way or another, this will impact almost all systems.

smolinari commented 7 years ago

I definitely vote 👍 for this, if it is necessary.

I also have two additional points, one about this suggestion and another a bit off-topic.

First the off-topic. There is also one other issue from a multi-tenancy perspective, which I see as a deterrent to using single databases with multiple tenants. It's the limit on the number of classes ODB offers (which I believe will be improved in 3.0??) along with the new "cluster per core" change in 2.2. This change really reduces the ability to offer a client a good number of classes to expand and customize their database.

Let's say we have an 8 core system. That reduces the number of available classes to about 4000. 1000 classes per client would be a minimum I'd want to offer and to give clients enough room to expand. This means only 4 clients can be added to a single database. That isn't really an acceptable scenario. Our plan was to set the cores per cluster to something smaller, but this means that setting is set for the life of the database (AFAIK), which is sort of a no-go too.
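The arithmetic behind those numbers, as a quick sketch. It assumes a database-wide limit of 32,767 clusters (the "32K" figure mentioned in this thread) and one cluster per core for every class; both constants are assumptions for illustration.

```python
CLUSTER_LIMIT = 32_767      # assumed database-wide cluster limit ("32K")
CORES = 8                   # one cluster per core for every class
CLASSES_PER_CLIENT = 1_000  # headroom each client should get


def max_classes(cluster_limit: int, clusters_per_class: int) -> int:
    """How many classes fit when each class consumes several clusters."""
    return cluster_limit // clusters_per_class


def max_clients(classes: int, classes_per_client: int) -> int:
    """How many clients fit when each one is given a fixed class budget."""
    return classes // classes_per_client


classes = max_classes(CLUSTER_LIMIT, CORES)         # 4095, i.e. "about 4000"
clients = max_clients(classes, CLASSES_PER_CLIENT)  # 4
```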

The second point is, I do also believe there is one other possibility for partitioning data within an ODB database, which I was considering. Partitioning could be done through the user access system that ODB offers.

So, to set up partitions, a DBA would simply (and using your example)

Set up an "airlines" database
Set up V and E and any other class to extend ORestricted.
Set up the "airline" users (e.g. "american_admin", "lufthansa_admin", "southwest_admin") with admin permissions (or whatever is needed) and connect to the database using those users per client. Since the users are at DB level, the use of a connection pool should be fine. (Do correct me if I am wrong.)
Once connected, any classes are created by those particular users and thus belong only to them.

I am also not sure how much using ORestricted for partitioning would affect performance, but I believe that using it with the specially created users will offer a good data partitioning system.

Scott

SDIPro commented 7 years ago

Hi Scott, thanks for the reply!

Your points are valid and well known. Even when the number of clusters in 3.0 can exceed 32K, we still have to consider the performance impact of having so many. I did some testing about a year ago with 10K clusters and system performance wasn't great. I need to revisit this to see where we can improve things.

Using ORestricted is the current suggested way for partitioning client data, and it works. Unfortunately, it doesn't provide true data separation (storing the records on different devices) nor does it help with limiting the number of records per class as well as the size of the indices on each class.

We shouldn't have to lookup an RID in an index that's used by 10,000 other clients too.

Thanks for the input.

-Colin

smolinari commented 7 years ago

So, in your multitenant scenario, customers would be sharing the same classes too? Interesting. We were planning on giving them their own set of standard classes. That would help avoid the issues you've just mentioned, I believe. Then there is only the physical separation, which theoretically, could be done with sharding of the clusters.

Don't get me wrong, to me that is all a bunch of workarounds. I certainly hope this idea catches on, as it would make multitenancy a whole lot simpler.

Scott

SDIPro commented 7 years ago

Hi Scott,

Here was my idea. Let's say we define a top-level class called Customer that has a schema defined on it that will be used across all customers.

We'd create a namespace for every customer. You then could do something like this:

insert into HappyCustomer:Customer set shoeSize = "46";

The actual Customer records for 'HappyCustomer' would be stored under the namespace (sub-database). So, querying select from HappyCustomer:Customer would be limited to just the records contained in the namespace's directory for Customer.

This would improve performance greatly and provide data isolation.

The cool thing is that the Customer schema could be used by all the namespaces, so there's only one place to update the class schema for Customer.

You'd also be able to create namespace-specific derived classes, if you wanted.

The main problem with creating classes for every customer is maintaining them. If you have to create multiple classes for each customer (say HappyCustomerCustomer, HappyCustomerInvoice, etc.) and you have 100s or 1000s or 10,000s of customers, it's a HUGE pain to try to maintain that without some automated tools to do it. It's much cleaner to just create namespaces. We could provide some tools in Studio to manage/search/update namespaces efficiently.

smolinari commented 7 years ago

Could clusters be used for the same thing?

insert into Customer cluster HappyCustomer set shoeSize = "46";

Also, how could customer specific or customer customized schema be added to your idea?

Scott

SDIPro commented 7 years ago

The three problems with using clusters are that we don't currently support creating indices per cluster, you lose polymorphism, and you lose the benefit of distributing load across multiple clusters on multi-core systems as well as partitioning data for distributed mode. This should all be automatic without the user having to worry about it.

SDIPro commented 7 years ago

Another cool benefit of a namespace is that we could more easily define for the distributed system where the namespace data resides without having to go down to the cluster level as we do today to define on which server each cluster lives.

Today, if we had 10 distributed nodes and we queried on an index, we'd have to query all 10 to find all related records. By using namespaces, we'd be able to query just the nodes where the namespace partitions live and only query those indices.
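A rough sketch of that routing decision, assuming the cluster keeps a map from namespace to the nodes holding its partitions. The node names and the `PLACEMENT` map are invented for the example; replication is ignored.

```python
from typing import Dict, List, Optional, Set

# Hypothetical cluster of 10 nodes.
ALL_NODES: List[str] = [f"node{i}" for i in range(1, 11)]

# Hypothetical placement: which nodes hold each namespace's partitions.
PLACEMENT: Dict[str, Set[str]] = {
    "Lufthansa": {"node2", "node7"},
    "American": {"node4"},
}


def nodes_to_query(namespace: Optional[str]) -> List[str]:
    """Nodes whose indices must be consulted for an index lookup."""
    if namespace is None:
        # Without a namespace, every node may hold matching records.
        return list(ALL_NODES)
    # With a namespace, only the nodes that host it are contacted.
    return sorted(PLACEMENT[namespace])


fan_out_today = nodes_to_query(None)
fan_out_namespaced = nodes_to_query("Lufthansa")
```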

smolinari commented 7 years ago

Agreed. I'm sold.

How about customer specific schema?

Scott

SDIPro commented 7 years ago

I think I know what you mean, but mind providing an example?

smolinari commented 7 years ago

Let's say, in your customer class example, that the clients of the system want to add their own properties to the customer class. Is that possible with your multitenant idea?

Scott

SDIPro commented 7 years ago

I haven't given that a lot of thought, but I see no reason why it couldn't, especially if we need to define indices on that class per namespace. I think it should be supported, yes.

a-unite commented 7 years ago

@smolinari I'm not sure if you noticed it, but @SDIPro wrote:

You could still use the top-level database classes (especially with inheritance) but your CRUD ops and queries could be isolated to just the namespace content.

@SDIPro As for me, your original description is excellent and covers everything perfectly.

The only other feature I wish we could add here is the ability to pin a Namespace right at the connection phase. Then we would not have to name the Namespace in every class reference (like Lufthansa:MyClass), but could specify our default namespace just once, in one place at the start (again, for example when connecting).

And... (damn it, I started to speak already) moreover, we could define namespaces as a hierarchy of more than two levels.

Then explicit access to meta and data defined by other level namespaces (relative to current default namespace) could be accessible with syntax like Airlines::Lufthansa:MyClass (for namespace two levels under the current one) or @parent::@parent::MyClass (for grandparent of the current namespace) or ::NSLevel1::NSlevel2:MyClass (for second after super-namespace level classes).
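The three reference forms above could be resolved roughly like this. It is only a sketch of one possible reading: `::` separates namespace levels, a leading `::` means "start from the top", `@parent` moves one level up, and either `:` or `::` may precede the class name (the examples above use both). `resolve` is a hypothetical helper.

```python
from typing import List, Tuple


def resolve(current: Tuple[str, ...], ref: str) -> Tuple[Tuple[str, ...], str]:
    """Return (absolute namespace path, class name) for a class reference.

    `current` is the session's namespace as a tuple of levels; () is top-level.
    """
    ns_part, _, class_name = ref.rpartition(":")
    ns_part = ns_part.rstrip(":")  # tolerate '::' right before the class name
    # A leading '::' makes the reference absolute; otherwise start from `current`.
    path: List[str] = [] if ref.startswith("::") else list(current)
    for segment in filter(None, ns_part.split("::")):
        if segment == "@parent":
            if not path:
                raise ValueError("already at the top-level namespace")
            path.pop()
        else:
            path.append(segment)
    return tuple(path), class_name


nested = resolve((), "Airlines::Lufthansa:MyClass")
grandparent = resolve(("Airlines", "Lufthansa", "Cargo"), "@parent::@parent::MyClass")
absolute = resolve(("Airlines",), "::NSLevel1::NSlevel2:MyClass")
```

The `Cargo` level in the usage lines is made up to give `@parent::@parent` something to climb out of.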

And taking into account your security statement:

It would be nice if each namespace could have its own set of users and roles, especially when using ORestricted.

Every namespace could define access rights for users from higher-level namespaces (full access by default), sibling namespaces (no access by default) and so on. Since access could be defined only for already existing levels, write access to an upper-level namespace from the current one should be prohibited and could be obtained only when connected directly to it.
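Those defaults could be expressed as a small rule table. This is a sketch under assumptions: namespaces are modelled as tuples of levels, and a user looking "upwards" gets read-only access (the text above only says that write access is prohibited, so "read" is my assumption). `default_access` is a hypothetical function.

```python
from typing import Tuple

Namespace = Tuple[str, ...]  # () is the top-level namespace


def is_ancestor(a: Namespace, b: Namespace) -> bool:
    """True when namespace `a` is a strict ancestor of namespace `b`."""
    return len(a) < len(b) and b[: len(a)] == a


def default_access(user_ns: Namespace, target_ns: Namespace) -> str:
    """Default rights of a user homed in `user_ns` on data in `target_ns`."""
    if user_ns == target_ns or is_ancestor(user_ns, target_ns):
        return "full"  # own namespace, or a higher-level user looking down
    if is_ancestor(target_ns, user_ns):
        return "read"  # assumption: upper levels are readable but not writable
    return "none"      # siblings and unrelated branches


down = default_access(("Airlines",), ("Airlines", "Lufthansa"))
up = default_access(("Airlines", "Lufthansa"), ("Airlines",))
sideways = default_access(("Airlines", "Lufthansa"), ("Airlines", "American"))
```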

So, I'd prefer that:

Using the normal "select from MyClass..." would only return the records contained in the top-level directory.

will read as

Using the normal "select from MyClass..." would only return the records contained in the namespaces\directories that the current level directly inherits from and, if permitted, those nested under it, but not in sibling branches.

Using the normal "update Class" would affect meta of the current namespace classes only.

Again, I hope this might make things much easier since all our current code could still work with namespaces, but without a need for massive refactoring.

Thanks, Ata

a-unite commented 7 years ago

Ah, sorry, I missed the rest of the discussion while I was writing my previous message.... ))

SDIPro commented 7 years ago

Hi @a-unite. Thanks for the comments! Good suggestions. Many months ago, Luigi and I had discussed something similar to your "pin namespace" comment. I think we'd come up with something (in SQL) like "using MyCustomer", as so many other languages use for actual namespaces. But we're open to ideas, and it's a very good concept to include.

Adding namespace support when connecting is also a good idea and would make supporting/enforcing the per-namespace credentials easier and, as you mentioned, to keep existing code working if a namespace is used during the connection step.
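A sketch of how a pinned namespace could keep existing code working, assuming the `Namespace:Class` form and a session object that remembers the default. `Session`, `use`, and `qualify` are hypothetical names; credential checks when switching namespaces are deliberately not modelled.

```python
from typing import Optional


class Session:
    """Hypothetical session that remembers a default namespace."""

    def __init__(self, default_namespace: Optional[str] = None) -> None:
        self.default_namespace = default_namespace

    def use(self, namespace: Optional[str]) -> None:
        """Counterpart of a 'using MyCustomer' statement."""
        self.default_namespace = namespace

    def qualify(self, target: str) -> str:
        """Add the default namespace to an unqualified class name."""
        if ":" in target or self.default_namespace is None:
            return target  # already qualified, or no default pinned
        return f"{self.default_namespace}:{target}"


session = Session("Lufthansa")            # namespace pinned at connection time
pinned = session.qualify("MyClass")       # existing query text stays unchanged
explicit = session.qualify("American:MyClass")
session.use("MyCustomer")
switched = session.qualify("MyClass")
```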

I'll have to think about the multi-level namespaces. It crossed my mind in the original design, and then I tossed it because of potential end-user complexity. But, it may be a very good concept (especially your multi-level security example) to support while we're refactoring everything.

a-unite commented 7 years ago

Hi, @SDIPro, thanks!

As for use Namespace: we could also have a database-level setting, "default Namespace" (which might be useful anyway), or fall back to the top-level namespace by default. However, as I believe you have already noticed, if we change the namespace by invoking use after connecting to the database, we will run into a conflict, because credentials are local to each namespace\subdatabase. So we will need to authorise against the local namespace on every use operation.

Apart from the multi-level namespaces as a more general idea (in my opinion), I also hope, that this will help us with additional options for distributed\parallel computing. You have already mentioned this, though.

Anyway, "namespaces" is a must-have feature that could return and sort out what old clusters were supposed to provide in old times when they were able to hold data from many classes, while originally intended to separate data between different files.

SDIPro commented 7 years ago

Hi @a-unite. I agree with everything you said.

We'll definitely have to sort out the use of credentials for each namespace.

Thanks, again.

-Colin

smolinari commented 7 years ago

To avoid needing to call namespaces in the SQL, there would need to be some sort of identification of who is wanting what from ODB, so we might be back to needing ODB's access system and ORestricted in some way, possibly?

The other thing that might help get this done is to provide use cases for non-multitenant purposes. How could namespaces help other customers? I am sure that is something Luca would be looking at.

That is also why I think as much "standard stuff" should be used within ODB to get the data partitioning done. Because, if there are no real use cases for the standard user, it's going to be hard for Luca to put resources on it. It is either that, or show there will be enough business with DBaaS or PaaS to make it worthwhile for ODB.

How about this idea?

Use abstract classes as superclasses across/above all ORestricted-defined classes, but use a special definition when creating users to let ODB know that it needs to store the indexes and data in a more partitioned fashion (or maybe it could become the standard?). That way, if an abstract superclass, like in the Airlines example, gets a schema change, it automatically affects all the child ORestricted classes beneath it. Just brainstorming here.... 😄

Scott

SDIPro commented 7 years ago

Hi Scott,

Not bad ideas, and I agree that we want to reuse as much existing "standard stuff" as possible to save time, reduce incompatibilities, etc.

I think with namespaces, ORestricted will be used as well, same with security. They'll just be per-namespace.

Abstract classes are one possibility, but I need to be able to create new classes within each namespace that derive from existing classes in the top-level domain.

Keep the ideas coming. :-)

Thanks,

-Colin

smolinari commented 7 years ago

I need to be able to create new classes within each namespace that derives from existing classes in the top-level domain.

Yup, that is exactly what I want too - that these "global" abstract classes would be the parent or super classes to any child classes created by the individual users/clients for their partitions (or namespaces). These abstract global super classes (cool name that is, eh?!!!) would basically only be containers for the standard/ global schema. If the schema of one of these global abstract classes is changed, then all the child classes from the different users (or namespaces) would also get that schema change automatically.

Scott

andrii0lomakin commented 7 years ago

@SDIPro Cool idea, I really support it.

devsprint commented 7 years ago

Hi, has any work been done to implement namespace support, or is it planned for a future release?

smolinari commented 7 years ago

@devsprint - make sure you 👍 the first post. 😄

Scott

orientechnologies / orientdb-labs

[OEP_14] Namespaces #14
