Relations and referential integrity in NoSQL databases

bajtos commented 5 years ago

Our current implementation of model relations (has-many, has-one, belongs-to) is deeply rooted in SQL and based on the assumption that the database take care of referential integrity for us.

Example 1: "Customer has many Order instances" and "Order belongs to Customer". When creating a new Order instance, we expect the database to verify that Order.customerId is matching the id value of an existing Customer record. We don't have any reliable & atomic way to do this check at LoopBack side.

Example 2: "Customer has one Credentials instance". When creating a new Credentials instance, we expect the database to verify that there are no other Credentials instances already created for the user. We don't have any reliable & atomic way to do this check at LoopBack side.

SQL databases provide FOREIGN KEY and UNIQUE constraints that work great for this flavor of relations.

The situation becomes more tricky when we try to map this approach to NoSQL databases. Many NoSQL databases do not provide FOREIGN KEY and UNIQUE constraints, this is often a constraint caused by CAP theorem.

For example, it's not possible to enforce UNIQUE constraint when the model data is stored in multiple physical machines and a network partition occurs (a node performing a write operation is not able to reach other nodes because of networking problems, and thus is cannot verify that the new value is not violating uniqueness constraint for records stored on those nodes).

I think we should rethink the way how we are modelling relations and offer different flavors optimized for different backends.

For example, instead of storing a foreign key in the target model, we can store id of related model(s) in the source model and use optimistic locking scheme to enforce the constraints

class Customer {
  // hasMany relation
  orderIds: string[];

  // hasOne relation
  credentialsId: string;
}

// Algorithm for creating a new Order

// 1. Check that the customer exists
const customer = await customerRepo.findById(customerId);

// 2. Create the order
const order = await orderRepo.create(orderData);
try {
  // 3. Add the new order to the customer
  customer.orderIds.push(order.id);
  await customerRepo.replace(customer);
} catch (err) {
  if (/* a conflict occurred, e.g. somebody deleted the Customer */) {
    // 4. Roll back on conflict
    await orderRepo.deleteById(order.id);
  }
  throw err;
}

We can even store the related models as embedded documents, this should work great for Document databases.

class Customer {
  // hasMany relation
  orders: Order[];

  // hasOne relation
  credentials: Credentials;
}

Related issues & discussions:

lb4 relations referential integrity for hasMany/belongsTo #1718

@strongloop/loopback-next @strongloop/loopback-maintainers thoughts?

b-admike commented 5 years ago

@bajtos thank you for starting the discussion around NoSQL backend support for relations; I get the gist of your proposal and since we're not using the LB3 relation engine, it's logical to introduce different flavours of relations for NoSQL DBs.

b-admike commented 5 years ago

I do like to add though that the bulk of the issues that arose with MongoDB and relations in LB4 have to do with the strictObjectIdCoercion flag and how the connector treats ID values. Maybe it's worth it to test it out with cloudant and see the behaviour there as well.

raymondfeng commented 5 years ago

For example, instead of storing a foreign key in the target model, we can store id of related model(s) in the source model and use optimistic locking scheme to enforce the constraints

The referencesMany and referencesOne relations store FKs with the source model. For example:

Customer
  - publicProfileId ( a customer has a public profile)

Customer
  - emailIds ( a customer has multiple emails)

bajtos commented 5 years ago

See my https://github.com/strongloop/loopback-next/issues/1718#issuecomment-446906249. I think we need to look at the bigger picture first.

bajtos commented 5 years ago

Based on @raymondfeng's comment above and the discussion we had elsewhere, I am proposing the following:

Treat 1-N relation as a concept that requires different implementation (schema design) depending on the database used. Explain this in our documentation for relations.
Make it very clear in our documentation that hasMany and belongsTo works best with SQL databases that are able to enforce referential integrity. When these relations are used with NoSQL, referential integrity is not enforced and we end up with a "weak relation"
In the docs for hasMany and belongsTo, explain NoSQL users that hasMany/belongsTo relations are not suited for NoSQL databases and that we are working on a better solution.
Create a new section (or even a new page) in our docs for relations and explain users what relation type to choose depending on which database they use.

For example: if you are building a 1-N relation, if you have SQL - use HasMany, if you have Cloudant - use ReferencesMany, if you have MongoDB - use embedsMany, etc.

We should verify that our recommended solution is actually a good one that works well for the target database. This will most likely require research on our side, as we don't have deep knowledge of different NoSQL databases.
Create a new spike user story for each NoSQL database we have an official connector for. In this spike, we will research what tools are offered for ensuring referential integrity, come up with a proposal on how LoopBack applications should implement 1-N (hasMany), N-1 (belongsTo) and 1-1 (hasOne/belongsTo) relations. For example, we can recommend referencesMany or embedsMany, but also find out that a completely new relation type is needed.
Based on the outcome of these spikes, create user stories to implement the missing relation types identified as needed for the databases we support. These stories should cover both the implementation and documentation updates. For many (if not all) relation types, a spike story to figure out implementation details may be needed first.

I am going to convert this GH issue into an Epic.

bajtos commented 5 years ago

Converting back into issue in favor of a newly created epic #2331

raymondfeng commented 5 years ago

Based on @raymondfeng's comment above and the discussion we had elsewhere, I am proposing the following:

Great summary. +1.

bajtos commented 5 years ago

Treat 1-N relation as a concept that requires different implementation (schema design) depending on the database used. Explain this in our documentation for relations.

Make it very clear in our documentation that hasMany and belongsTo works best with SQL databases that are able to enforce referential integrity. When these relations are used with NoSQL, referential integrity is not enforced and we end up with a "weak relation"

In the docs for hasMany and belongsTo, explain NoSQL users that hasMany/belongsTo relations are not suited for NoSQL databases and that we are working on a better solution.

Moved these steps into a new issue: Document hasMany/belongsTo/hasOne limitations for NoSQL databases #2340

Create a new section (or even a new page) in our docs for relations and explain users what relation type to choose depending on which database they use.

See #2341

Create a new spike user story for each NoSQL database we have an official connector for. In this spike, we will research what tools are offered for ensuring referential integrity, come up with a proposal on how LoopBack applications should implement 1-N (hasMany), N-1 (belongsTo) and 1-1 (hasOne/belongsTo) relations. For example, we can recommend referencesMany or embedsMany, but also find out that a completely new relation type is needed.

Created the following spike stories:

Spike: Strong relations in MongoDB #2342
Spike: Strong relations in Cloudant & CouchDB #2343
Spike: Strong relations in Cassandra #2344

I feel the discussion is over and since we have a list of small actionable tasks to follow up, I am closing this GH issue as done.

loopbackio / loopback-next

Relations and referential integrity in NoSQL databases #2127